Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/49510006/

Time: 2020-08-12 03:02:21  Source: igfitidea

Remove ?, , ? , ? and other such emojis/images/signs from Java strings

Tags: java, string, emoji

Asked by riorio

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for example:

▓ railway??
→ Cats and dogs
I'm on 
Apples ? 
✅ Vi sign
? I'm the king ? 
Corée ? du Nord ?  (French)
 gjør at både ?╗ (Norwegian)
Star me ★
Star ? once more
早上好 ? (Chinese)
Καλημέρα ? (Greek)
another ? sign ?
добрай раніцы ? (Belarus)
? शुभ प्रभात ? (Hindi)
? ? ? ? Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.?

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser is not able to remove the majority of the signs. The ? sign is the only one I found till now that it removed. Other signs such as ? ? ★ ? ? ? ? ? ? ? ? are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?

Accepted answer by Nick Bull

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");

So:

  • [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a range representing all numeric (\\p{N}), letter (\\p{L}), mark (\\p{M}), punctuation (\\p{P}), whitespace/separator (\\p{Z}), other formatting (\\p{Cf}), and surrogate characters, which encode characters above U+FFFF in Unicode (\\p{Cs}), as well as whitespace such as newlines (\\s). \\p{L} specifically includes the characters from other alphabets such as Cyrillic, Latin, Kanji, etc.
  • The ^ in the regex character set negates the match.
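
The whitelist idea above can be sketched as a small reusable helper. This is only an illustration: the class name WhitelistFilter and the method strip are made up here, and the pattern is the one from this answer, precompiled so it can be applied to many strings.

```java
import java.util.regex.Pattern;

public class WhitelistFilter {
    // Matches any code point that is NOT in one of the whitelisted categories.
    // In Java source, every regex backslash has to be doubled.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: removes everything outside the whitelist.
    public static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+2605 is BLACK STAR; the surrogate pair below encodes U+1F525 (FIRE).
        System.out.println(strip("Star me \u2605"));
        System.out.println(strip("I'm on \uD83D\uDD25"));
        // Greek letters are category L and survive; U+2705 is removed.
        System.out.println(strip("\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1 \u2705"));
    }
}
```

Because symbol categories such as \\p{S} are simply not on the whitelist, arrows, stars and pictographs are dropped without having to list them one by one.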

Example:

String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
// Output:
//   "hello world _# 皆さん、こんにちは! 私はジョンと申します。"

If you need more information, check out the Java documentation for regexes.

Answer by Karol Dowbecki

Based on the Full Emoji List, v11.0, you have 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.

Having the full list of emojis, you need to filter them out using code points. Iterating over single char or byte values won't work, as a single code point can span multiple bytes. Because Java uses UTF-16, emojis will usually take two chars.

String input = "ab✅cd";
for (int i = 0; i < input.length();) {
  int cp = input.codePointAt(i);
  // filter out if matches
  i += Character.charCount(cp); 
}

Mapping from Unicode code point U+2705 to a Java int is straightforward:

int viSign = 0x2705;

or since Java supports Unicode Strings:

int viSign = "✅".codePointAt(0);
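
Putting the pieces of this answer together, a minimal sketch could look like the following. The class name CodePointBlacklist, the method removeBlocked and the two-entry BLOCKED set are assumptions for illustration only; a real filter would load the full emoji list instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CodePointBlacklist {
    // Assumption: a deliberately tiny stand-in for the full emoji list.
    // 0x2705 is WHITE HEAVY CHECK MARK, 0x1F525 is FIRE.
    private static final Set<Integer> BLOCKED =
            new HashSet<>(Arrays.asList(0x2705, 0x1F525));

    // Hypothetical helper: copies every code point that is not blocked.
    public static String removeBlocked(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!BLOCKED.contains(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fire = "\uD83D\uDD25"; // one code point stored as two chars
        System.out.println(fire.length());
        System.out.println(fire.codePointCount(0, fire.length()));
        System.out.println(removeBlocked("ab\u2705cd"));
    }
}
```

Stepping with Character.charCount is what keeps a surrogate pair from being split in the middle, which is the problem with iterating over single char values.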

Answer by Daniel Wagner

I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls "the general category" of each character. There are a couple of letter and punctuation categories.

You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:

COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
PARAGRAPH_SEPARATOR
SPACE_SEPARATOR
START_PUNCTUATION
TITLECASE_LETTER
UPPERCASE_LETTER

(All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
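
As a rough sketch of how this category whitelist might be applied in Java: the class name CategoryWhitelist and the methods keep and filter are invented for this example, and the case labels are exactly the categories listed above.

```java
public class CategoryWhitelist {
    // True if the code point's general category is on the whitelist.
    public static boolean keep(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.COMBINING_SPACING_MARK:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.CURRENCY_SYMBOL:
            case Character.DASH_PUNCTUATION:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.ENCLOSING_MARK:
            case Character.END_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.FORMAT:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.LETTER_NUMBER:
            case Character.LINE_SEPARATOR:
            case Character.LOWERCASE_LETTER:
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_LETTER:
            case Character.MODIFIER_SYMBOL:
            case Character.NON_SPACING_MARK:
            case Character.OTHER_LETTER:
            case Character.OTHER_NUMBER:
            case Character.OTHER_PUNCTUATION:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.SPACE_SEPARATOR:
            case Character.START_PUNCTUATION:
            case Character.TITLECASE_LETTER:
            case Character.UPPERCASE_LETTER:
                return true;
            default:
                // OTHER_SYMBOL, CONTROL, PRIVATE_USE, SURROGATE, UNASSIGNED
                return false;
        }
    }

    // Hypothetical helper: keeps only whitelisted code points.
    public static String filter(String input) {
        int[] kept = input.codePoints().filter(CategoryWhitelist::keep).toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        System.out.println(filter("Star me \u2605"));
        System.out.println(filter("1 + 1 = 2 \u2705"));
    }
}
```

Note that CONTROL is not on the list, so tabs and newlines would be dropped too; add it to the switch if line breaks need to survive.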

Answer by Marcos Zolnowski

I gave some examples below, and thought that Latin is enough, but...

Is there a way to remove all these signs from the input string and keeping only the letters & punctuation in the different languages?

After editing, I developed a new solution using the Character.getType method, and that appears to be the best shot at this.

package zmarcos.emoji;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TestEmoji {

    public static void main(String[] args) {
        String[] arr = {"Remove ?, , ? , ? and other such signs from Java string",
            "→ Cats and dogs",
            "I'm on ",
            "Apples ? ",
            "? Vi sign",
            "? I'm the king ? ",
            "Star me ★",
            "Star ? once more",
            "早上好 ?",
            "Καλημ?ρα ?"};
        System.out.println("---only letters and spaces alike---\n");
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> Character.isLetter(cp) || Character.isWhitespace(cp)).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        System.out.println("\n---unicode blocks white---\n");
        Set<Character.UnicodeBlock> whiteList = new HashSet<>();
        whiteList.add(Character.UnicodeBlock.BASIC_LATIN);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> whiteList.contains(Character.UnicodeBlock.of(cp))).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        System.out.println("\n---unicode blocks black---\n");
        Set<Character.UnicodeBlock> blackList = new HashSet<>();        
        blackList.add(Character.UnicodeBlock.EMOTICONS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_TECHNICAL);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_ARROWS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS);
        blackList.add(Character.UnicodeBlock.ALCHEMICAL_SYMBOLS);
        blackList.add(Character.UnicodeBlock.TRANSPORT_AND_MAP_SYMBOLS);
        blackList.add(Character.UnicodeBlock.GEOMETRIC_SHAPES);
        blackList.add(Character.UnicodeBlock.DINGBATS);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> !blackList.contains(Character.UnicodeBlock.of(cp))).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }
        System.out.println("\n---category---\n");
        int[] category = {Character.COMBINING_SPACING_MARK, Character.CONNECTOR_PUNCTUATION, /*Character.CONTROL,*/ Character.CURRENCY_SYMBOL,
            Character.DASH_PUNCTUATION, Character.DECIMAL_DIGIT_NUMBER, Character.ENCLOSING_MARK, Character.END_PUNCTUATION, Character.FINAL_QUOTE_PUNCTUATION,
            /*Character.FORMAT,*/ Character.INITIAL_QUOTE_PUNCTUATION, Character.LETTER_NUMBER, Character.LINE_SEPARATOR, Character.LOWERCASE_LETTER,
            /*Character.MATH_SYMBOL,*/ Character.MODIFIER_LETTER, /*Character.MODIFIER_SYMBOL,*/ Character.NON_SPACING_MARK, Character.OTHER_LETTER, Character.OTHER_NUMBER,
            Character.OTHER_PUNCTUATION, /*Character.OTHER_SYMBOL,*/ Character.PARAGRAPH_SEPARATOR, /*Character.PRIVATE_USE,*/
            Character.SPACE_SEPARATOR, Character.START_PUNCTUATION, /*Character.SURROGATE,*/ Character.TITLECASE_LETTER, /*Character.UNASSIGNED,*/ Character.UPPERCASE_LETTER};
        Arrays.sort(category);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> Arrays.binarySearch(category, Character.getType(cp)) >= 0).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }
    }

}

Output:

---only letters and spaces alike---

Remove ?, , ? , ? and other such signs from Java string
Remove      and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
Im on 
Apples ? 
Apples  
? Vi sign
 Vi sign
? I'm the king ? 
 Im the king  
Star me ★
Star me 
Star ? once more
Star  once more
早上好 ?
早上好 
Καλημ?ρα ?
Καλημ?ρα 

---unicode blocks white---

Remove ?, , ? , ? and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
I'm on 
Apples ? 
Apples  
? Vi sign
 Vi sign
? I'm the king ? 
 I'm the king  
Star me ★
Star me 
Star ? once more
Star  once more
早上好 ?

Καλημ?ρα ?


---unicode blocks black---

Remove ?, , ? , ? and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
→ Cats and dogs
I'm on 
I'm on 
Apples ? 
Apples  
? Vi sign
 Vi sign
? I'm the king ? 
 I'm the king  
Star me ★
Star me 
Star ? once more
Star  once more
早上好 ?
早上好 
Καλημ?ρα ?
Καλημ?ρα 

---category---

Remove ?, , ? , ? and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
I'm on 
Apples ? 
Apples  
? Vi sign
 Vi sign
? I'm the king ? 
 I'm the king  
Star me ★
Star me 
Star ? once more
Star  once more
早上好 ?
早上好 
Καλημ?ρα ?
Καλημ?ρα 

The code works by streaming the String as code points. It then uses lambdas to filter the characters into an int array, and finally converts the array back to a String.
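
Reduced to its core, that stream, filter and rebuild pipeline might look like this. The class name CodePointStream and the method keepLettersAndSpaces are invented for illustration; the predicate is the one from the first test in the code above.

```java
public class CodePointStream {
    // Hypothetical helper showing the three steps in isolation.
    public static String keepLettersAndSpaces(String input) {
        int[] kept = input.codePoints()                       // 1. stream of code points
                .filter(cp -> Character.isLetter(cp)
                        || Character.isWhitespace(cp))        // 2. lambda decides what stays
                .toArray();                                   // 3. collect into an int array
        // String has a constructor that takes code points directly.
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        // The apostrophe is punctuation, so this predicate drops it as well.
        System.out.println(keepLettersAndSpaces("I'm on \uD83D\uDD25"));
    }
}
```

Swapping the lambda for a different predicate is all it takes to try another filtering strategy.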

The "only letters and spaces alike" filter uses the Character methods isLetter and isWhitespace, so it does not cope well with punctuation. Failed attempt.

The "unicode blocks white" filter keeps only code points from the Unicode blocks the programmer specifies as allowed. Failed attempt.

The "unicode blocks black" filter drops code points from the Unicode blocks the programmer specifies as not allowed. Failed attempt.

The "category" filter uses the static method Character.getType. The programmer can define in the category array which types are allowed. WORKS.
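
One way to see why the block blacklist lets the arrow in "→ Cats and dogs" through is to ask Java which block each code point belongs to. The class name BlockCheck and the method blockOf are made up for this sketch.

```java
public class BlockCheck {
    // Thin wrapper around the standard lookup, just for readability.
    public static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // U+2192 RIGHTWARDS ARROW lives in the ARROWS block,
        // which is not in the blackList set used above.
        System.out.println(blockOf(0x2192));
        // U+2605 BLACK STAR is in MISCELLANEOUS_SYMBOLS (blacklisted).
        System.out.println(blockOf(0x2605));
        // U+1F525 FIRE is in MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS (blacklisted).
        System.out.println(blockOf(0x1F525));
    }
}
```

A block blacklist only works if every block that may contain unwanted symbols is listed, which is hard to guarantee as new blocks are added.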

Answer by Daniel F

ICU4J is your friend.

UCharacter.hasBinaryProperty(codePoint, UProperty.EMOJI); // codePoint: the int code point to test

Remember to keep your version of icu4j up to date and note this will only filter out official Unicode emoji, not symbol characters. Combine with filtering out other character types as desired.

More information: http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#EMOJI

Answer by Atwood Mandelbrot-Spolsky

Use a jQuery plugin called RM-Emoji. Here's how it works:

$('#text').remove('emoji').fast()

This is the fast mode, which may miss some emojis because it uses heuristic algorithms to find emojis in text. Use the .full() method to scan the entire string and remove all emojis, guaranteed.

Answer by liheyuan

Try this project simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Usage is simple:

EmojiUtils.removeEmoji(str)