Java: What is the regular expression to extract all the emojis from a string?

Question

Asked by vishalaksh

I have a String encoded in UTF-8. For example:

Thats a nice joke 😆😆😆 😛

I have to extract all the emojis present in the sentence, and they could be any emoji.

When this sentence is viewed in a terminal using the command less text.txt, it is displayed as:

Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B>

This is the corresponding Unicode code point for each emoji. All the codes for emojis can be found at emojitracker.

To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work for the UTF-8 encoded string.

Following is my code:

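    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "Thats a nice joke 😆😆😆 😛";
    // Looks for the literal text "<U+1F606>" that less displays, not for the emoji themselves
    Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for (int i = 0; i < matchList.size(); i++) {
        System.out.println(matchList.get(i));
    }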

This PDF says "Range: 1F300–1F5FF" for Miscellaneous Symbols and Pictographs. So I want to capture any character lying within this range.

Answer 1

Accepted answer by T.J. Crowder

"the pdf that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. Now what to do?"

Okay, but I will just note that the emoji in your question are outside that range! :-)

The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)

U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, so we cross at least one boundary. So we have to know what ranges of surrogate pairs we're looking for.
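
These pairs can be double-checked with the JDK itself. The following is a minimal sketch using only the standard library; the class and method names (SurrogateDemo, toUtf16Escapes) are made up for illustration. Character.toChars returns the UTF-16 encoding of a code point, and each char is then printed as a hex escape.

```java
public class SurrogateDemo {

    /** Formats the UTF-16 encoding of a code point as backslash-u hex escapes. */
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars yields one char for BMP code points and two for supplementary ones
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Escapes(0x1F300)); // start of the block
        System.out.println(toUtf16Escapes(0x1F5FF)); // end of the block
        System.out.println(toUtf16Escapes(0x1F606)); // an emoji from the question
    }
}
```

The first two calls print the pairs quoted above. The third prints \uD83D\uDE06, so the second char of that emoji lies just above \uDDFF.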

Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end; I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).

So armed with that knowledge, in theory we could now write a pattern:

// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");

That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.

But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:

Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^

We can't just split up surrogate pairs like that; they're called surrogate pairs for a reason. :-)

Consequently, I don't think we can use regular expressions (or indeed, any string-based approach) for this at all. I think we have to search through char arrays.

char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for them the hard way:

String s = new StringBuilder()
                .append("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;

index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}

Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything, as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
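
The same scan can also be written in terms of code points rather than char pairs, which avoids the surrogate arithmetic. This is only a sketch with made-up names (CodePointScan, extract). The bound 0x1F6FF is the code point that encodes as \uD83D\uDEFF, that is, the widened upper bound mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointScan {

    /** Returns every code point of s that lies in [lo, hi], each as its own string. */
    static List<String> extract(String s, int lo, int hi) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads a whole surrogate pair when one starts at i
            if (cp >= lo && cp <= hi) {
                found.add(new String(Character.toChars(cp)));
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return found;
    }

    public static void main(String[] args) {
        String s = new StringBuilder("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(' ')
                .appendCodePoint(0x1F61B)
                .toString();
        // Original block bounds: the sample emoji fall outside, so nothing is found
        System.out.println(extract(s, 0x1F300, 0x1F5FF).size());
        // Widened bounds: all four emoji are found
        System.out.println(extract(s, 0x1F300, 0x1F6FF).size());
    }
}
```

As with the char-array version, the widened range has not been checked for non-emoji code points.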

Source of my program to find out what the surrogate ranges were:

public class FindRanges {

    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}

Output:

\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF

Answer 2

Answered by Mr.C

Assuming that you are asking for standard Unicode emoji ranges (there are different blocks by vendor) you may consider these three ranges:

0x20a0 - 0x32ff
0x1f000 - 0x1ffff
0xfe4e5 - 0xfe4ee

Besides the thoughtful explanation that T.J. Crowder has shared with us, it should be said that, beginning with Java 7, it is possible to match UTF-16 encoded surrogate pairs with ease.

Take a look at the docs:

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
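
As a rough illustration of that construct, the three ranges listed earlier can go into a single character class. This is a sketch only: the class name (HexNotationRegex) and the sample string are assumptions, and whether these ranges cover exactly the emoji you care about is not verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexNotationRegex {

    // Code point ranges written with the \x{...} construct (Java 7 and later)
    static final Pattern EMOJI = Pattern.compile(
            "[\\x{20a0}-\\x{32ff}\\x{1f000}-\\x{1ffff}\\x{fe4e5}-\\x{fe4ee}]");

    /** Collects every match; each match is one code point from the class. */
    static List<String> extract(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The sentence from the question, with the emoji written as surrogate escapes
        String s = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        for (String emoji : extract(s)) {
            System.out.println(emoji);
        }
    }
}
```

Because the class matches whole code points, each emoji comes back as a two-char string and no surrogate pair is split.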

Nevertheless, if you cannot switch to Java 7, you can extend the valuable UnicodeEscaper provided by Guava.

Here is an implementation, for the sake of example:

import com.google.common.escape.UnicodeEscaper;

public class SimpleEscaper extends UnicodeEscaper
{
    @Override
    protected char[] escape(int codePoint)
    {
        // Replace code points in 0x1f000-0x1ffff with their hex value
        if (codePoint >= 0x1f000 && codePoint <= 0x1ffff)
        {
            return Integer.toHexString(codePoint).toCharArray();
        }

        return Character.toChars(codePoint);
    }
}

Answer 3

Answered by Karan Ashar

I had a similar problem. The following served me well and matches surrogate pairs:

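import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitByUnicode {
    public static void main(String[] argv) throws Exception {
        String string = "Thats a nice joke 😆😆😆 😛";
        System.out.println("Original String:"+string);
        String regexPattern = "[\uD83C-\uDBFF\uDC00-\uDFFF]+";
        byte[] utf8 = string.getBytes("UTF-8");

        String string1 = new String(utf8, "UTF-8");

        Pattern pattern = Pattern.compile(regexPattern);
        Matcher matcher = pattern.matcher(string1);
        List<String> matchList = new ArrayList<String>();

        while (matcher.find()) {
            matchList.add(matcher.group());
        }

        for(int i=0;i<matchList.size();i++){
            System.out.println(i+":"+matchList.get(i));
        }
    }
}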

Output is:

Original String:Thats a nice joke 😆😆😆 😛
0:😆😆😆
1:😛

I found the regex at https://stackoverflow.com/a/24071599/915972

Answer 4

Answered by Shi Xiangyang

You can do it like this:

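    String s="Thats a nice joke 😆😆😆 😛";
    Pattern pattern = Pattern.compile("[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]",
                                      Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for(int i=0;i<matchList.size();i++){
        System.out.println(matchList.get(i));
    }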

Answer 5

Answered by Mike

This worked for me in Java 8:

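public static String mysqlSafe(String input) {
    if (input == null) return null;
    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < input.length(); i++) {
        if (i < (input.length() - 1)) { // Emojis are two characters long in java, e.g. a rocket emoji is "\uD83D\uDE80";
            if (Character.isSurrogatePair(input.charAt(i), input.charAt(i + 1))) {
                i += 1; // also skip the second character of the emoji
                continue;
            }
        }
        sb.append(input.charAt(i));
    }

    return sb.toString();
}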
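
Dropping every surrogate pair amounts to dropping every code point above U+FFFF. On Java 8 the same idea can be sketched with String.codePoints(); the class and method names here (StripSupplementary, strip) are made up for illustration. Emoji that live in the Basic Multilingual Plane, such as U+2600, are a single char and are left in place.

```java
public class StripSupplementary {

    /** Removes every supplementary code point (above U+FFFF) and keeps everything else. */
    static String strip(String input) {
        if (input == null) return null;
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !Character.isSupplementaryCodePoint(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The rocket emoji is the pair D83D DE80 and is removed
        System.out.println(strip("Lift off \uD83D\uDE80!"));
        // U+2600 is a single char in the BMP and is kept
        System.out.println(strip("Sunny \u2600 day"));
    }
}
```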

Answer 6

Answered by gidim

Using emoji-java, I've written a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library, but it is easier to maintain than those monster regexes.

Use:

String input = "A string with a \uD83D\uDC66\uD83C\uDFFFfew emojis!";
String result = EmojiParser.removeAllEmojis(input);

emoji-java Maven installation:

<dependency>
  <groupId>com.vdurmont</groupId>
  <artifactId>emoji-java</artifactId>
  <version>3.1.3</version>
</dependency>

Gradle:

compile 'com.vdurmont:emoji-java:3.1.3'

EDIT: the previously submitted answer was pulled into the emoji-java source code.

Answer 7

Answered by Chaitanya

You may also use the emoji4j library.

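String emojiText = "A ,  and a  became friends. For 's birthday party, they all had s, s, s and .";

EmojiUtils.removeAllEmojis(emojiText);//returns "A ,  and a  became friends. For 's birthday party, they all had s, s, s and ."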

Answer 8

Answered by Eric Nakagawa - Parse Dev Adv

The best regex for extracting ALL emoji is this:

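(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c[\ude32-\ude3a]|[\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])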

It identifies many single-char emoji that the other answers do not account for. For more information about how this regex works, take a look at this post. https://medium.com/@thekevinscott/emojis-in-javascript-f693d0eb79fb#.enomgcu63

Answer 9

Answered by Andrew Moreau

This is what I use to remove emojis, and so far it has been shown to allow all other alphabets.

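// Android code: Build is android.os.Build, ArrayList is java.util.ArrayList
private static String remove_Emojis(String name)
{
    // we will store all the letters in this array
    ArrayList<Character> nonEmoji = new ArrayList<>();

    // and when we rebuild the name we will put it in here
    String newName = "";

    // we are going to loop through checking each character to see if its an emoji or not
    for (int i = 0; i < name.length(); i++)
    {
        if (Character.isLetterOrDigit(name.charAt(i)))
        {
            nonEmoji.add(name.charAt(i));
        }
        else
        {
            // this is just a 2nd check in case the other method didn't allow some letter
            if (Build.VERSION.SDK_INT > 18)
            {
                if (Character.isAlphabetic(name.charAt(i)))
                {
                    nonEmoji.add(name.charAt(i));
                }
            }
        }

        if (name.charAt(i) == ' ') // may want to consider adding or '-' or '\''
        {
            nonEmoji.add(' '); // just add it
        }

        if (name.charAt(i) == '@' && !name.contains(" ")) // I put this in for email addresses
        {
            nonEmoji.add('@');
        }
    }

    // finally just loop through building it back out
    for (int i = 0; i < nonEmoji.size(); i++) {
        newName += nonEmoji.get(i);
    }

    return newName;
}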

Answer 10

Answered by Vensent Wang

There are two ways to solve this sticky problem.

The first one is to use third-party libraries like emoji-java and emoji4j, both mentioned above. You can easily use methods such as containsEmoji or removesEmoji. In your own apps, you then need to keep these libraries up to date.

As for me, I wanted a simple solution to this problem.

After a whole day of searching, I've found a magic regex:

"(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)"

which I have tested successfully in Java. It perfectly solved my problem.

You can view this on the GitHub page:

https://github.com/zly394/EmojiRegex

Notes:

The regex provided by @Eric Nakagawa contains some errors and does not work properly.
