What is the regex to extract all the emojis from a string? (Java)
Original question: http://stackoverflow.com/questions/24840667/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by vishalaksh
I have a String encoded in UTF-8. For example:
Thats a nice joke 😆😆😆 😛
I have to extract all the emojis present in the sentence, and the emoji could be any emoji.
When this sentence is viewed in a terminal using the command less text.txt, it is displayed as:
Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B>
These are the Unicode code points of the emoji. All the codes for emojis can be found at emojitracker.
To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work for the UTF-8 encoded string.
Following is my code:
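import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Thats a nice joke 😆😆😆 😛";
// Backslashes have to be doubled inside a Java string literal
Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
Matcher matcher = pattern.matcher(s);
List<String> matchList = new ArrayList<String>();
while (matcher.find()) {
    matchList.add(matcher.group());
}
for (int i = 0; i < matchList.size(); i++) {
    System.out.println(matchList.get(i));
}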
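One thing worth checking at this point: `<U+1F606>` is how `less` displays a character it cannot render, not text stored in the file. The sketch below assumes the string really holds the code points U+1F606 and U+1F61B (the class and method names are made up for illustration) and shows the pattern matching only the displayed notation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LessNotationCheck {
    // The pattern from the question, with backslashes doubled for a Java string literal.
    static final Pattern NOTATION = Pattern.compile("(<U\\+\\w+?>)");

    // Counts how many times the pattern matches in the given text.
    static int countMatches(String text) {
        Matcher m = NOTATION.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The actual emoji (U+1F606 and U+1F61B), written as surrogate-pair escapes.
        String real = "Thats a nice joke \uD83D\uDE06 \uD83D\uDE1B";
        // The text that less prints in place of those characters.
        String shown = "Thats a nice joke <U+1F606> <U+1F61B>";
        System.out.println(countMatches(real));   // prints 0
        System.out.println(countMatches(shown));  // prints 2
    }
}
```

So a pattern for this data has to describe the characters themselves, not the notation the terminal shows.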
This PDF says: Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So I want to capture any character lying within this range.
Accepted answer by T.J. Crowder
"The pdf that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. Now what to do?"
Okay, but I will just note that the emoji in your question are outside that range! :-)
The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)
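A small sketch of what "above 0xFFFF" means in practice (the class and method names are illustrative, not part of the answer): one emoji occupies two `char` values in a Java string, so `length()` and the code point count disagree.

```java
class SupplementaryDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F606 built from its code point, so the source file needs no special encoding.
        String emoji = new String(Character.toChars(0x1F606));
        System.out.println(charCount(emoji));                            // prints 2
        System.out.println(codePointCount(emoji));                       // prints 1
        System.out.println(Character.isSupplementaryCodePoint(0x1F606)); // prints true
    }
}
```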
U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, so we cross at least one boundary. So we have to know what ranges of surrogate pairs we're looking for.
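The pairs quoted above can be reproduced with the standard `Character` helpers `highSurrogate` and `lowSurrogate` (available since Java 7). A minimal sketch; the class and method names are illustrative:

```java
class SurrogateDemo {
    // Formats the UTF-16 surrogate pair of a supplementary code point as two hex values.
    static String surrogates(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("%04X %04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        System.out.println(surrogates(0x1F300)); // prints D83C DF00
        System.out.println(surrogates(0x1F5FF)); // prints D83D DDFF
    }
}
```

The high surrogate changes from D83C to D83D partway through the block, which is the boundary crossing mentioned above.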
Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end; I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).
So armed with that knowledge, in theory we could now write a pattern:
// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.
But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^
We can't just split up surrogate pairs like that; they're called surrogate pairs for a reason. :-)
Consequently, I don't think we can use regular expressions (or indeed, any string-based approach) for this at all. I think we have to search through char arrays.
char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for them the hard way:
String s = new StringBuilder()
    .append("Thats a nice joke ")
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .appendCodePoint(0x1F606)
    .append(" ")
    .appendCodePoint(0x1F61B)
    .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;
index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}
Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
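The same scan can also be written over code points instead of `char` pairs, which avoids working out surrogate ranges by hand. This is a sketch rather than the answer's own code: it assumes Java 8 for `String.codePoints()`, and the class name, method name, and range parameters are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointScan {
    // Returns every code point of s that lies in [low, high], each as its own String.
    static List<String> extract(String s, int low, int high) {
        List<String> found = new ArrayList<>();
        s.codePoints()
         .filter(cp -> cp >= low && cp <= high)
         .forEach(cp -> found.add(new String(Character.toChars(cp))));
        return found;
    }

    public static void main(String[] args) {
        String s = new StringBuilder()
            .append("Thats a nice joke ")
            .appendCodePoint(0x1F606)
            .appendCodePoint(0x1F606)
            .appendCodePoint(0x1F606)
            .append(" ")
            .appendCodePoint(0x1F61B)
            .toString();
        // Miscellaneous Symbols and Pictographs: nothing in the sample is in this block.
        System.out.println(extract(s, 0x1F300, 0x1F5FF).size()); // prints 0
        // Emoticons block (U+1F600 to U+1F64F): all four emoji are found.
        System.out.println(extract(s, 0x1F600, 0x1F64F).size()); // prints 4
    }
}
```

Which ranges to pass in is still a separate question; the two used here are only the block from the question and the Emoticons block that the sample emoji belong to.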
Source of my program to find out what the surrogate ranges were:
public class FindRanges {
    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}
Output:
\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF
Answered by Mr.C
Assuming that you are asking for standard Unicode emoji ranges (there are different blocks by vendor), you may consider these three ranges:
- 0x20a0 - 0x32ff
- 0x1f000 - 0x1ffff
- 0xfe4e5 - 0xfe4ee
Besides all the thoughtful explanation that T.J. Crowder has shared with us, it needs to be said that, beginning with Java 7, it is possible to match UTF-16 encoded surrogate pairs with ease.
Take a look at the docs:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
A Unicode character can also be represented in a regular-expression by using its Hex notation (hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
Nevertheless, if you cannot switch to Java 7, you can extend the valuable UnicodeEscaper provided by Guava.
Here is an implementation for the sake of example:
import com.google.common.escape.UnicodeEscaper;

public class SimpleEscaper extends UnicodeEscaper
{
    @Override
    protected char[] escape(int codePoint)
    {
        // Code points in 0x1f000..0x1ffff are replaced by their hex value
        if (0x1f000 <= codePoint && codePoint <= 0x1ffff)
        {
            return Integer.toHexString(codePoint).toCharArray();
        }
        return Character.toChars(codePoint);
    }
}
Answered by Karan Ashar
Had a similar problem. The following served me well and matches surrogate pairs:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitByUnicode {
    public static void main(String[] argv) throws Exception {
        String string = "Thats a nice joke 😆😆😆 😛";
        System.out.println("Original String:"+string);
        String regexPattern = "[\uD83C-\uDBFF\uDC00-\uDFFF]+";
        byte[] utf8 = string.getBytes("UTF-8");
        String string1 = new String(utf8, "UTF-8");
        Pattern pattern = Pattern.compile(regexPattern);
        Matcher matcher = pattern.matcher(string1);
        List<String> matchList = new ArrayList<String>();
        while (matcher.find()) {
            matchList.add(matcher.group());
        }
        for(int i=0;i<matchList.size();i++){
            System.out.println(i+":"+matchList.get(i));
        }
    }
}
Output is:
Original String:Thats a nice joke 😆😆😆 😛
0:😆😆😆
1:😛
Found the regex from https://stackoverflow.com/a/24071599/915972
Answered by Shi Xiangyang
You can do it like this:
String s = "Thats a nice joke 😆😆😆 😛";
Pattern pattern = Pattern.compile("[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]",
        Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(s);
List<String> matchList = new ArrayList<String>();
while (matcher.find()) {
    matchList.add(matcher.group());
}
for (int i = 0; i < matchList.size(); i++) {
    System.out.println(matchList.get(i));
}
Answered by Mike
This worked for me in Java 8:
public static String mysqlSafe(String input) {
    if (input == null) return null;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
        if (i < (input.length() - 1)) { // Emojis are two characters long in java, e.g. a rocket emoji is "\uD83D\uDE80";
            if (Character.isSurrogatePair(input.charAt(i), input.charAt(i + 1))) {
                i += 1; //also skip the second character of the emoji
                continue;
            }
        }
        sb.append(input.charAt(i));
    }
    return sb.toString();
}
Answered by gidim
Using emoji-java, I've written a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library but is easier to maintain than those monster regexes.
Use:
String input = "A string with a \uD83D\uDC66\uD83C\uDFFFfew emojis!";
String result = EmojiParser.removeAllEmojis(input);
emoji-java maven installation:
<dependency>
    <groupId>com.vdurmont</groupId>
    <artifactId>emoji-java</artifactId>
    <version>3.1.3</version>
</dependency>
gradle:
compile 'com.vdurmont:emoji-java:3.1.3'
EDIT: previously submitted answer was pulled into emoji-java source code.
An emoji4j snippet also appears on the page (its answer text and the emoji in its sample string were not preserved):
String emojiText = "A , and a became friends. For 's birthday party, they all had s, s, s and .";
EmojiUtils.removeAllEmojis(emojiText);//returns "A , and a became friends. For 's birthday party, they all had s, s, s and ."
Answered by Eric Nakagawa - Parse Dev Adv
The best regex for extracting ALL emoji is this:
(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c[\ude32-\ude3a]|[\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])
It identifies many single-char emoji that the other answers do not account for. For more information about how this regex works, take a look at this post: https://medium.com/@thekevinscott/emojis-in-javascript-f693d0eb79fb#.enomgcu63
Answered by Andrew Moreau
This is what I use to remove emojis, and so far it has been shown to allow all other alphabets.
import android.os.Build;
import java.util.ArrayList;

private static String remove_Emojis(String name)
{
    //we will store all the letters in this array
    ArrayList<Character> nonEmoji = new ArrayList<>();
    // and when we rebuild the name we will put it in here
    String newName = "";
    // we are going to loop through checking each character to see if its an emoji or not
    for (int i = 0; i < name.length(); i++)
    {
        if (Character.isLetterOrDigit(name.charAt(i)))
        {
            nonEmoji.add(name.charAt(i));
        }
        else
        {
            // this is just a 2nd check in case the other method didn't allow some letter
            if (Build.VERSION.SDK_INT > 18)
            {
                if (Character.isAlphabetic(name.charAt(i)))
                {
                    nonEmoji.add(name.charAt(i));
                }
            }
        }
        if (name.charAt(i) == ' ')// may want to consider adding or '-' or '\''
        {
            nonEmoji.add(name.charAt(i));// just add it
        }
        if (name.charAt(i) == '@' && !name.contains(" "))// I put this in for email addresses
        {
            nonEmoji.add('@');
        }
    }
    // finally just loop through building it back out
    for (int i = 0; i < nonEmoji.size(); i++) {
        newName += nonEmoji.get(i);
    }
    return newName;
}
Answered by Vensent Wang
There are two ways to solve this sticky problem.
The first one is using third-party libs like emoji-java and emoji4j, which are mentioned above. You can easily use methods such as containsEmoji or removesEmoji. In your own apps, you then need to keep these libs up to date.
As for me, I wanted a simple solution to this problem.
After a whole day of searching, I found a magic regex:
"(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)"
which I have tested OK in Java. It perfectly solved my problem.
You can view this on the GitHub page:
https://github.com/zly394/EmojiRegex
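For comparison with the regex above, a much shorter pattern can be written with code point escapes (`\x{...}`, supported by `java.util.regex` since Java 7) instead of surrogate pairs. This is only a sketch: the three ranges are an assumption chosen for illustration and cover far fewer emoji than the regex above, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiRegexSketch {
    // Illustrative ranges only: Miscellaneous Symbols and Pictographs,
    // Emoticons, and Miscellaneous Symbols. Not a complete emoji definition.
    static final Pattern EMOJI = Pattern.compile(
        "[\\x{1F300}-\\x{1F5FF}\\x{1F600}-\\x{1F64F}\\x{2600}-\\x{26FF}]");

    // Returns each matched character as its own String, in order of appearance.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The sample sentence from the question, with the emoji as surrogate-pair escapes.
        String s = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        System.out.println(extract(s).size()); // prints 4
    }
}
```

Each match is a whole code point, so `m.group()` returns both `char` values of the surrogate pair together.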
Notes:
The answer provided by @Eric Nakagawa contains some errors, so it does not work as written.