How to extract all the emojis from text in Python?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/43146528/

Time: 2020-08-19 22:34:36  Source: igfitidea

How to extract all the emojis from text?

Tags: python, python-3.x, emoji

Asked by tumbleweed

Consider the following list:

a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']

How can I extract all the emojis inside a_list into a new list?

new_lis = ['🤔 🙈 😌 💕 👭 👙']

I tried to use a regex, but I do not have all the possible emoji encodings.

Answered by Pedro Castilho

You can use the emoji library. You can check whether a single codepoint is an emoji codepoint by checking whether it is contained in emoji.UNICODE_EMOJI.

import emoji

def extract_emojis(s):
  return ''.join(c for c in s if c in emoji.UNICODE_EMOJI)

Answered by sheldonzy

I think it's important to point out that the previous answers won't work with composite emojis, that is, a single glyph built from several codepoints: such an emoji can consist of 4 emojis, and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin-color modifier.
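
To see why a per-codepoint check splits such emojis, it helps to count codepoints. The sketch below is only an illustration: the family sequence and the skin-tone example are assumptions chosen for the demo (they are not taken from the answer), and the helper name components is hypothetical. Both emojis are built from escapes so the structure stays visible.

```python
# A family emoji is a ZWJ sequence: 4 person emojis joined by U+200D.
ZWJ = "\u200d"
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F466", "\U0001F466"])

# A skin-tone emoji is a base emoji followed by a modifier (U+1F3FB..U+1F3FF).
waving_hand_medium = "\U0001F44B\U0001F3FD"

def components(s):
    """Return the codepoints of s that are not zero-width joiners."""
    return [c for c in s if c != ZWJ]

print(len(family))              # 7 codepoints: 4 emojis + 3 joiners
print(len(components(family)))  # 4
print(len(waving_hand_medium))  # 2
```

Iterating over the string character by character therefore yields several separate hits for what is displayed as one emoji.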

My solution uses the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count such a composite emoji as one emoji.

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)

    return emoji_list

Testing (with more emojis with skin color):

line = ["  me así, se  ds  hello ? emoji hello ??? how are  you today"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

output:

      ? ???   

Edit:

If you want to include flags, the Unicode range to match is U+1F1E6 to U+1F1FF, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text) 

to the function above, and return emoji_list + flags.
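
Note that the pattern above returns each matching codepoint on its own, while a flag is displayed from two of them. A minimal sketch of matching them in pairs with the standard re module follows; the names FLAG_PAIR and extract_flags are hypothetical, and pairing adjacent codepoints left to right is an assumption about the input, not a full flag validator.

```python
import re

# Regional indicator symbols occupy U+1F1E6..U+1F1FF; a flag is two of them.
FLAG_PAIR = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

def extract_flags(text):
    """Return every pair of adjacent regional indicators found in text."""
    return FLAG_PAIR.findall(text)

s = "Hello \U0001f1f7\U0001f1fa hello"
print(extract_flags(s))  # one two-codepoint string
```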

See this post for more information about the flags.

Answered by Kasramvd

If you don't want to use an external library, you can simply use re.findall() with a suitable regular expression to find the emojis:

In [74]: import re
In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or comma.
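
A quick check shows what this class lets through. The sample string is an assumption made up for the demo; it mixes punctuation with one emoji to show that anything outside the three excluded groups is returned.

```python
import re

NOT_WORD_SPACE_COMMA = re.compile(r"[^\w\s,]")

sample = "hi! #tag \U0001F648"
print(NOT_WORD_SPACE_COMMA.findall(sample))
# ['!', '#', '🙈']
```

Punctuation such as ! and # is matched along with the emoji, so those characters have to be added to the class when they occur in the input.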

As I mentioned in a comment, text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the extra characters to the character class manually. Note that since a character class accepts ranges of characters, you can make it even shorter and more flexible.

Another solution is to replace the negated character class, which excludes the non-emoji characters, with a character class that accepts emojis ([] without ^). Since there are many emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, a good reference that lists the standard emojis with their respective ranges is http://apps.timwhitlock.info/emoji/tables/unicode
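
A minimal sketch of such a positive character class follows. The ranges are illustrative assumptions, not a complete emoji table, and the sample string is modeled on the question's list using explicit escapes; the name EMOJI_CLASS is hypothetical.

```python
import re

# Illustrative ranges only; extend them from a reference table as needed.
EMOJI_CLASS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)

a_list = ["\U0001F914 \U0001F648 me así, bla es se \U0001F60C ds "
          "\U0001F495\U0001F46D\U0001F459"]
print(EMOJI_CLASS.findall(a_list[0]))  # six single-codepoint matches
```

Unlike the negated class, this version ignores punctuation, but it only finds emojis that fall inside the listed ranges.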

Answered by user594836

The top-rated answer does not always work. For example, flag emojis will not be found. Consider the string:

s = u'Hello \U0001f1f7\U0001f1fa hello'

What would work better is:

import re
import emoji

# Build one alternation from every emoji the library knows about
emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
print(' '.join(r.findall(s)))
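
One detail worth checking with this technique is the order of the alternatives: a regex alternation takes the first branch that matches, so a multi-codepoint sequence should be listed before any shorter emoji it starts with. The sketch below uses a tiny hypothetical table (KNOWN_EMOJIS) as a stand-in for the library's data and sorts it longest first; this ordering is my assumption about how to make the approach robust, not part of the answer.

```python
import re

# Hypothetical stand-in for the library's table: a flag sequence plus two
# single-codepoint emojis.
KNOWN_EMOJIS = ["\U0001F648", "\U0001f1f7\U0001f1fa", "\U0001F495"]

# Longest first, so a multi-codepoint sequence wins over any shorter prefix.
ordered = sorted(KNOWN_EMOJIS, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(e) for e in ordered))

s = "Hello \U0001f1f7\U0001f1fa hello \U0001F648"
print(pattern.findall(s))  # the flag comes back as one two-codepoint string
```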

Answered by iair linker

The solution that gets exactly what tumbleweed asks for is a mix of the top-rated answer and user594836's answer. This is the code that works for me in Python 3.6.

import emoji
import re

test_list=['🤔 🙈 me así,bla es,se 😌 ds 💕👭👙']

## Create the function to extract the emojis
def extract_emojis(a_list):
    emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
    r = re.compile('|'.join(re.escape(p) for p in emojis_list))
    aux=[' '.join(r.findall(s)) for s in a_list]
    return(aux)

## Execute the function
extract_emojis(test_list)

## the output
['🤔 🙈 😌 💕 👭 👙']

Answered by Phani Rithvij

Another way to do it with emoji is to use emoji.demojize to convert the emojis into their text representations.

For example, 😀 will be converted to :grinning_face:, and so on.

Then find all :.*: patterns, and use emoji.emojize on those.

# -*- coding: utf-8 -*-
import emoji
import re

text = """
Of course, too many emoji characters \
 like , #@^!*&#@^#  helps  people read aaaaaa #douchebag
"""

text = emoji.demojize(text)
text = re.findall(r'(:[^:]*:)', text)
list_emoji = [emoji.emojize(x) for x in text]
print(list_emoji)

This might be a redundant way, but it's an example of how emoji.emojize and emoji.demojize can be used.
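
One thing to watch with the :[^:]*: pattern is ordinary text that already contains colons. The sketch below uses only the standard re module; the demojized string is written by hand as a stand-in for the output of emoji.demojize, and the stricter pattern (no whitespace inside a shortcode) is an assumption about how shortcodes look, not a rule taken from the library.

```python
import re

# Stand-in for the output of emoji.demojize(); written by hand here.
demojized = "Note: see you at 10:30 :grinning_face:"

loose = re.findall(r"(:[^:]*:)", demojized)
strict = re.findall(r":[^\s:]+:", demojized)

print(loose)   # [': see you at 10:', ':grinning_face:']
print(strict)  # [':grinning_face:']
```

The stricter pattern can still be fooled by input such as 10:30:45, so filtering the candidates afterwards may be needed.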

Answered by sushi_dev

from emoji import *

EMOJI_SET = set()

# populate EMOJI_SET
def pop_emoji_dict():
    for emoji in UNICODE_EMOJI:
        EMOJI_SET.add(emoji)

# check if s contains at least one emoji
def is_emoji(s):
    for letter in s:
        if letter in EMOJI_SET:
            return True
    return False

# fill the set once, before the first call to is_emoji
pop_emoji_dict()

This is a better solution when working with large datasets, since you don't have to loop through all emojis each time. I found that this gives me better results :)
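
The same idea can be sketched without the library. Here EMOJI_SET is a tiny hypothetical stand-in for the full table and contains_emoji is a made-up name; the point is only that a set lookup does not scan the whole table for every character.

```python
# Hypothetical stand-in for the library's emoji table.
EMOJI_SET = frozenset(["\U0001F914", "\U0001F648", "\U0001F60C"])

def contains_emoji(s):
    """True if any character of s is in EMOJI_SET (hash lookup per character)."""
    return any(ch in EMOJI_SET for ch in s)

print(contains_emoji("se \U0001F60C ds"))  # True
print(contains_emoji("me así"))            # False
```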

Answered by Cornea Valentin

Step 1: Make sure that your text is decoded from UTF-8: text.decode('utf-8')

Step 2: Locate all the emojis in your text; you must separate the text character by character: [str for str in decode]

Step 3: Save all the emojis in a list: [c for c in allchars if c in emoji.UNICODE_EMOJI]. Full example below (Python 2):

>>> import emoji
>>> text     = "🤔 🙈 me así, bla es se 😌 ds 💕👭👙"
>>> decode   = text.decode('utf-8')
>>> allchars = [str for str in decode]
>>> list     = [c for c in allchars if c in emoji.UNICODE_EMOJI]
>>> print list
[u'\U0001f914', u'\U0001f648', u'\U0001f60c', u'\U0001f495', u'\U0001f46d', u'\U0001f459']

If you want to remove the emojis from the text:

>>> filtred  = [str for str in decode.split() if not any(i in str for i in list)]
>>> clean_text = ' '.join(filtred)
>>> print clean_text
me así, bla es se ds
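
The session above is Python 2, where a plain string holds bytes. In Python 3 a str is already Unicode and has no decode method, so the first two steps only apply to bytes input. A minimal sketch, with the hypothetical helper name to_chars and a made-up sample:

```python
def to_chars(data):
    """Return the characters of data, decoding UTF-8 bytes first if needed."""
    text = data.decode("utf-8") if isinstance(data, bytes) else data
    return list(text)

raw = "\U0001F914 me así".encode("utf-8")
print(to_chars(raw))
print(to_chars("\U0001F914 me así") == to_chars(raw))  # True
```

The resulting list can then be filtered exactly as in step 3.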

Answered by Mohammed Terry Hyman

OK - I had this same problem and worked out a solution which doesn't require you to import any libraries (like emoji or re) and is a single line of code. It will return all the emojis in the string:

def extract_emojis(sentence):
    return [word for word in sentence.split() if str(word.encode('unicode-escape'))[2] == '\\']

This allowed me to create a lightweight solution, and I hope it helps you all. Actually, I needed one that filters out any emojis in a string; that is the same as the code above with one minor change:

def filter_emojis(sentence):
    return [word for word in sentence.split() if str(word.encode('unicode-escape'))[2] != '\\']

Here is an example of it in action:

a = '🤔 🙈 me así, bla es se 😌 ds 💕👭👙'
b = extract_emojis(a)
b = ['🤔', '🙈', '😌', '💕👭👙']
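
To make the trick easier to follow: encoding with unicode-escape turns every non-ASCII character into a backslash escape, so a word that begins with an emoji produces bytes that begin with a backslash. The sketch below checks the bytes directly; starts_with_escape is a hypothetical helper, and the last line shows a limitation I would expect from this approach, namely that any word starting with a non-ASCII letter is flagged too.

```python
def starts_with_escape(word):
    """True if the unicode-escape form of word begins with a backslash."""
    return word.encode("unicode-escape").startswith(b"\\")

print("\U0001F914".encode("unicode-escape"))  # b'\\U0001f914'
print(starts_with_escape("\U0001F914"))       # True
print(starts_with_escape("así,"))             # False
print(starts_with_escape("él"))               # True, although it is not an emoji
```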

Answered by Amar

This function expects a string, so the input list is converted to a string:

a_list = '🤔 🙈 me así, bla es se 😌 ds 💕👭👙'

# Import the necessary modules
from nltk.tokenize import regexp_tokenize

# Tokenize and print only emoji
# (ranges are listed back to back; quotes and | would be literal inside [...])
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

print(regexp_tokenize(a_list, emoji))

output: ['🙈', '😌', '💕', '👭', '👙']
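
When writing such a pattern, keep in mind that quote and | characters placed inside square brackets are ordinary members of the character class, not separators. A small sketch with the standard re module illustrates this; both patterns and the sample string are made up for the demo.

```python
import re

# Inside [...] the quote and | characters are ordinary members of the set.
quoted = re.compile("['\U0001F600-\U0001F64F'|'\u2600-\u26FF']")
plain = re.compile("[\U0001F600-\U0001F64F\u2600-\u26FF]")

sample = "it's a|b \U0001F648"
print(quoted.findall(sample))  # ["'", '|', '🙈']
print(plain.findall(sample))   # ['🙈']
```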