Removing emojis from a string in Python

Question

Asked by Mona Jalal

I found this Python code for removing emojis, but it is not working. Can you suggest other code or a fix for this?

I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here's the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list can be a word, for example ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

UPDATE: I used this other code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emojis; they are still shown! Any clue why that is?

Answer 1

Accepted answer by Abdul-Razak Adam

This works for me. It is motivated by https://stackoverflow.com/a/43813727/6579239

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
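
Note that this approach drops every non-ASCII character, not only emoji. A minimal sketch of the effect, assuming Python 3 (where `str` is already Unicode); the sample strings are made up for illustration:

```python
def deEmojify(inputString):
    # Encode to ASCII, silently dropping anything that is not ASCII,
    # then decode back to a text string.
    return inputString.encode('ascii', 'ignore').decode('ascii')

print(repr(deEmojify('This dog \U0001f602')))  # the emoji is removed
print(repr(deEmojify('caf\u00e9')))            # the accented letter is removed too
```

If the text may contain accented letters or non-Latin scripts, one of the regex-based answers is probably a better fit.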

Answer 2

Answered by Bryan Oakley

Because [...] means any one of a set of characters, and because two characters in a group separated by a dash mean a range of characters (often "a-z" or "0-9"), your pattern says "a slash, followed by any one character in the group containing x, {, 1, F, 6, 0, 1, the range } through x, {, 1, F, 6, 4, F or }, followed by a slash and the letter u". That range in the middle, } through x, is what re is calling the bad character range: } has a higher code point than x, so the range runs backwards.
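
The failure can be reproduced with a few lines. This is only a sketch, assuming Python 3 (where the exception is `re.error`); the variable names are illustrative. It also shows the same code point range written with Python's `\U` escapes instead of the PHP-style `/.../u` delimiters:

```python
import re

php_style = r'/[x{1F601}-x{1F64F}]/u'

try:
    re.compile(php_style)
    message = None
except re.error as exc:
    # '}' (0x7D) comes after 'x' (0x78), so "}-x" is a reversed range
    message = str(exc)

print(message)

# The same range written the way Python's re module expects it
python_style = re.compile('[\U0001F601-\U0001F64F]')
print(repr(python_style.sub('', 'This dog \U0001f602')))
```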

Answer 3

Answered by jfs

On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog

Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
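
On Python 3 every `str` is already Unicode, so neither the `u''` prefix nor `re.UNICODE` is needed. A minimal sketch of the same pattern under that assumption (the name `strip_emoji` is made up here); it still covers only the four blocks listed above:

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (regional indicator symbols)
    "]+"
)

def strip_emoji(text):
    """Remove characters from the four blocks above; other emoji are kept."""
    return EMOJI_PATTERN.sub("", text)

print(repr(strip_emoji("This dog \U0001f602")))
```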

Answer 4

Answered by scwagner

If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:

emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)

Answer 5

Answered by KevinTydlacka

The accepted answer and others worked for me for a while, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane. This also excludes future additions to the other Unicode planes (where emoji and the like live), which means I don't have to update my code every time new Unicode characters are added :).

In Python 2.7, convert your text to unicode if it is not already, and then use the negated regex below. It substitutes anything not in the character class, which contains every BMP character except the surrogates (surrogates are used in pairs to represent characters from the supplementary planes).

NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
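
On Python 3 the same idea needs no decoding step. A minimal sketch, assuming Python 3.3 or later (where strings are never stored as surrogate pairs); `strip_non_bmp` is an illustrative name. Unlike the Python 2 version it leaves lone surrogates alone, and emoji that live inside the BMP, such as U+263A, are not removed:

```python
import re

# Anything from U+10000 upwards lies outside the Basic Multilingual Plane
NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")

def strip_non_bmp(text):
    return NON_BMP_RE.sub("", text)

print(repr(strip_non_bmp("This dog \U0001f602")))
print(repr(strip_non_bmp("smile \u263a")))  # U+263A is in the BMP, so it stays
```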

Answer 6

Answered by Ali Tavakoli

A more complete version for removing emojis:

import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very wide range, also covers many non-emoji characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
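
One caveat worth checking before using this pattern: the range `\U000024C2-\U0001F251` spans far more than emoji, including CJK ideographs and Hangul. A short sketch, assuming Python 3 and made-up sample strings, that isolates this one range to show the effect:

```python
import re

# Only the widest range from the pattern above, isolated for the demonstration
wide_range = re.compile("[\U000024C2-\U0001F251]+")

print(repr(wide_range.sub("", "love \u2764")))         # the heart (U+2764) is removed
print(repr(wide_range.sub("", "\u4f60\u597d world")))  # the Chinese text is removed as well
print(repr(wide_range.sub("", "caf\u00e9")))           # below U+24C2, so it is kept
```

If the input can contain East Asian text, narrowing or dropping that range may be necessary.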

Answer 7

Answered by octohedron

I tried all the answers; unfortunately, they didn't remove the new hugging face emoji, the clinking glasses emoji, and a lot more.

I ended up with a list of all possible emoji, taken from the Python emoji package on GitHub, and I had to create a gist because there's a 30k character limit on Stack Overflow answers and the list is over 70k characters.

Answer 8

Answered by kingmakerking

If you are not keen on using regex, the best solution could be using the emoji python package.

Here is a simple function to return emoji-free text (thanks to this SO answer):

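import emoji

def give_emoji_free_text(text):
    # text is a UTF-8 encoded byte string (Python 2 str); decode it once
    decoded = text.decode('utf-8')
    # collect every character that the emoji package lists as an emoji
    emoji_list = [c for c in decoded if c in emoji.UNICODE_EMOJI]
    # keep only the whitespace-separated words that contain none of those characters
    clean_text = ' '.join(word for word in decoded.split()
                          if not any(e in word for e in emoji_list))
    return clean_text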

If you are dealing with strings containing emojis, this is straightforward:

>> s1 = "Hi  How is your  and . Have a nice weekend "
>> print s1
Hi  How is your  and . Have a nice weekend 
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend

If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.

>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog

Edits

Based on the comment, it should be as easy as:

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))

Answer 9

Answered by KT Works

This is my solution. It also removes the additional man and woman emoji that can't be rendered by Python, ?♂ and ?♀:
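
The man and woman variants are sequences rather than single characters: a base emoji, then U+200D (zero width joiner), then a gender sign. A small sketch, assuming Python 3, that lists the code points of one such sequence (man shrugging is picked here only as an illustration, and `code_points` is a made-up helper):

```python
def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

# person shrugging + zero width joiner + male sign + variation selector-16
man_shrugging = "\U0001F937\u200D\u2642\uFE0F"
print(code_points(man_shrugging))
```

Removing only the base emoji would leave the joiner and the gender sign behind in the text, which is why the pattern in this answer lists them explicitly.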

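import re

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"  # dingbats
                       u"\U000024C2-\U0001F251"  # very wide range, also covers many non-emoji characters
                       u"\U0001f926-\U0001f937"  # gesture emoji such as face palm and shrug
                       u"\u200d"                 # zero width joiner
                       u"\u2640-\u2642"          # includes the female and male signs
                       "]+", flags=re.UNICODE)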

Answer 10

Answered by Tobias Ernst

Converting the string into another character set like this might help:

text.encode('latin-1', 'ignore').decode('latin-1')
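
For reference, a sketch of what this keeps and what it drops, assuming Python 3 (`to_latin1` is an illustrative name). Latin-1 covers only U+0000 to U+00FF, so Western European accented letters survive while emoji and every other script are removed:

```python
def to_latin1(text):
    # Characters that Latin-1 cannot represent are silently dropped
    return text.encode("latin-1", "ignore").decode("latin-1")

print(repr(to_latin1("caf\u00e9 \U0001f602")))  # the accent stays, the emoji goes
print(repr(to_latin1("\u4f60\u597d")))          # Chinese text is dropped entirely
```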
