Removing emojis from a string in Python
Source: http://stackoverflow.com/questions/33404752/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Mona Jalal
I found this Python code for removing emojis, but it is not working. Can you suggest other code or a fix for this?
I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.
emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)
Here's the error:
Traceback (most recent call last):
File "test.py", line 52, in <module>
re.sub(emoji_pattern,'',word)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
Each of the items in a list can be a word ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']
UPDATE: I used this other code:
emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
|\
[\U0001F300-\U0001F5FF] # symbols & pictographs\
|\
[\U0001F680-\U0001F6FF] # transport & map symbols\
|\
[\U0001F1E0-\U0001F1FF] # flags (iOS)\
" " ", re.VERBOSE)
emoji_pattern.sub('', word)
But this still doesn't remove the emojis; they are still shown. Any clue why that is?
Accepted answer by Abdul-Razak Adam
This works for me. It is motivated by https://stackoverflow.com/a/43813727/6579239
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
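A caveat worth checking before using the ASCII round trip: it drops every non-ASCII character, not only emoji. The sketch below assumes Python 3 (where every str is Unicode), and de_emojify is a hypothetical renamed copy of the function above.

```python
def de_emojify(text):
    # Encoding to ASCII with errors='ignore' silently drops every
    # character that has no ASCII representation.
    return text.encode("ascii", "ignore").decode("ascii")


emoji_only = de_emojify("This dog \U0001f602")
accented = de_emojify("caf\u00e9 \U0001f602")

print(repr(emoji_only))  # the emoji is gone
print(repr(accented))    # the accented letter is gone as well
```

If the input may contain accented letters or non-Latin scripts, one of the range-based answers is likely a better fit.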
Answer by Bryan Oakley
Because [...] means any one of a set of characters, and because two characters in a group separated by a dash mean a range of characters (often "a-z" or "0-9"), your pattern says: a slash, followed by any one character in the group containing x, {, 1, F, 6, 0, 1, the range } through x, then {, 1, F, 6, 4, F or }, followed by a slash and the letter u. That range in the middle is what re is calling the bad character range.
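The explanation above can be reproduced in a few lines. This is a minimal sketch assuming Python 3; it compiles the pattern from the question, captures the error, and then uses \U escapes, which is how Python spells those code points inside a character class.

```python
import re

# The pattern from the question, copied verbatim.
bad_pattern = r'/[x{1F601}-x{1F64F}]/u'

try:
    re.compile(bad_pattern)
    error_message = None
except re.error as exc:
    # '}' (0x7D) sorts after 'x' (0x78), so "}-x" is a reversed range.
    error_message = str(exc)

print(error_message)

# Same code points written with \U escapes inside the character class.
good_pattern = re.compile('[\U0001F601-\U0001F64F]')
cleaned = good_pattern.sub('', 'This dog \U0001F602')
print(repr(cleaned))
```

The /.../u delimiters and the x{...} escapes come from other regex dialects; Python's re module treats them as literal characters.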
Answer by jfs
On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):
#!/usr/bin/env python
import re
text = u'This dog \U0001f602'
print(text) # with emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
Output
This dog 😂
This dog
Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
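For readers on Python 3, the same idea can be sketched without the u'' prefix, since every str is already Unicode. The helper name strip_emoji and the bytes-or-str handling are assumptions for illustration, not part of the original answer; the byte string is the one from the question's list.

```python
import re

# Assumption: Python 3. Same four ranges as the pattern above.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)


def strip_emoji(data):
    """Hypothetical helper: decode UTF-8 bytes if needed, then drop the ranges."""
    text = data.decode("utf-8") if isinstance(data, bytes) else data
    return EMOJI_PATTERN.sub("", text)


# b'\xf0\x9f\x98\x82' is the UTF-8 encoding of U+1F602.
words = [b"This", b"dog", b"\xf0\x9f\x98\x82"]
print([strip_emoji(w) for w in words])
```

Decoding first matters: the regex operates on code points, so it cannot match the raw \xf0\x9f... bytes the question observed.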
Answer by scwagner
If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:
emoji_pattern = re.compile(
u"(\ud83d[\ude00-\ude4f])|" # emoticons
u"(\ud83c[\udf00-\uffff])|" # symbols & pictographs (1 of 2)
u"(\ud83d[\u0000-\uddff])|" # symbols & pictographs (2 of 2)
u"(\ud83d[\ude80-\udeff])|" # transport & map symbols
u"(\ud83c[\udde0-\uddff])" # flags (iOS)
"+", flags=re.UNICODE)
Answer by KevinTydlacka
The accepted answer and others worked for me for a bit, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane. This excludes future additions to other Unicode planes (where emoji and the like live), which means I don't have to update my code every time new Unicode characters are added :).
In Python 2.7, convert your text to unicode if it is not already, and then use the negated regex below. It substitutes anything not in the character class; the class covers all BMP characters except the surrogates, which are used in pairs to encode characters from the supplementary planes.
NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
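On Python 3 the same approach can be written as a positive match on everything above U+FFFF. This is a minimal sketch under that assumption; strip_non_bmp is a hypothetical helper name.

```python
import re

# Assumption: Python 3. Matches every code point above the BMP.
NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text):
    return NON_BMP_RE.sub("", text)


supplementary = strip_non_bmp("This dog \U0001f602")
bmp_symbol = strip_non_bmp("\u2764 stays")

print(repr(supplementary))  # U+1F602 is above the BMP, so it is removed
print(repr(bmp_symbol))     # U+2764 is inside the BMP, so it survives
```

The second call shows the trade-off: emoji-like symbols that live inside the BMP are not touched by this approach.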
Answer by Ali Tavakoli
A more complete version for removing emojis:
import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"  # very wide range: also covers CJK and many other non-emoji characters
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
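One thing to verify before adopting this pattern: the range U+24C2 to U+1F251 spans most of the BMP, so it removes far more than emoji. The sketch below assumes Python 3 and isolates just that range to show the effect on Chinese text.

```python
import re

# Only the widest range from the pattern above.
wide_range = re.compile("[\U000024C2-\U0001F251]+")

# U+4E2D and U+6587 are ordinary CJK ideographs, not emoji.
sample = "\u4e2d\u6587 text"
result = wide_range.sub("", sample)

print(hex(ord(sample[0])))  # code point of the first ideograph
print(repr(result))         # the ideographs have been removed
```

If the input can contain CJK or other non-Latin text, narrowing or dropping that range is probably safer.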
Answer by octohedron
I tried all the answers; unfortunately, they didn't remove the new hugging face emoji, the clinking glasses emoji, and many more.
I ended up with a list of all possible emoji, taken from the Python emoji package on GitHub. I had to put it in a gist because Stack Overflow answers have a 30k character limit and the list is over 70k characters.
Answer by kingmakerking
If you are not keen on using regex, the best solution could be using the emoji python package.
Here is a simple function to return emoji-free text (thanks to this SO answer):
import emoji

def give_emoji_free_text(text):
    allchars = [str for str in text.decode('utf-8')]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
    return clean_text
If you are dealing with strings containing emojis, this is straightforward:
>> s1 = "Hi How is your and . Have a nice weekend "
>> print s1
Hi How is your and . Have a nice weekend
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
Edits
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
Answer by KT Works
This is my solution. It also removes the additional man and woman emoji variants (sequences ending in ♂ or ♀) that Python could not render.
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\u200d"
u"\u2640-\u2642"
"]+", flags=re.UNICODE)
Answer by Tobias Ernst
Converting the string into another character set like this might help:
text.encode('latin-1', 'ignore').decode('latin-1')
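To see what the Latin-1 round trip keeps and drops, here is a small sketch assuming Python 3; to_latin1 is a hypothetical wrapper around the one-liner above.

```python
def to_latin1(text):
    # Characters outside Latin-1 (U+0000 to U+00FF) are silently dropped.
    return text.encode("latin-1", "ignore").decode("latin-1")


western = to_latin1("caf\u00e9 \U0001f602")
japanese = to_latin1("\u65e5\u672c\u8a9e ok")

print(repr(western))   # the accented letter is kept, the emoji is dropped
print(repr(japanese))  # characters outside Latin-1 are dropped too
```

This suits text known to be Western European; anything outside Latin-1 is lost along with the emoji.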
Kind regards.