Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address: StackOverflow, http://stackoverflow.com/questions/36598136/

Time: 2020-08-19 18:05:05  Source: igfitidea

Remove all hex characters from string in Python

Tags: python, python-2.7, utf-8, character-encoding, string-parsing

Asked by Kludge

Although there are similar questions, I can't seem to find a working solution for my case:

I'm encountering some annoying hex chars in strings, e.g.

'\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'

What I need is to remove these hex \xHH characters, and them alone, in order to get the following result:

'http://www.google.com blah blah#%#@$^blah'

Decoding doesn't help:

s.decode('utf8') # u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'

How can I achieve that?

Answered by Magnun Leno

Just remove all non-ASCII characters:

>>> s.decode('utf8').encode('ascii', errors='ignore')
'http://www.google.com blah blah#%#@$^blah'
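For readers on Python 3, here is a minimal sketch of the same round trip. It assumes the data arrives as UTF-8 encoded bytes (the question's value written as a bytes literal); the variable names are illustrative only, not part of the original answer.

```python
# Python 3 sketch: the question's value modelled as UTF-8 bytes (assumption).
raw = b'\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'

# Decode the bytes to text first, so each curly quote becomes one character.
text = raw.decode('utf-8')

# Encoding to ASCII with errors='ignore' silently drops every non-ASCII
# character; decoding again turns the result back into a str.
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')

print(ascii_only)
```

Note that this drops every non-ASCII character, including accented letters, not just the two curly quotes.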

Another possible solution:

>>> import string
>>> s = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'http://www.google.com blah blah#%#@$^blah'
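The session above relies on Python 2, where filter() applied to a str returns a str. On Python 3 it returns a lazy iterator instead, so a rough equivalent (assuming the input is already decoded text) would join the surviving characters back together:

```python
import string

# Python 3 sketch: the input is assumed to be decoded text, not raw bytes.
s = '\u201chttp://www.google.com\u201d blah blah#%#@$^blah'

# A set gives fast membership tests for the printable ASCII characters.
printable = set(string.printable)

# filter() yields characters lazily in Python 3, so join them into a str.
cleaned = ''.join(filter(lambda ch: ch in printable, s))

print(cleaned)
```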

Or use regular expressions:

>>> import re
>>> re.sub(r'[^\x00-\x7f]', r'', s)
'http://www.google.com blah blah#%#@$^blah'
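If the same cleanup runs on many strings, the regular expression can be compiled once and wrapped in a small helper. This is only a sketch for Python 3 text strings; NON_ASCII and drop_non_ascii are hypothetical names, not something from the original answer.

```python
import re

# Hypothetical helper names. The pattern matches any character outside
# the 7-bit ASCII range (code points 0x00 to 0x7f).
NON_ASCII = re.compile(r'[^\x00-\x7f]')


def drop_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub('', text)


print(drop_non_ascii('\u201chttp://www.google.com\u201d blah blah#%#@$^blah'))
```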

Pick your favorite one.

Answered by bruno desthuilliers

These are not "hex characters" but the internal representation (utf-8 encoded in the first case, unicode code point in the second case) of the unicode characters 'LEFT DOUBLE QUOTATION MARK' ('“') and 'RIGHT DOUBLE QUOTATION MARK' ('”').

>>> s = "\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
>>> print s
“http://www.google.com” blah blah#%#@$^blah
>>> s.decode("utf-8")
u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'
>>> print s.decode("utf-8")
“http://www.google.com” blah blah#%#@$^blah

As for how to remove them, they are just ordinary characters, so a simple str.replace() will do:

>>> s.replace("\xe2\x80\x9c", "").replace("\xe2\x80\x9d", "")
'http://www.google.com blah blah#%#@$^blah'
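When more than a couple of specific characters need to go, chaining replace() calls gets unwieldy. One alternative, swapped in here instead of the replace() chain, is str.translate with a table that maps each unwanted code point to None. The sketch below assumes Python 3 and already-decoded text; QUOTES_TO_DROP is an illustrative name.

```python
# Python 3 sketch using str.translate instead of chained replace() calls.
# QUOTES_TO_DROP is a hypothetical name; mapping a code point to None
# tells translate() to delete that character.
QUOTES_TO_DROP = {
    ord('\u201c'): None,  # LEFT DOUBLE QUOTATION MARK
    ord('\u201d'): None,  # RIGHT DOUBLE QUOTATION MARK
}

s = '\u201chttp://www.google.com\u201d blah blah#%#@$^blah'
cleaned = s.translate(QUOTES_TO_DROP)

print(cleaned)
```

Unlike the ASCII round trip, this removes only the listed characters and leaves any other non-ASCII text untouched.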

If you want to get rid of all non-ascii characters at once, you just have to decode to unicode then encode to ascii with the "ignore" parameter:

>>> s.decode("utf-8").encode("ascii", "ignore")
'http://www.google.com blah blah#%#@$^blah'

Answered by Peter

You could make it check for valid letters, and instead of typing out everything, it's possible to use the string module. The constants that may be useful to you are string.ascii_letters (contains both string.ascii_lowercase and string.ascii_uppercase), string.digits, string.printable and string.punctuation.

I'd try string.printable first, but if it lets a few too many characters through, you could use a mix of the others.

Here's an example of how I'd do it:

import string
valid_characters = string.printable
start_string = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
end_string = ''.join(i for i in start_string if i in valid_characters)
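If string.printable turns out to be too permissive (it includes tabs, newlines and other whitespace), a stricter whitelist can be assembled from the other constants. The combination below is just one possible choice, sketched for Python 3 text strings, and the sample input with a tab and a trailing newline is made up for illustration.

```python
import string

# One possible stricter whitelist (an assumption, adjust as needed):
# letters, digits, punctuation and the plain space, but no tabs or newlines.
valid_characters = set(
    string.ascii_letters + string.digits + string.punctuation + ' '
)

# Made-up sample: curly quotes plus a tab and a trailing newline.
start_string = '\u201chttp://www.google.com\u201d\tblah blah#%#@$^blah\n'

# Keep only the characters that appear in the whitelist.
end_string = ''.join(ch for ch in start_string if ch in valid_characters)

print(end_string)
```

Building the whitelist as a set keeps each membership test constant-time, which matters only for long inputs.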

Answered by Manthan Koolwal

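You can encode to ASCII while ignoring errors and then decode the result, like this (this works on a Python 3 str; on a Python 2 byte string containing non-ASCII bytes, the implicit ASCII decode raises UnicodeDecodeError):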

s.encode('ascii', errors='ignore').decode("utf-8")