How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/267436/
Asked by John
For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
Answered by hark
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
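The session above is Python 2, where a byte string has a .decode() method. A minimal sketch of the same idea on Python 3, assuming the escaped text arrives as a str containing only ASCII characters (the helper name unescape_ascii is made up for illustration):

```python
def unescape_ascii(s):
    """Turn literal escapes such as \\u003c in an ASCII-only str into real characters."""
    # Python 3 str has no .decode(), so go through bytes first.
    # encode('ascii') raises UnicodeEncodeError if s holds non-ASCII text.
    return s.encode('ascii').decode('unicode_escape')


s = r'\u003cfoo/\u003e'      # raw string: the backslashes are literal characters
print(unescape_ascii(s))     # <foo/>
```

The round trip through bytes is only there because the codec decodes from bytes; the result is an ordinary str.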
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
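As a rough illustration of the difference between the two codecs (written for Python 3, where both decode from bytes): 'unicode_escape' interprets the backslash escapes a Python string literal would, while 'raw_unicode_escape' only interprets \uXXXX and \UXXXXXXXX and leaves other backslashes alone. The sample data is made up.

```python
data = rb'\u003cfoo\u003e\n'   # bytes: literal \u escapes followed by a literal backslash-n

full = data.decode('unicode_escape')       # handles \u.... and \n
raw = data.decode('raw_unicode_escape')    # handles \u.... only

print(repr(full))   # '<foo>\n'
print(repr(raw))    # '<foo>\\n'
```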
EDIT: See also Python Standard Encodings.
Answered by MakerDrone
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quoted string and the dash right before the closing triple quotes.
- Using a triple-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it does not disrupt the evaluator;
- The dash at the end is a failsafe in case the user's string ends with a ' \" '. Before we assign the result, we slice off the inserted dash with [:-1].
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
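If the eval approach is used at all, one possible extra precaution (a swapped-in technique, not part of the answer above) is ast.literal_eval, which only accepts Python literals and so cannot run function calls even if the quoting were somehow defeated. A sketch for Python 3, with a hypothetical helper name:

```python
import ast


def unescape_with_literal_eval(s):
    # Same construction as above: escape double quotes, wrap in triple quotes,
    # and append a dash so the user's text is never directly followed by the
    # closing delimiter. The dash is sliced off again after evaluation.
    literal = '"""' + s.replace('"', r'\"') + '-"""'
    # literal_eval accepts literals only; other expressions raise ValueError.
    return ast.literal_eval(literal)[:-1]


print(unescape_with_literal_eval(r'\u003cfoo\u003e'))      # <foo>
print(unescape_with_literal_eval(r'say "hi" \u003c'))      # say "hi" <
```

Input with malformed escapes (for example a truncated \u sequence) still raises an exception, so callers would need to handle that case.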
Answered by OkezieE
At some point you will run into issues when you encounter special characters, such as Chinese characters or emoticons, in a string you want to decode, i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
In my case (Twitter data processing), I decoded as follows, which let me see all characters with no errors:
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
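To make the failure concrete, here is a small Python 3 sketch with made-up sample text: once the escapes are decoded, the string holds non-ASCII characters, so encoding to ASCII fails while UTF-8 succeeds.

```python
s = r'\u4e2d\u6587 \u003cfoo\u003e'          # literal escapes for two Chinese characters plus <foo>
text = s.encode('ascii').decode('unicode_escape')

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii failed:', exc.reason)       # ordinal not in range(128)

encoded = text.encode('utf-8')               # UTF-8 can represent both characters
print(encoded)                               # b'\xe4\xb8\xad\xe6\x96\x87 <foo>'
```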
Answered by Kaniabi
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure if newer versions of Python changed the codec name, but here it only worked with the underscore.
Anyway, this is it.
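One way to check which spellings a given interpreter accepts is to ask the codec registry directly. A small sketch, assuming Python 3, where as far as I know codec names are normalized so that the hyphen and underscore spellings resolve to the same codec:

```python
import codecs

for name in ('unicode-escape', 'unicode_escape'):
    try:
        info = codecs.lookup(name)           # raises LookupError for unknown names
        print(name, '->', info.name)
    except LookupError:
        print(name, 'is not recognized by this interpreter')
```

Running the same loop on an older interpreter would show whether only one spelling is registered there.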
Answered by Ned Batchelder
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
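To see why this is "a little dangerous", consider input that already contains a backslash before a double quote. The sketch below (Python 3, with a deliberately harmless made-up payload and a hypothetical helper name) shows the replace() call turning backslash-quote into an escaped backslash followed by a bare quote, which closes the string literal early and lets the rest run as code:

```python
def naive_unescape(s):
    # The approach from the answer, minus the Python 2 specific u prefix and encode().
    return eval('"' + s.replace('"', r'\"') + '"')


print(naive_unescape(r'\u003cfoo\u003e'))    # <foo>

# Hypothetical hostile input: backslash + quote, an expression, then a comment marker.
payload = r'\" + str(6 * 7) #'
print(naive_unescape(payload))               # prints \42, so str(6 * 7) was executed
```

Anything that can be written as a Python expression could take the place of str(6 * 7), which is why this form should not be fed untrusted input.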