Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/24571790/
Convert UTF-8 to string literals in Python
Asked by Tminer
I have a string in UTF-8 format, but I'm not sure how to convert this string to its corresponding character literal. For example, I have the string:
My string is: 'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
returns the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is that I can't put 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like the result of c to be:
Entreé
Accepted answer by Martijn Pieters
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
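As a hedged illustration of the decode/encode pair described above: the answer targets Python 2, where str is a byte string and unicode is text. The sketch below assumes Python 3 instead, where the corresponding types are bytes and str. The variable names raw, text and round_trip are made up for the example.

```python
# Assumption: Python 3, where bytes plays the role of Python 2's str
# and str plays the role of Python 2's unicode.
raw = b'Entre\xc3\xa9'              # UTF-8 encoded bytes held in a variable

text = raw.decode('utf-8')          # bytes -> text, using the right encoding
round_trip = text.encode('utf-8')   # text -> bytes again

print(type(raw).__name__, type(text).__name__)
print(len(raw), len(text))          # 7 bytes, 6 characters
print(round_trip == raw)
```

The decoding step works the same whether the bytes come from a literal or from a variable, which is the case the question was stuck on.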
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
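The round trip between a value and its literal representation can be sketched as follows. This again assumes Python 3, where repr() no longer adds the u prefix and the built-in ascii() produces the escaped, ASCII-only form. ast.literal_eval stands in here for pasting the text back into an interpreter.

```python
import ast

text = 'Entre\xe9'                     # a text string containing U+00E9

literal = ascii(text)                  # source-code form with escapes
restored = ast.literal_eval(literal)   # parse the literal back into a value

print(literal)
print(restored == text)
```

The representation is only a way of writing the value down. The object it evaluates to compares equal to the original.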
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
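A minimal sketch of how such a Mojibake arises and how the encode-then-decode step undoes it, assuming Python 3 string semantics. The helper name fix_mojibake is hypothetical and not part of the original answer.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded as Latin-1."""
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so encoding
    # recovers the original UTF-8 byte sequence, which is then decoded properly.
    return text.encode('latin-1').decode('utf-8')


correct = 'Entre\xe9'

# Produce the Mojibake: UTF-8 bytes wrongly interpreted as Latin-1.
mojibake = correct.encode('utf-8').decode('latin-1')

print(ascii(mojibake))
print(fix_mojibake(mojibake) == correct)
```

This repair only applies to text that really was mis-decoded as Latin-1. For a string containing characters above U+00FF, the encode('latin-1') call raises UnicodeEncodeError.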
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder