Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/24571790/

Date: 2020-08-19 04:47:55  Source: igfitidea

Convert UTF-8 to string literals in Python

Tags: python, string, utf-8, literals

Asked by Tminer

I have a string in UTF-8 format, but I am not sure how to convert it to its corresponding character literal. For example, I have the string:

My string is: 'Entre\xc3\xa9'

Example one:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns the result: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is, I can't make 'Entre\xc3\xa9' a variable and pass it through the steps as this now breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of "c" to be:

Entreé

Accepted answer by Martijn Pieters

The u'' syntax only works for string literals, i.e. when defining values in source code. Using that syntax creates a unicode object, but it is not the only way to create such an object.

You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
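
As a rough sketch of that decode/encode pair: the snippet below uses hypothetical variable names and an explicit b'' prefix (an assumption made here so the same lines run on Python 2.6+ and Python 3, where the byte-string type is called bytes and the text type is called str).

```python
# Hypothetical names; b'' marks a byte string explicitly on both Python 2 and 3.
raw_bytes = b'Entre\xc3\xa9'          # UTF-8 encoded bytes held in a variable

text = raw_bytes.decode('utf-8')      # byte string -> Unicode text
round_trip = text.encode('utf-8')     # Unicode text -> byte string again

print(len(raw_bytes))  # 7: the last character occupies two bytes in UTF-8
print(len(text))       # 6: the same data counted in characters
print(round_trip == raw_bytes)  # True
```

Because the value lives in a variable, no literal prefix is involved at any point; only the method calls do the conversion.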

Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
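
The paste-back behaviour can be approximated in code. This is only an illustration with hypothetical names: it uses ast.literal_eval to stand in for pasting the text into an interpreter, and the exact repr text differs by version (Python 2 shows the u prefix, Python 3 omits it).

```python
import ast

value = u'Entre\xe9'

# repr() produces source-code text for the value; its exact form depends
# on the Python version, so it is not printed here.
representation = repr(value)

# Parsing that text as a literal yields an object equal to the original.
restored = ast.literal_eval(representation)
print(restored == value)  # True
```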

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

Your first example created Mojibake: a Unicode string containing Latin-1 code points that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
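
A small sketch of how such Mojibake can arise and be undone, assuming the bytes were decoded with the wrong codec somewhere upstream (the variable names are hypothetical, and the b'' prefix is used so the lines run on both Python 2.6+ and Python 3):

```python
utf8_bytes = b'Entre\xc3\xa9'

# Decoding with the wrong codec maps every byte to one code point: Mojibake.
mojibake = utf8_bytes.decode('latin-1')

# Undo it: encoding to Latin-1 restores the original bytes,
# and decoding those bytes as UTF-8 gives the intended text.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(mojibake == u'Entre\xc3\xa9')  # True
print(repaired == u'Entre\xe9')      # True
```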

You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
