Converting UTF-8 to a string literal in Python

Question

Asked by Tminer

I have a string in UTF-8 format, but I'm not sure how to convert it to its corresponding character literal. For example, I have the string:

My string is: 'Entre\xc3\xa9'

Example one:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns the result: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is that when I put 'Entre\xc3\xa9' in a variable and pass it through the same steps, it breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of "c" to be:

Entreé

Answer 1

Accepted answer by Martijn Pieters

The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.

You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
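
As a rough illustration of the decode/encode round trip described above, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used in this thread: there, bytes takes the place of Python 2's str, and str takes the place of unicode. The variable names are only for illustration.

```python
# Assumption: Python 3, where bytes ~ Python 2 str and str ~ Python 2 unicode.
raw = b'Entre\xc3\xa9'             # UTF-8 encoded bytes, as in the question

text = raw.decode('utf-8')         # bytes -> text
print(ascii(text))                 # 'Entre\xe9'
print(len(raw), len(text))         # 7 6

round_trip = text.encode('utf-8')  # text -> bytes
print(round_trip == raw)           # True
```

The two-byte sequence \xc3\xa9 collapses into the single code point \xe9, which is why the decoded text is one character shorter than the byte string.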

Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
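
The point about the representation being pasteable can be checked with a small sketch. This again assumes Python 3, where the representation carries no u prefix; ascii() is used here to get an escaped, ASCII-only representation, and ast.literal_eval stands in for pasting the text back into the interpreter.

```python
import ast

# Assumption: Python 3; the decoded value is a str.
text = b'Entre\xc3\xa9'.decode('utf-8')

shown = ascii(text)                # escaped representation of the string
print(shown)                       # 'Entre\xe9'

# Evaluating the representation as a literal rebuilds an equal object.
rebuilt = ast.literal_eval(shown)
print(rebuilt == text)             # True
```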

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
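
To make the Mojibake explanation concrete, the following sketch reproduces the broken string and then repairs it. It assumes Python 3, and the variable names are hypothetical. Decoding with Latin-1 is used only to build the broken string, because Latin-1 maps every byte to the code point with the same number.

```python
# Assumption: Python 3 (bytes for encoded data, str for text).
raw = b'Entre\xc3\xa9'               # the real UTF-8 bytes

# Mojibake: each byte wrongly became its own code point (Latin-1 view).
mojibake = raw.decode('latin-1')
print(len(mojibake))                 # 7
print(ascii(mojibake))               # 'Entre\xc3\xa9'

# Undo it: recover the original bytes, then decode with the right codec.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(ascii(fixed))                  # 'Entre\xe9'
```

The repair only works because Latin-1 round-trips all 256 byte values without loss; it is not a general fix for text that was decoded with some other wrong codec.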

You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
