Unable to decode unicode string in Python 2.4

Question

Asked by Rob Lund

This is in Python 2.4. Here is my situation: I pull a string from a database, and it contains an umlauted 'o' (\xf6). At this point, type(value) returns str. I then attempt to run .decode('utf-8') and get an error ('utf8' codec can't decode bytes in position 1-4).

Really, my goal here is just to make type(value) return unicode. I found an earlier question that had some useful information, but the example from the accepted answer doesn't seem to run for me. Is there something I am doing wrong here?

Here is some code to reproduce:

Name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %s - %s\n' %(Name, type(Name)))

I never actually get to the write statement, because it fails on the first statement.

Thank you for your help.

Edit:

I verified that the DB's charset is utf8. So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs. Is there a difference between 'utf-8' and 'utf8'?

The tip on using codecs to write to a file is handy (I'll definitely use it), but in this scenario I am only writing to a log file for debugging purposes.

Answer 1

Accepted answer by bobince

So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs

Not in the first line it doesn't:

>>> 'w\xc3\xb6rner'.decode('utf-8')
u'w\xf6rner'

The second line will error out though:

>>> file.write('Name: %s - %s\n' %(Name, type(Name)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 7: ordinal not in range(128)

Which is entirely what you'd expect, trying to write non-ASCII Unicode characters to a byte stream. If you use Jiri's suggestion of a codecs-wrapped stream you can write Unicode directly, otherwise you will have to re-encode the Unicode string into bytes manually.
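
The manual re-encoding route mentioned above might look like the following minimal sketch. It is written in Python 3 syntax so that it runs on a current interpreter: the `b'...'` literal plays the role of the Python 2 `str`, and `io.BytesIO` is only a stand-in for a log file opened in binary mode. Both are assumptions for illustration and not part of the original code.

```python
import io

# Bytes as they would come from the database (UTF-8 encoded).
raw = b'w\xc3\xb6rner'

# Decode once at the boundary to get a text (unicode) value.
name = raw.decode('utf-8')

# A byte stream only accepts bytes, so encode explicitly before writing.
stream = io.BytesIO()  # stand-in for a file opened in binary mode
line = 'Name: %s\n' % name
stream.write(line.encode('utf-8'))

written = stream.getvalue()
```

The point of the sketch is that the encoding step is explicit: the text value is turned back into bytes with a named codec instead of relying on the implicit ASCII default.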

Better, for logging purposes, would be simply to spit out a repr() of the variable. Then you don't have to worry about Unicode characters being in there, or newlines or other unwanted characters:

name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %r\n' % name)

Name: u'w\xf6rner'

Answer 2

Answer by Jiri

Your string is not in UTF-8 encoding. If you want to decode a string to unicode, the string must actually be in the encoding you pass as the parameter. I tried this and it works perfectly:

print 'w\xf6rner'.decode('cp1250')
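
To make the difference concrete, here is a small sketch that decodes the same single-byte value with the wrong codec and with two plausible ones. It uses Python 3 syntax (a `b'...'` literal in place of the Python 2 `str`), which is an assumption for the sake of a runnable example; the variable names are illustrative.

```python
# One byte for the umlaut, as in the value pulled from the database.
raw = b'w\xf6rner'

# Decoding as UTF-8 fails: 0xf6 is not a valid UTF-8 lead byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Single-byte codecs that map 0xf6 to U+00F6 decode it without error.
as_cp1250 = raw.decode('cp1250')
as_latin1 = raw.decode('iso-8859-1')
```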

EDIT

To write unicode strings to a file, you can use the codecs module:

import codecs
f = codecs.open("yourfile.txt", "w", "utf8")
f.write( ... )

It is handy to specify the encoding once at the input/output boundary and use unicode strings throughout the rest of your code, without having to worry about different encodings.
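
As a rough illustration of that boundary idea, the sketch below writes a unicode string through a codecs-wrapped file and then reads the raw bytes back to show what landed on disk. The temporary directory and file name are hypothetical, and the byte-string comparison assumes a Python 3 interpreter.

```python
import codecs
import os
import tempfile

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'yourfile.txt')

# The wrapper encodes unicode to UTF-8 on every write.
f = codecs.open(path, 'w', 'utf8')
f.write(u'Name: w\xf6rner\n')
f.close()

# Read the file back as raw bytes to inspect the encoded form.
raw_file = open(path, 'rb')
on_disk = raw_file.read()
raw_file.close()
```

The code that calls `f.write` never mentions bytes; the encoding is named only where the file is opened.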

Answer 3

Answer by vartec

It's obviously a 1-byte encoding. 'ö' in UTF-8 is '\xc3\xb6'.

The encoding might be:

ISO-8859-1
ISO-8859-2
ISO-8859-13
ISO-8859-15
Win-1250
Win-1252

Answer 4

Answer by Staale

You need to use "ISO-8859-1":

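Name = 'w\xf6rner'.decode('iso-8859-1')
# Encode back to bytes before writing; a plain file object cannot take non-ASCII unicode
file.write(('Name: %s - %s\n' % (Name, type(Name))).encode('utf-8'))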

UTF-8 uses at least 2 bytes to encode anything outside ASCII, but here it's just 1 byte, so ISO-8859-1 is probably correct.
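
The byte counts can be checked directly. This sketch encodes an ASCII letter, the umlauted 'o', and the euro sign (the last one added here only as an example of a longer sequence) and measures the lengths; the names are illustrative.

```python
# An ASCII letter, the umlauted 'o', and the euro sign.
samples = [u'o', u'\xf6', u'\u20ac']

# Number of bytes each character takes in UTF-8.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

# The same umlaut takes a single byte in ISO-8859-1.
latin1_length = len(u'\xf6'.encode('iso-8859-1'))
```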
