Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/666417/

Unable to decode unicode string in Python 2.4

Tags: python, unicode, decode

Asked by Rob Lund

This is in python 2.4. Here is my situation. I pull a string from a database, and it contains an umlauted 'o' (\xf6). At this point if I run type(value) it returns str. I then attempt to run .decode('utf-8'), and I get an error ('utf8' codec can't decode bytes in position 1-4).

Really my goal here is just to successfully make type(value) return unicode. I found an earlier question that had some useful information, but the example from the picked answer doesn't seem to run for me. Is there something I am doing wrong here?

Here is some code to reproduce:

Name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %s - %s\n' %(Name, type(Name)))

I never actually get to the write statement, because it fails on the first statement.

Thank you for your help.

Edit:

I verified that the DB's charset is utf8. So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs. Is there a difference between 'utf-8' and 'utf8'?

The tip on using codecs to write to a file is handy (I'll definitely use it), but in this scenario I am only writing to a log file for debugging purposes.

Accepted answer by bobince

"So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs"

Not in the first line it doesn't:

>>> 'w\xc3\xb6rner'.decode('utf-8')
u'w\xf6rner'
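
For readers on Python 3 (an assumption; the thread itself targets Python 2.4, where the byte string type is str), the same check can be sketched with a bytes literal. The sketch also tries both spellings of the codec name, which Python treats as aliases of one codec.

```python
import codecs

raw = b"w\xc3\xb6rner"  # UTF-8 bytes for the name with an umlauted o

# Both spellings resolve to the same codec, so the results are identical.
decoded_hyphen = raw.decode("utf-8")
decoded_plain = raw.decode("utf8")

print(repr(decoded_hyphen))
print(decoded_hyphen == decoded_plain)
print(codecs.lookup("utf-8").name == codecs.lookup("utf8").name)
```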

The second line will error out though:

>>> file.write('Name: %s - %s\n' %(Name, type(Name)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 7: ordinal not in range(128)

Which is entirely what you'd expect, trying to write non-ASCII Unicode characters to a byte stream. If you use Jiri's suggestion of a codecs-wrapped stream you can write Unicode directly, otherwise you will have to re-encode the Unicode string into bytes manually.
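
A minimal sketch of the manual re-encoding route, written for Python 3 (an assumption; in Python 2.4 the text type is unicode and the byte type is str). The io.BytesIO object is a hypothetical stand-in for a file opened in binary mode.

```python
import io

name = b"w\xc3\xb6rner".decode("utf-8")
line = "Name: %s - %s\n" % (name, type(name))

# Encoding to ASCII fails because of the non-ASCII character.
try:
    line.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 yields bytes that a byte stream accepts.
stream = io.BytesIO()  # stands in for a file opened in binary mode
stream.write(line.encode("utf-8"))
written = stream.getvalue()

print(ascii_ok)
print(written)
```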

Better, for logging purposes, would be simply to spit out a repr() of the variable. Then you don't have to worry about Unicode characters being in there, or newlines or other unwanted characters:

name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %r\n' % name)

Name: u'w\xf6rner'
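
In Python 3 (an assumption; the answer above is about Python 2.4), repr() leaves printable non-ASCII characters as they are, so the closest counterpart to the escaped output above is probably ascii() or the %a format. A rough sketch:

```python
name = b"w\xc3\xb6rner".decode("utf-8")

# %a applies ascii(), which escapes non-ASCII characters and newlines alike.
log_line = "Name: %a\n" % name
multi = "Name: %a\n" % "two\nlines"

print(log_line, end="")
print(multi, end="")
```

Each log entry stays on one line and contains only ASCII, whatever the value holds.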

Answer by Jiri

Your string is not in UTF-8 encoding. To decode a string to unicode, the string must actually be in the encoding you pass as the parameter. I tried this and it works perfectly:

print 'w\xf6rner'.decode('cp1250')
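
The contrast can be sketched in Python 3 syntax (an assumption; the line above is Python 2): the lone byte 0xF6 is not valid UTF-8, while a single-byte codec such as cp1250 decodes it.

```python
raw = b"w\xf6rner"  # a single byte 0xF6 for the umlauted o

# Decoding as UTF-8 fails, because 0xF6 is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec that matches the data succeeds.
text = raw.decode("cp1250")

print(utf8_ok)
print(repr(text))
```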

EDIT

To write unicode strings to a file, you can use the codecs module:

import codecs
f = codecs.open("yourfile.txt", "w", "utf8")
f.write( ... )

It is handy to specify the encoding at the input/output boundaries and use unicode strings throughout your code, without worrying about different encodings.
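
A small sketch of that idea in Python 3 syntax (an assumption), using codecs.getwriter to wrap a byte stream. The in-memory io.BytesIO buffer is a hypothetical stand-in for a real file, chosen only to keep the example self-contained.

```python
import codecs
import io

buffer = io.BytesIO()  # stands in for a file opened in binary mode
writer = codecs.getwriter("utf-8")(buffer)  # encodes text on every write

name = "w\xf6rner"
writer.write("Name: %s\n" % name)  # text goes in, UTF-8 bytes come out
encoded = buffer.getvalue()

print(encoded)
```

The rest of the code only ever handles text; the encoding is named once, where the stream is wrapped.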

Answer by vartec

It's obviously a 1-byte encoding. 'ö' in UTF-8 is '\xc3\xb6'.

The encoding might be:

  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-13
  • ISO-8859-15
  • Win-1250
  • Win-1252
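
One way to explore the candidates is to decode the same bytes with each of them, sketched here in Python 3 syntax (an assumption). The codec names cp1250 and cp1252 are Python's names for Win-1250 and Win-1252. Since several single-byte codecs can decode the byte without error, a clean decode alone does not prove which encoding the database really uses.

```python
raw = b"w\xf6rner"
candidates = [
    "iso-8859-1", "iso-8859-2", "iso-8859-13",
    "iso-8859-15", "cp1250", "cp1252",
]

# Decode the same bytes with every candidate and collect the results.
decoded = {}
for encoding in candidates:
    decoded[encoding] = raw.decode(encoding)

for encoding, text in decoded.items():
    print(encoding, repr(text))
```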

Answer by Staale

You need to use "ISO-8859-1":

Name = 'w\xf6rner'.decode('iso-8859-1')
file.write('Name: %s - %s\n' %(Name, type(Name)))

UTF-8 uses at least 2 bytes for any character outside ASCII, but here it is just 1 byte, so iso-8859-1 is probably correct.
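
The byte counts can be checked directly; a quick sketch in Python 3 syntax (an assumption), with the euro sign added as an extra illustration of a character that needs more than two bytes:

```python
char = "\xf6"  # the umlauted o

latin1_bytes = char.encode("iso-8859-1")  # one byte: 0xF6
utf8_bytes = char.encode("utf-8")         # two bytes: 0xC3 0xB6
euro_utf8 = "\u20ac".encode("utf-8")      # the euro sign takes three bytes

print(len(latin1_bytes), len(utf8_bytes), len(euro_utf8))
```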
