Unable to decode unicode string in Python 2.4
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/666417/
Asked by Rob Lund
This is in Python 2.4. Here is my situation: I pull a string from a database, and it contains an umlauted 'o' (\xf6). At this point, type(value) returns str. I then attempt to run .decode('utf-8') and get an error ('utf8' codec can't decode bytes in position 1-4).
Really, my goal here is just to make type(value) return unicode. I found an earlier question that had some useful information, but the example from the accepted answer doesn't seem to run for me. Is there something I am doing wrong here?
Here is some code to reproduce:
Name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %s - %s\n' %(Name, type(Name)))
I never actually get to the write statement, because it fails on the first statement.
Thank you for your help.
Edit:
I verified that the DB's charset is utf8. So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs. Is there a difference between 'utf-8' and 'utf8'?
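On the 'utf-8' versus 'utf8' question, the two spellings appear to be aliases for the same codec. A quick check, written as a sketch in Python 3 syntax (where the Python 2 str corresponds to bytes, and codecs.lookup returns an object with a name attribute, which is not the case in Python 2.4):

```python
import codecs

# Look up both spellings; the codec registry normalizes names before searching.
hyphenated = codecs.lookup('utf-8')
plain = codecs.lookup('utf8')

# Both lookups report the same canonical codec name.
same_codec = hyphenated.name == plain.name

# Decoding the same bytes with either spelling gives the same text.
raw = b'w\xc3\xb6rner'
same_result = raw.decode('utf-8') == raw.decode('utf8')

print(same_codec, same_result)
```

So the spelling of the codec name is unlikely to be the cause of the failure.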
The tip on using codecs to write to a file is handy (I'll definitely use it), but in this scenario I am only writing to a log file for debugging purposes.
Accepted answer by bobince
So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs
Not in the first line it doesn't:
>>> 'w\xc3\xb6rner'.decode('utf-8')
u'w\xf6rner'
The second line will error out though:
>>> file.write('Name: %s - %s\n' %(Name, type(Name)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 7: ordinal not in range(128)
Which is entirely what you'd expect, trying to write non-ASCII Unicode characters to a byte stream. If you use Jiri's suggestion of a codecs-wrapped stream you can write Unicode directly, otherwise you will have to re-encode the Unicode string into bytes manually.
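The manual re-encoding route can be sketched as follows. This is written in Python 3 syntax, where the Python 2 unicode type is called str and byte strings are bytes; io.BytesIO is only a stand-in for the log file and is not part of the original code.

```python
import io

# Decode the UTF-8 bytes into text first.
name = b'w\xc3\xb6rner'.decode('utf-8')

# A byte stream only accepts bytes, so encode the text explicitly before writing.
# io.BytesIO stands in for the real log file in this sketch.
log = io.BytesIO()
line = 'Name: %s - %s\n' % (name, type(name).__name__)
log.write(line.encode('utf-8'))

written = log.getvalue()
print(repr(written))
```

The point is that the encoding step is explicit, so the stream never has to fall back to the ASCII codec.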
Better, for logging purposes, would be simply to spit out a repr() of the variable. Then you don't have to worry about Unicode characters being in there, or newlines or other unwanted characters:
name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %r\n' % name)
Name: u'w\xf6rner'
Answer by Jiri
Your string is not in UTF-8 encoding. To decode a byte string to unicode, the string must actually be in the encoding you pass as the parameter. I tried this and it works perfectly:
print 'w\xf6rner'.decode('cp1250')
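To see both outcomes side by side, here is a small sketch in Python 3 syntax (bytes literals play the role of the Python 2 str). It assumes the data really is a single-byte encoding such as cp1250, which the thread suggests but does not confirm.

```python
raw = b'w\xf6rner'   # the single-byte form from the question

# A lone 0xf6 byte is not valid UTF-8, so decoding as UTF-8 is rejected.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same bytes decode cleanly once the codec matches the data.
name = raw.decode('cp1250')

print(utf8_ok, repr(name))
```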
EDIT
To write unicode strings to a file, you can use the codecs module:
import codecs
f = codecs.open("yourfile.txt", "w", "utf8")
f.write( ... )
It is handy to specify the encoding at the input/output boundary and use unicode strings throughout your code, without having to worry about different encodings.
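A fuller version of the codecs snippet, as a sketch in Python 3 syntax. The temporary directory and the file name debug.log are made up for illustration. In current Python the built-in open with an encoding argument does the same job.

```python
import codecs
import os
import tempfile

# Text to log (unicode in Python 2 terms).
name = b'w\xc3\xb6rner'.decode('utf-8')

# Hypothetical log path in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'debug.log')

# The wrapped stream accepts text and encodes it to UTF-8 on write.
f = codecs.open(path, 'w', 'utf8')
try:
    f.write('Name: %s\n' % name)
finally:
    f.close()

# Reading the file back as raw bytes shows the two-byte UTF-8 form of the umlaut.
with open(path, 'rb') as raw_file:
    on_disk = raw_file.read()
print(repr(on_disk))
```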
Answer by vartec
It's obviously a 1-byte encoding. 'ö' in UTF-8 is '\xc3\xb6'.
The encoding might be:
- ISO-8859-1
- ISO-8859-2
- ISO-8859-13
- ISO-8859-15
- Win-1250
- Win-1252
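One way to compare the candidates is to decode the byte with each of them. The sketch below is in Python 3 syntax and uses Python's codec names for the encodings listed above (cp1250 and cp1252 for Win-1250 and Win-1252).

```python
raw = b'\xf6'   # the byte seen in the database value

# Python codec names for the candidate encodings listed above.
candidates = ['iso-8859-1', 'iso-8859-2', 'iso-8859-13',
              'iso-8859-15', 'cp1250', 'cp1252']

# Decode the same byte with every candidate and keep the results.
decoded = {}
for encoding in candidates:
    decoded[encoding] = raw.decode(encoding)

for encoding in candidates:
    print('%-12s %r' % (encoding, decoded[encoding]))
```

All six map 0xf6 to the same character, so this byte alone cannot tell them apart. Other non-ASCII characters in the data, or the database configuration, would be needed to pick one.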
Answer by Staale
You need to use "ISO-8859-1":
Name = 'w\xf6rner'.decode('iso-8859-1')
file.write('Name: %s - %s\n' %(Name, type(Name)))
UTF-8 uses at least 2 bytes for any character outside ASCII, but here it's just 1 byte, so iso-8859-1 is probably correct.
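A quick check of the byte counts, as a sketch in Python 3 syntax: non-ASCII characters take between two and four bytes in UTF-8, while ISO-8859-1 always uses one. The sample characters are chosen only for illustration.

```python
# One ASCII letter, the umlaut from the question, the euro sign, and an emoji.
samples = ['o', '\xf6', '\u20ac', '\U0001f600']

# Number of bytes each character occupies when encoded as UTF-8.
lengths = [len(ch.encode('utf-8')) for ch in samples]

# The same umlaut is a single byte in ISO-8859-1.
latin1_length = len('\xf6'.encode('iso-8859-1'))

print(lengths, latin1_length)
```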