Python string decoding issue

Question

Asked by Yuval Adam

I am trying to parse a CSV file containing some data, mostly numeric but with some strings. I do not know the strings' encoding, but I do know they are in Hebrew.

Eventually I need to know the encoding so I can decode the strings to unicode, print them, and perhaps throw them into a database later on.

I tried using Chardet, which claims the strings are Windows-1255 (cp1255), but trying to do print someString.decode('cp1255') yields the notorious error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

I tried every other encoding possible, to no avail. Also, the file is absolutely valid since I can open the CSV in Excel and I see the correct data.

Any idea how I can properly decode these strings?

EDIT: here is an example. One of the strings looks like this (first five letters of the Hebrew alphabet):

print repr(sampleString)
#prints:
'\xe0\xe1\xe2\xe3\xe4'

(using Python 2.6.2)

Answer 1

Answered by codeape

This is what's happening:

sampleString is a byte string (cp1255 encoded).
sampleString.decode("cp1255") decodes the byte string to a unicode string (decode == bytes -> unicode string).
print sampleString.decode("cp1255") attempts to print the unicode string to stdout. To do that, print has to encode the unicode string (encode == unicode string -> bytes). The error you're seeing means that the Python print statement cannot encode the given unicode string in the console's encoding, which is reported by sys.stdout.encoding.
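
The three steps above can be sketched as follows. This is only an illustration, written so that it also runs on Python 3 (hence the b and u prefixes); the variable names are made up for the example, and the ASCII encode stands in for what print does when the console encoding is ASCII.

```python
# Byte string as read from the file (cp1255-encoded Hebrew letters).
sample_bytes = b'\xe0\xe1\xe2\xe3\xe4'

# decode: bytes -> unicode string. This step succeeds.
text = sample_bytes.decode('cp1255')

# encode: unicode string -> bytes. print has to do this internally,
# using the console's encoding. With an ASCII console it fails:
try:
    text.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)
```

Here the decode succeeds and only the encode step raises, which matches the UnicodeEncodeError in the question.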

So the problem is that your console does not support these characters. You should be able to tweak the console to use another encoding. The details of how to do that depend on your OS and terminal program.

Another approach would be to manually specify the encoding to use:

print sampleString.decode("cp1255").encode("utf-8")
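
If you do not want to hard-code the target encoding, one possible sketch is a small helper that falls back to UTF-8 when the console encoding is unknown. The helper name encode_for_console and its defaults are assumptions made for this example, not part of any library.

```python
import sys

def encode_for_console(text, encoding=None, errors='replace'):
    """Encode a unicode string for output (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is piped.
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # errors='replace' substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(target, errors)

hebrew = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
utf8_bytes = encode_for_console(hebrew, 'utf-8')
ascii_bytes = encode_for_console(hebrew, 'ascii')
```

On Python 2 the returned byte string can be passed straight to print; on Python 3 it would be written to sys.stdout.buffer instead.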

A simple test program you can experiment with:

import sys
print sys.stdout.encoding
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print samplestring.decode("cp1255").encode(sys.argv[1])

On my utf-8 terminal:

$ python2.6 test.py utf-8
UTF-8
אבגדה

$ python2.6 test.py latin1
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

$ python2.6 test.py ascii
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

$ python2.6 test.py cp424
UTF-8
ABCDE

$ python2.6 test.py iso8859_8
UTF-8
?????

The error messages for latin-1 and ascii mean that the unicode characters in the string cannot be represented in these encodings.

Notice the last two. I encode the unicode string to the cp424 and iso8859_8 encodings (two of the encodings listed on http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I get no exception using these encodings, since the Hebrew unicode characters have a representation in them.

But my utf-8 terminal gets very confused when it receives bytes in a different encoding than utf-8.

In the first of these (cp424), my UTF-8 terminal displays ABCDE, meaning that the utf-8 representation of A coincides with the cp424 representation of א, i.e. the byte value 65 means A in utf-8 and א in cp424.
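
The byte-level coincidence can be checked directly. The following is a small sketch (variable names are illustrative) that encodes the five Hebrew letters both ways and then reads the resulting bytes back as UTF-8, which is roughly what a UTF-8 terminal attempts to do.

```python
hebrew = u'\u05d0\u05d1\u05d2\u05d3\u05d4'  # first five Hebrew letters

cp424_bytes = hebrew.encode('cp424')
iso_bytes = hebrew.encode('iso8859_8')

# cp424 maps these letters to byte values 0x41..0x45, which a UTF-8
# (or ASCII) reader interprets as the Latin letters A..E.
misread = cp424_bytes.decode('utf-8')

# For these letters iso8859_8 yields the same bytes as cp1255.
# They are not valid UTF-8, so a strict UTF-8 decode fails.
try:
    iso_bytes.decode('utf-8')
    iso_is_valid_utf8 = True
except UnicodeDecodeError:
    iso_is_valid_utf8 = False
```

In both cases the encode itself succeeds; the garbage only appears because the reader assumes a different encoding than the writer used.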

The encode method has an optional string argument you can use to specify what should happen when the encoding cannot represent a character (documentation). The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies.
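
As a rough sketch of what a custom strategy can look like, the standard codecs.register_error function lets you register a handler under a name of your choosing. The handler name 'underscore' and the function below are invented for this example.

```python
import codecs

def replace_with_underscore(exc):
    """Hypothetical custom strategy: one '_' per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return the replacement text and the position at which to resume.
    return (u'_' * (exc.end - exc.start), exc.end)

codecs.register_error('underscore', replace_with_underscore)

hebrew = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
result = (u'[' + hebrew + u']').encode('latin1', 'underscore')
```

The brackets are encoded normally, while each Hebrew letter that latin1 cannot represent is replaced by an underscore.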

Another test program (I print with quotes around the string to better show how ignore behaves):

import sys
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1], 
      sys.argv[2]))

The results:

$ python2.6 test.py latin1 strict
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    sys.argv[2]))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
[/tmp]
$ python2.6 test.py latin1 ignore
''
[/tmp]
$ python2.6 test.py latin1 replace
'?????'
[/tmp]
$ python2.6 test.py latin1 xmlcharrefreplace
'&#1488;&#1489;&#1490;&#1491;&#1492;'
[/tmp]
$ python2.6 test.py latin1 backslashreplace
'\u05d0\u05d1\u05d2\u05d3\u05d4'

Answer 2

Answered by Mike Graham

When you decode the string to unicode with someString.decode('cp1255'), you have an abstract representation of some Hebrew text in unicode. (This part happens successfully!) When you use print, you need a concrete, encoded representation in a specific encoding. It looks like your problem isn't with the decode, but with the print.

To print, either just print someString if your terminal understands cp1255, or print someString.decode('cp1255').encode('the_encoding_your_terminal_does_understand'). If you don't need the resulting output to be readable as Hebrew, print repr(someString.decode('cp1255')) also gets you a meaningful representation of the abstract unicode string.

Answer 3

Answered by Yuval Adam

Is someString perhaps not a normal byte string, as your sampleString would have us believe, but a unicode string?

>>> print '\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
<hebrew characters>

>>> print u'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/encodings/cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters [...]
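
One way to read the second traceback, offered as an explanation rather than something the answer spells out: in Python 2, calling .decode() on a unicode string first encodes it implicitly with the default ascii codec, and that hidden step is what raises UnicodeEncodeError. The sketch below makes the hidden step explicit so that it also runs on Python 3, where str has no decode method. The latin-1 round trip at the end assumes the unicode string merely holds the original byte values as code points.

```python
# Code points U+00E0..U+00E4 in a unicode string, not raw bytes.
already_unicode = u'\xe0\xe1\xe2\xe3\xe4'

# Python 2 performs the equivalent of this encode before decoding.
try:
    already_unicode.encode('ascii')
    implicit_step_ok = True
except UnicodeEncodeError:
    implicit_step_ok = False

# latin-1 maps each code point 0..255 to the same byte value, so it
# recovers the original bytes, which can then be decoded as cp1255.
recovered = already_unicode.encode('latin-1').decode('cp1255')
```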

Answer 4

Answered by Ignacio Vazquez-Abrams

You're getting an encode error when printing, so most likely it's decoding fine; you just can't print out the result properly. Try running chcp 65001 at the command prompt before starting the Python code.
