UnicodeDecodeError when redirecting Python output to a file
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license; reuse must follow the same license and attribute the original authors.
原文地址: http://stackoverflow.com/questions/4545661/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
UnicodeDecodeError when redirecting to file
Asked by zedoo
I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py > out.txt:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?
Accepted answer by Eric O Lebigot
The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalkboard, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.
On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).
In summary, computers need to internally represent characters with bytes, and they do so through two operations:
Encoding: characters → bytes
Decoding: bytes → characters
Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
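This non-uniqueness can be shown with Python 3's unicodedata module (an illustration of the point above, not part of the original answer):

```python
import unicodedata

# The same user-perceived character 'é' as two different code point
# sequences: one precomposed code point vs. base letter + combining accent.
composed = "\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' followed by a combining acute accent (U+0301)

print(composed == decomposed)     # False: different code point sequences
print(composed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8")) # b'e\xcc\x81' -- different bytes, too
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```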
Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).
Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these code points can be combined together to form a "grapheme cluster". Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).
While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).
This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to be, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
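The Korean example can be reproduced directly in Python 3; as a sketch, unicodedata.normalize also shows that the three jamo code points compose into the single precomposed code point:

```python
import unicodedata

jamo = "\u1100\u1161\u11a8"   # 3 code points: initial, vowel, final jamo
precomposed = "\uac01"        # the same syllable as 1 precomposed code point

print(jamo, "len", len(jamo))                # len 3
print(precomposed, "len", len(precomposed))  # len 1
# NFC normalization composes the jamo sequence into the single code point:
print(unicodedata.normalize("NFC", jamo) == precomposed)  # True
```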
In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
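The Python 3 half of this difference can be checked directly; the snake emoji here is my stand-in for the answer's 4-byte example character:

```python
# In Python 3, indexing a str yields a 1-character string (a code point),
# while indexing a bytes object yields an int (the byte's value).
s = "\U0001F40D"            # one code point, encoded as 4 bytes in UTF-8
b = s.encode("utf-8")

print(len(s), len(b))   # 1 4
print(repr(s[0]))       # the character itself
print(b[0])             # 240 (0xf0, the leading byte of a 4-byte sequence)
```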
With these few key points, you should be able to understand most encoding related questions!
Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:
% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8
If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).
If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python, with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
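The error from the question can be reproduced explicitly by encoding the string by hand (a sketch; 'ascii' stands in for a terminal or pipe encoding that cannot represent the characters):

```python
uni = u"\u001A\u0BC3\u1451\U0001D10C"

# Encoding with a codec that cannot represent the characters raises
# UnicodeEncodeError, which is what print raises implicitly when the
# output stream's encoding is too limited.
try:
    uni.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii:", exc.reason)

# A Unicode encoding such as UTF-8 can encode every code point:
utf8_bytes = uni.encode("utf-8")
print("utf-8:", len(utf8_bytes), "bytes")
```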
However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):
% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8
The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:
% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
If printing to a terminal does not produce what you expect, you can check that the UTF-8 encoding that you put in manually is correct; for instance, your first character (\u001A) is not printable, if I'm not mistaken.
At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:
import codecs
import locale
import sys
# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
For Python 3, you can check one of the questions asked previouslyon StackOverflow.
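For Python 3.7 and later there is also a built-in counterpart to the codecs wrapper above: io.TextIOWrapper.reconfigure. This is not from the original answer; a minimal sketch, demonstrated on an in-memory stream (the same call works on sys.stdout when it is a regular TextIOWrapper):

```python
import io

# Change a text stream's encoding at run time (Python 3.7+).
buf = io.BytesIO()
stream = io.TextIOWrapper(buf, encoding="ascii")
stream.reconfigure(encoding="utf-8")     # switch from ASCII to UTF-8

stream.write("\u0BC3\u1451\U0001D10C")   # would fail under ASCII
stream.flush()
print(len(buf.getvalue()), "bytes written")  # the UTF-8 bytes
# In a real program: sys.stdout.reconfigure(encoding="utf-8")
```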
Answered by ismail
Encode it while printing
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")
This is because when you run the script manually, Python encodes the string before outputting it to the terminal; when you pipe it, Python does not encode it itself, so you have to encode manually when doing I/O.
Answered by Mark Tolonen
Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.
In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
Example 1
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is printed to stderr so it can be seen even when stdout is redirected to a file.
#coding: utf8
import sys
uni = u'α?ΓπΣσμτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
α?ΓπΣσμτΦΘΩδ∞φ
Python correctly determined the encoding of the terminal.
Output (redirected to file)
None
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Python could not determine the encoding (None), so it used the 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.
Output (redirected to file, PYTHONIOENCODING=cp437)
cp437
and my output file was correct:
C:\>type out.txt
α?ΓπΣσμτΦΘΩδ∞φ
Example 2
Now I'll throw in a character in the source that isn't supported by my terminal:
#coding: utf8
import sys
uni = u'α?ΓπΣσμτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni
Output (run directly from terminal)
cp437
Traceback (most recent call last):
File "C:\ex.py", line 5, in <module>
print uni
File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>
My terminal didn't understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
cp437
α?ΓπΣσμτΦΘΩδ∞φ?
Error handlers can be specified along with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
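The effect of the different error handlers can also be seen by encoding by hand (a small illustration; ASCII is used as a deliberately limited target encoding):

```python
s = u"caf\u00e9"   # 'é' is not representable in ASCII

print(s.encode("ascii", "replace"))            # b'caf?'
print(s.encode("ascii", "ignore"))             # b'caf'
print(s.encode("ascii", "xmlcharrefreplace"))  # b'caf&#233;'
```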

