Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/3828723/
Asked by mlzboy
I have seen a few py scripts that use this at the top of the script. In what cases should one use it?
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Accepted answer by pyfunc
As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.
The only way to actually use it is with a reload hack that brings the attribute back.
Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding of py3k is hard-wired to "utf-8" and changing it raises an error.
I suggest some pointers for reading:
- http://blog.ianbicking.org/illusive-setdefaultencoding.html
- http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
- http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
- http://boodebr.org/main/python/all-about-python-and-unicode
- http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
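A minimal sketch of the alternative these articles argue for: decode bytes once at the input boundary and encode once at the output boundary, always naming the encoding. The snippet is written for Python 3, where text and bytes are already separate types. The helper names `read_text` and `write_text` are hypothetical and do not come from the original answer.

```python
import io

def read_text(raw_bytes, encoding="utf-8"):
    """Decode incoming bytes explicitly instead of relying on a default."""
    return raw_bytes.decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    """Encode outgoing text explicitly before writing it to a byte stream."""
    stream.write(text.encode(encoding))

incoming = b"mo\xc3\xa7ambique"   # UTF-8 bytes, e.g. read from a file or socket
text = read_text(incoming)        # work with text inside the program
sink = io.BytesIO()               # stands in for a file opened in binary mode
write_text(sink, text.upper())
```

Because every conversion names its encoding, the result does not depend on any interpreter-wide default.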
Answer by Sérgio
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u
chmod +x test.py
./test.py
moçambique
moçambique
./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)
Printing works in the shell, but fails when stdout is redirected, so a workaround is needed to write to stdout.
I took another approach: the script refuses to run if sys.stdout.encoding is not defined. In other words, you need to export PYTHONIOENCODING=UTF-8 first in order to write to stdout.
import sys
# Refuse to run when stdout has no encoding (for example when it is redirected)
if sys.stdout.encoding is None:
    print >> sys.stderr, "Please set PYTHONIOENCODING=UTF-8 (example: export PYTHONIOENCODING=UTF-8) when writing to stdout."
    sys.exit(1)
So, using the same example:
export PYTHONIOENCODING=UTF-8
./test.py > output.txt
will work.
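The same guard can be sketched in Python 3 syntax. This is only an illustration: the function name `require_encoding` is hypothetical, and the check is applied to any stream object rather than only to `sys.stdout`.

```python
import io
import sys

def require_encoding(stream):
    """Return the stream's encoding, or raise if the stream has none."""
    encoding = getattr(stream, "encoding", None)
    if encoding is None:
        raise RuntimeError(
            "stream has no encoding; try: export PYTHONIOENCODING=UTF-8")
    return encoding

# A text stream created with an explicit encoding passes the check.
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
print(require_encoding(utf8_stream), file=sys.stderr)
```

Raising an exception instead of calling `exit(1)` leaves the decision about how to fail to the caller.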
Answer by Alastair McCormack
tl;dr
The answer is NEVER! (unless you really know what you're doing)
9 times out of 10, the problem can be resolved with a proper understanding of encoding/decoding.
1 in 10 people have an incorrectly defined locale or environment and need to set:
PYTHONIOENCODING="UTF-8"
in their environment to fix console printing problems.
What does it do?
sys.setdefaultencoding("utf-8") (struck through to avoid re-use) changes the default encoding/decoding used whenever Python 2.x needs to convert a unicode object to a str (and vice versa) and the encoding is not given, i.e.:
str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")
In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence the exception on \xe2)
or
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
sys.setdefaultencoding("utf-8") will allow these to work for me, but won't necessarily work for people who don't use UTF-8. The default of ASCII ensures that assumptions of encoding are not baked into code.
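For comparison, here is a rough Python 3 sketch of the same two failures, written with explicit conversions. Python 3 does not perform these conversions implicitly, so the ASCII codec is named by hand here to mimic the Python 2 default. The helper `attempt` is a hypothetical name introduced only for this illustration.

```python
euro_text = "\u20ac"                     # the euro sign as text
euro_bytes = euro_text.encode("utf-8")   # b'\xe2\x82\xac'

def attempt(operation):
    """Run a conversion and report which Unicode error it raised, if any."""
    try:
        return operation()
    except UnicodeError as exc:
        return type(exc).__name__

# text -> bytes with ASCII: mirrors str(u"\u20AC") in Python 2
encode_error = attempt(lambda: euro_text.encode("ascii"))
# bytes -> text with ASCII: mirrors unicode("€") in Python 2
decode_error = attempt(lambda: euro_bytes.decode("ascii"))
```

Naming the codec at each call site makes the assumption visible, which is exactly what an implicit default hides.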
Console
sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user's locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user's locale is broken and just requires PYTHONIOENCODING to fix the console encoding.
Example:
$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()
$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
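The effect of the stream encoding can be reproduced without touching the real console. The sketch below (Python 3) writes the same character to two in-memory streams, one configured as ASCII to mimic the broken locale and one as UTF-8 to mimic PYTHONIOENCODING=UTF-8. The function name `try_write` is hypothetical.

```python
import io

def try_write(text, encoding):
    """Write text to an in-memory stream that uses the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason
    return buffer.getvalue()

ascii_result = try_write("\u20ac", "ascii")   # mimics the broken locale
utf8_result = try_write("\u20ac", "utf-8")    # mimics PYTHONIOENCODING=UTF-8
```

Only the encoding attached to the stream differs between the two calls. The text being printed is identical.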
What's so bad with sys.setdefaultencoding("utf-8")?
People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.
From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
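def welcome_message(byte_string):
    try:
        # Implicit decode: uses the interpreter's default encoding (ASCII unless changed)
        return u"%s runs your business" % byte_string
    except UnicodeError:
        # Fallback: guess the real encoding (detect_encoding is assumed to be defined elsewhere)
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))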
Previous to setting defaultencoding this code would be unable to decode the "Å" in the ascii encoding and then would enter the exception handler to guess the encoding and properly turn it into unicode. Printing: Angstrom (Å®) runs your business. Once you've set the defaultencoding to utf-8 the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.
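The mangling can be checked directly in Python 3 with explicit decodes. This sketch assumes the sample string is "Angstrom (Å®)" encoded as Latin-1, which gives the two bytes 0xC5 0xAE for the non-ASCII characters. The variable names are illustrative only.

```python
sample = "Angstrom (\u00c5\u00ae)"        # "Angstrom (Å®)"
byte_string = sample.encode("latin-1")    # b'Angstrom (\xc5\xae)'

try:
    byte_string.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True                   # this failure is what triggers the fallback

as_utf8 = byte_string.decode("utf-8")     # succeeds, but yields the wrong text
as_latin1 = byte_string.decode("latin-1") # the intended text
```

The bytes 0xC5 0xAE happen to form one valid UTF-8 sequence (U+016E), so the UTF-8 decode raises no error and the exception handler that would have detected the real encoding never runs.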
Changing what should be a constant will have dramatic effects on modules you depend upon. It's better to just fix the data coming in and out of your code.
Example problem
While the setting of defaultencoding to UTF-8 isn't the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
Answer by ivan_pozdeev
The first danger lies in reload(sys).
When you reload a module, you actually get two copies of the module in your runtime. The old module is a Python object like everything else, and stays alive as long as there are references to it. So, half of the objects will be pointing to the old module, and half to the new one. When you make some change, you will never see it coming when some random object doesn't see the change:
(This is IPython shell)
In [1]: import sys
In [2]: sys.stdout
Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
In [3]: reload(sys)
<module 'sys' (built-in)>
In [4]: sys.stdout
Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
In [11]: import IPython.terminal
In [14]: IPython.terminal.interactiveshell.sys.stdout
Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
Now, sys.setdefaultencoding() proper.
All that it affects is implicit conversion str<->unicode. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), the conversion now "just works", what could possibly go wrong?
Well, anything. And that is the danger.
- There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you're strictly on "unsupported" territory here, and no-one gives you guarantees about how their code will behave.
- The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent "default string encodings". (Remember, a program must work for the customer, on the customer's equipment.)
- Again, the worst thing is you will never know that because the conversion is implicit: you don't really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)
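To see that an interpreter really carries several separate encoding settings, the following sketch (Python 3) collects a few of them side by side. Which settings the answer had in mind is an assumption here. The four shown are simply ones the standard library exposes, and `encoding_report` is a hypothetical helper name.

```python
import locale
import sys

def encoding_report():
    """Collect several independent encoding settings of this interpreter."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

report = encoding_report()
for name, value in sorted(report.items()):
    print("%-10s %s" % (name, value))
```

Running this in an IDE, in a terminal, and with output redirected is a quick way to spot the kind of environment differences described above.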

