Python UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function

Question

Asked by Carlos Eugenio Thompson Pinzón

I am writing a Python (Python 3.3) program that sends some data to a webpage using the POST method. Mostly for debugging, I fetch the page result and display it on the screen using the print() function.

The code is like this:

conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'));

The HTTPResponse.read() method returns a bytes object encoding the page (which is a well-formed UTF-8 document). This seemed fine until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash), which the print function handles well in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the default strict error handling, I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>

I could fix it using this quite ugly code:

print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))

Now it replaces the offending character "—" with a ?. Not ideal (a hyphen would be a better replacement), but good enough for my purpose.

There are several things I do not like about my solution.

The code is ugly with all that decoding, encoding, and decoding.
It solves the problem for just this case. If I port the program to a system using some other encoding (latin-1, cp437, back to cp1252, etc.), it should recognize the target encoding. It does not. (For instance, when I use the IDLE GUI again, the em-dash is also lost, which didn't happen before.)
It would be nicer if the em-dash were translated to a hyphen instead of a question mark.

The problem is not the em-dash (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00C1 (which is possible in my database) can be encoded in CP-850 (the DOS/Windows console encoding for Western European languages) but not in CP-437 (the encoding for US English, which is the default in many Windows installations).
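
The mismatch between code pages can be checked directly. The following is a minimal sketch, not part of the original program: can_encode is a hypothetical helper name, and the three code pages are simply the ones mentioned in this question. It tests whether the em-dash (U+2014) and 'Á' (U+00C1) exist in each codec.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)  # the default error mode is 'strict'
        return True
    except UnicodeEncodeError:
        return False


# Check the two characters discussed above against three code pages.
for codec in ("cp1252", "cp850", "cp437"):
    print(codec, can_encode("\u2014", codec), can_encode("\u00c1", codec))
```

A check like this only tells you whether output would fail; it does not choose a replacement character.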

So, the question:

Is there a nicer solution that makes my code agnostic of the output interface encoding?

Answer 1

Answered by Dirk Stöcker

I see three solutions to this:

1. Change the output encoding so that it always outputs UTF-8. See e.g. "Setting the correct encoding when piping stdout in Python", though I could not get those examples to work.
2. Make the output aware of your target charset. The following example code (Python 2 syntax) does this:
```
# -*- coding: utf-8 -*-
import sys

# Show the encoding of the output stream, then encode explicitly for it
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
```
This example properly replaces any non-printable character in my name with a question mark.
If you create a custom print function, e.g. called myprint, that uses this mechanism to encode output properly, you can simply replace print with myprint wherever necessary without making the whole code look ugly.
3. Reset the output encoding globally at the beginning of the program:
The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary of how to change the output encoding. The section "StreamWriter Wrapper around Stdout" is especially interesting. Essentially, it says to change the I/O encoding function like this:
In Python 2:
```
import codecs
import sys

if sys.stdout.encoding != 'cp850':
  sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
  sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')
```
In Python 3:
```
import codecs
import sys

if sys.stdout.encoding != 'cp850':
  sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
  sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
```
If used in a CGI script that outputs HTML, you can replace 'strict' with 'xmlcharrefreplace' to get numeric character references for non-printable characters.
Feel free to modify the approaches, e.g. by setting different encodings. Note that it still won't work for data whose encoding is unspecified, so any data, input, or text must be correctly convertible to unicode:
```
# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker"                # works
print "Stöcker".decode("utf-8") # works
print "Stöcker"                 # fails
```
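
For Python 3, the idea of resetting the output stream globally can also be sketched with io.TextIOWrapper, keeping whatever encoding the stream already has and changing only the error mode. This is a sketch under assumptions, not the method from the linked page: lenient_writer is a hypothetical helper name, and a BytesIO object stands in for a cp850 console so the example runs anywhere.

```python
import io


def lenient_writer(buffer, encoding, errors="backslashreplace"):
    """Wrap a byte stream in a text stream that escapes unencodable
    characters instead of raising UnicodeEncodeError."""
    return io.TextIOWrapper(buffer, encoding=encoding, errors=errors,
                            line_buffering=True)


# Stand-in for a console that only understands cp850.
fake_console = io.BytesIO()
out = lenient_writer(fake_console, "cp850")

print("caf\u00e9 \u2014 ok", file=out)  # U+2014 does not exist in cp850
out.flush()

# What the console would have received, decoded back for inspection.
shown = fake_console.getvalue().decode("cp850")
print(ascii(shown))

# For real output, the same wrapper could be applied once at start-up:
# sys.stdout = lenient_writer(sys.stdout.buffer, sys.stdout.encoding)
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") changes the error mode of the existing stream without rewrapping it.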

Answer 2

Answered by Jelle Fresen

Based on Dirk Stöcker's answer, here's a neat wrapper function for Python 3's print function. Use it just like you would use print.

As an added bonus, compared to the other answers, this won't print your text as a bytes literal (b'content') but as a normal string ('content'), because of the final decode step.

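import sys

def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
    enc = file.encoding
    if enc.lower() == 'utf-8':  # the reported name varies in case
        # UTF-8 can represent every character, so print directly
        print(*objects, sep=sep, end=end, file=file)
    else:
        # Round-trip through the target encoding, escaping what it cannot represent
        f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
        print(*map(f, objects), sep=sep, end=end, file=file)

uprint('foo')
uprint(u'Antonín Dvořák')
uprint('foo', 'bar', u'Antonín Dvořák')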
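
If a look-alike character is preferred over an escape sequence (for example a hyphen for the em-dash, as the question asks), the same round-trip can use a custom error handler registered with codecs.register_error. The sketch below is an illustration only: the handler name fallback_replace, the helper downgrade, and the contents of the ASCII_FALLBACKS table are all assumptions, not part of this answer.

```python
import codecs

# Hypothetical table of look-alike replacements; extend as needed.
ASCII_FALLBACKS = {
    "\u2014": "-",   # em-dash
    "\u2013": "-",   # en-dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def fallback_replace(exc):
    """Error handler: use a look-alike if one is known, otherwise '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(ASCII_FALLBACKS.get(ch, "?") for ch in bad)
    return replacement, exc.end  # resume encoding after the bad span


codecs.register_error("fallback_replace", fallback_replace)


def downgrade(text, encoding):
    """Return text limited to what the target encoding can represent."""
    return text.encode(encoding, errors="fallback_replace").decode(encoding)


print(ascii(downgrade("Dvo\u0159\u00e1k \u2014 Symphony", "cp850")))
```

Passing errors='fallback_replace' in place of 'backslashreplace' inside uprint would give the same behaviour there.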

Answer 3

Answered by jfs

For debugging purposes, you could use print(repr(data)).

To display text, always print Unicode. Don't hardcode the character encoding of your environment, such as cp850, inside your script. To decode the HTTP response, see "A good way to get the charset/encoding of an HTTP response in Python".
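
As a rough illustration of decoding the response without hardcoding an encoding, the charset can be read from the Content-Type header. This is a sketch under assumptions: decode_body is a hypothetical helper, the UTF-8 default is a guess for servers that send no charset, and an email.message.Message object stands in for response.headers (which is a subclass of it in http.client) so that the example runs without a network connection.

```python
from email.message import Message


def decode_body(data, headers, default="utf-8"):
    """Decode response bytes using the charset named in Content-Type,
    falling back to a default when the header does not name one."""
    charset = headers.get_content_charset() or default
    return data.decode(charset, errors="replace")


# Stand-in for response.headers from http.client.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

text = decode_body("caf\u00e9".encode("latin-1"), headers)
print(ascii(text))
```

With a real connection this would be called as decode_body(response.read(), response.headers).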

To print Unicode to the Windows console, you could use the win-unicode-console package.

Answer 4

Answered by Solumyr

With Python 3.6 (possibly 3.5 or later), I no longer get this error. I had a similar issue while using v3.4, but it went away after I uninstalled and reinstalled Python.

Answer 5

Answered by leemonq

I dug deeper into this and found that the best solutions are here:

http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python

In my case, I solved "UnicodeEncodeError: 'charmap' codec can't encode character" as follows.

Original code:

print("Process lines, file_name command_line %s\n" % command_line)

New code:

print("Process lines, file_name command_line %s\n" % command_line.encode('utf-8'))
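
One thing worth knowing about the new code on Python 3: formatting a bytes object with %s inserts its repr, so the output is an escaped b'...' literal, not the original characters. That is why it never raises on any console. The short sketch below shows the effect next to the built-in ascii(), which gives a similar escaped result without encoding first. The sample value of command_line is made up for illustration.

```python
command_line = "run \u2014 fast"  # hypothetical value containing an em-dash

# %s on bytes inserts the repr of the bytes object (Python 3 behaviour).
encoded = "%s" % command_line.encode("utf-8")

# ascii() escapes non-ASCII characters but keeps the value a str.
escaped = "%s" % ascii(command_line)

print(encoded)
print(escaped)
```

Both results contain only ASCII characters, so they print safely under any code page.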

Answer 6

Answered by Željko Krnjić

If you are using the Windows command line to print the data, you should use:

chcp 65001

This worked for me!
