Chinese and Japanese character support in Python
Note: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/14682933/
Asked by user2030113
How do I correctly read Japanese and Chinese characters?
I'm using Python 2.5. The output is displayed as "E:\Test\?????????"
path = r"E:\Test\は最高のプログラマ"
t = path.encode()
print t
u = path.decode()
print u
t = path.encode("utf-8")
print t
t = path.decode("utf-8")
print t
Answered by m.brindley
You should force the string to be a unicode object, like:
path = ur"E:\Test\は最高のプログラマ"
Docs on string literals relevant to 2.5 are located here
Edit: I'm not positive whether the object is a unicode in 2.5, but the docs do state that \uXXXX[XXXX] will be processed and the string will be "a Unicode string".
Answered by Martijn Pieters
Please do read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.
If you want to include Japanese text literals in your code, you have several options:
Use unicode literals (create unicode objects instead of byte strings), but any non-ASCII codepoint is represented by a unicode escape character. They take the form of \uabcd, so a backslash, a u and 4 hexadecimal digits:
ru = u'\u30EB'
would be one character, the katakana 'ru' codepoint ('ル').
Use unicode literals, but include the characters in some form of encoding. Your text editor will save files in a given encoding (say, UTF-8); you need to declare that encoding at the top of the source file. The declared encoding must be ASCII-compatible, so UTF-16 cannot be used for Python source files:
# encoding: utf-8
ru = u'ル'
where 'ル' is included without using an escape. The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.
Use byte string literals, ready encoded. Encode the codepoints by some other means and include them in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:
ru = '\xeb\x30'  # ru encoded to UTF-16 little-endian
I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.
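The options above describe the same character in different forms. A minimal sketch of how they relate, assuming an interpreter that accepts the `u''` prefix (Python 2, or Python 3.3 and later); the variable names are illustrative only:

```python
# Option 1: a unicode literal written with an escape (katakana 'ru', U+30EB).
ru = u'\u30EB'

# Option 3: the same codepoint, already encoded to bytes.
ru_utf16le = ru.encode('utf-16-le')  # two bytes, low byte first
ru_utf8 = ru.encode('utf-8')         # three bytes in UTF-8

print(len(ru))           # number of characters in the unicode object
print(repr(ru_utf16le))  # the byte values used in the byte string option
print(repr(ru_utf8))
```

The UTF-16 little-endian bytes are the ones used in the byte string literal above; a different codec gives different bytes for the same character, which is why the encoding always has to be known.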
The next problem will be your terminal: the Windows console is notorious for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8 instead. See this question for some details, but you need to run the following command in the console:
chcp 65001
to switch to UTF-8, and you may need to switch to a console font that can handle your codepoints (Lucida perhaps?).
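A run of question marks like the one in the question is what a lossy conversion to a character set without Japanese produces. The sketch below reproduces that effect. It is an assumption that the asker's console did something equivalent, and `ascii` merely stands in for whatever code page the console was using:

```python
# The folder name from the question, written with escapes so the snippet
# does not depend on the encoding of the source file.
name = u'\u306f\u6700\u9ad8\u306e\u30d7\u30ed\u30b0\u30e9\u30de'

# Encoding to a character set that lacks these characters, with
# errors='replace', substitutes one '?' per character.
lossy = name.encode('ascii', 'replace')

# UTF-8 can represent every character, so nothing is lost.
utf8 = name.encode('utf-8')

print(repr(lossy))
print(len(name), len(utf8))
```

The text itself is intact in the second case; only the display step decides whether it shows up as Japanese or as placeholders.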
Answered by jfs
There are two independent issues:
You should specify the Python source encoding if you use non-ASCII characters, and use Unicode literals for data that represents text, e.g.:
# -*- coding: utf-8 -*-
path = ur"E:\Test\は最高のプログラマ"
Printing Unicode to the Windows console is complicated, but if you set the correct font then just:
print path
might work.
Regardless of whether your console can display the path, it should be fine to pass the Unicode path to filesystem functions, e.g.:
entries = os.listdir(path)
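To try this without touching E:\Test, here is a small sketch that builds a scratch directory first. The folder and file names are hypothetical, and it assumes the filesystem and locale can store non-ASCII names:

```python
import os
import shutil
import tempfile

# Hypothetical folder name, written with escapes (part of the name in the question).
folder_name = u'\u306f\u6700\u9ad8'

base = tempfile.mkdtemp()  # scratch directory, removed at the end
try:
    path = os.path.join(base, folder_name)
    os.mkdir(path)
    # Create one empty file so the listing has something to show.
    open(os.path.join(path, u'note.txt'), 'w').close()

    found = os.path.isdir(path)
    entries = os.listdir(path)  # a Unicode path is passed straight through
finally:
    shutil.rmtree(base)

print(found, entries)
```

Nothing here prints the Japanese name itself, so the snippet works even when the console cannot display it.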
Don't call .encode(char_enc) on bytestrings; call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings; call it on bytestrings instead.
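A short sketch of why the direction matters, written so that it runs on both Python 2 and Python 3; the `outcome` strings are only labels for this illustration:

```python
# A byte string holding non-ASCII data (katakana 'ru' encoded as UTF-8).
data = u'\u30EB'.encode('utf-8')

# Wrong direction: calling .encode() on a byte string.
try:
    data.encode('utf-8')
    outcome = 'no error'
except UnicodeDecodeError:
    # Python 2 first decodes the bytes implicitly as ASCII, which fails here.
    outcome = 'implicit ASCII decode failed'
except AttributeError:
    # Python 3 byte strings have no .encode() method at all.
    outcome = 'bytes has no encode method'

# Right direction: decode bytes to text, then encode text to bytes.
text = data.decode('utf-8')
again = text.encode('utf-8')

print(outcome)
```

The code in the question calls both methods on the same byte string, so in Python 2 it goes through the implicit ASCII step shown here.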

