Python: printing Japanese (Chinese) characters

Question

Asked by anonntheitroad

I read Japanese, and want to try processing some Japanese text. I tried this using Python 3:

for i in range(1,65535):
    print(chr(i), end='')

Python then gave me tons of errors. What went wrong?

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Traceback (most recent call last):
  File "C:\test\char.py", line 11, in <module>
    print(chr(i), end='')
  File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>

My understanding is that the chr function converts Unicode numbers into the respective Japanese characters. If so, why are the Japanese characters not output? Why does it crash at the end of the list of Roman characters?

Please also correct me if I am mistaken in my understanding that the Unicode set was devised solely to cater for non-Western languages.

EDIT:

I tried the 3 lines suggested by John Machin in IDLE, and the output worked!

Before this, I had been using Programmer's Notepad, with the Tools set to capture python.exe compiler's output. Perhaps that is why the errors came about.

However, for most other things, the output is captured properly; then why does it fail particularly in this process? i.e. Why does the code work in the IDLE Python Shell, but not through Programmer's Notepad output capture? Shouldn't the output be the same, regardless of the interface?

Answer 1

Answered by John Machin

If as you say you read Japanese, you must be aware that Japanese is written using FOUR different types of characters: (1) kanji (Chinese characters) (2) Katakana (3) Hiragana (4) Romaji ("Roman" letters). There are many tens of thousands of kanji of which only a few thousand are in common use.

Your code, had it worked as you imagined it might, would have printed not only the "Roman" characters, but also Greek, Arabic, Hebrew, Cyrillic (used in Russian etc), Armenian, half a dozen or so different but related character sets used in India, many I've left out, about 11 thousand Hangul Syllables (used in Korean) and a bunch of gibberish for code points that aren't used, and (depending on which shell you were running it in) may have crashed when it got to 0xD800 (the first surrogate).

A little less ambition will give you Hiragana, Katakana, and a few "CJK Unified Ideographs". The examples below were run in IDLE.

>>> for i in range(0x3040, 0x30a0): print(chr(i), end='')

?ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをん???????゛゜ゝゞ?
>>> for i in range(0x30a0, 0x3100): print(chr(i), end='')

?ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶ?????ーヽヾ?
>>> for i in range(0x4e00, 0x4f00): print(chr(i), end='')

一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乎乏乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿

Update: The reason you had a problem is that the shell/IDE that you were using supplies only the Windows GUI bog-standard stdout, for which the default encoding (in your neck of the woods) is cp1252 (remember the mention of cp1252 in your traceback?), which is adequate in your case for the Romaji but not much else. Available-anywhere-without-downloads alternatives: (1) IDLE (2) write a file encoded in UTF-8 and read it in Notepad. I'm sure others could suggest other IDEs.
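
A minimal sketch of the two points above, assuming Python 3: it first reproduces the cp1252 failure from the traceback, then writes a UTF-8 file that an editor such as Notepad can open. The file name hiragana.txt and the use of the temp directory are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

# cp1252 has no mapping for U+0080, which is where the loop in the question died.
try:
    "\x80".encode("cp1252")
    cp1252_failed = False
except UnicodeEncodeError as exc:
    cp1252_failed = True
    print("cp1252 cannot encode:", ascii(exc.object))

# The assigned Hiragana letters U+3041 through U+3096.
hiragana = "".join(chr(i) for i in range(0x3041, 0x3097))

# Hypothetical output path; any writable location works.
path = os.path.join(tempfile.gettempdir(), "hiragana.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(hiragana)

# Read the file back to confirm the text survived the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Passing encoding="utf-8" to open() explicitly matters here, because otherwise the file would be written with the platform default, which on the asker's machine appears to be cp1252 again.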

Answer 2

Answered by Jürgen A. Erhard

Your problem is your default terminal (output) encoding. Probably latin-1 or even the perennial Python default, ASCII. Those can't encode Japanese characters (since it's assumed that the terminal can't display them).

If your terminal does UTF-8 (the most often used Unicode encoding in the western world), you can either "trick" Python into taking this as the default output encoding, or you can explicitly encode the Unicode text to UTF-8 with:

>>> print(chr(i).encode("UTF-8"), end='')
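
One caveat, sketched here under the assumption of Python 3: print() applied to a bytes object shows its repr (b'\xe3...') rather than the characters, so the encoded bytes need to go to a binary stream, or the text stream itself needs to be a UTF-8 one. In this sketch an io.BytesIO stands in for the real binary output, which on a console would be sys.stdout.buffer.

```python
import io

# Five Hiragana characters starting at U+3042.
text = "".join(chr(i) for i in range(0x3042, 0x3047))

# In Python 3, str() of a bytes object is its repr, which is what print() would show.
encoded = text.encode("utf-8")
shown_by_print = str(encoded)

# Stand-in for a binary output stream (assumption: sys.stdout.buffer on a real console).
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")

# Printing the str to a UTF-8 text stream lets the stream do the encoding.
print(text, end="", file=out)
out.flush()
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the same effect for the real standard output, provided the terminal can actually display UTF-8.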

And as to the "solely", I think that's wrong. It was created to be the one encoding to bind them... ehm, sorry, the one and only encoding we'll ever need. The encoding (okay, that's using "encoding" not in the sense it's used in the Unicode definition) that can be used to encode all text documents.

Answer 3

Answered by devio

No need to try all 65536 codes of the BMP. Just use the code blocks used for Japanese text.
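
As a rough sketch of that idea, assuming Python 3: the ranges below are the ones used elsewhere in this thread (the CJK range is only the first 256 ideographs, for illustration), and the helper name assigned_chars is made up for this example. It skips code points that have no Unicode name, which avoids the placeholder characters seen in the IDLE output.

```python
import unicodedata

# Ranges borrowed from the examples in this thread; stop values are exclusive.
BLOCKS = {
    "Hiragana": (0x3040, 0x30A0),
    "Katakana": (0x30A0, 0x3100),
    "CJK Unified Ideographs (first 256)": (0x4E00, 0x4F00),
}

def assigned_chars(start, stop):
    """Return the characters in [start, stop) that have a Unicode name."""
    return "".join(
        chr(cp)
        for cp in range(start, stop)
        if unicodedata.name(chr(cp), None) is not None
    )

for label, (start, stop) in BLOCKS.items():
    chars = assigned_chars(start, stop)
    # ascii() keeps this summary printable even on a cp1252 console.
    print(label, len(chars), ascii(chars[:3]))
```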

Answer 4

Answered by Wooble

You're attempting to encode a character (\x80) that isn't defined by your codec; there is no correct mapping, so charmap_encode raises an exception. You could wrap the print call in a try: block, then catch and ignore the exception so that only the characters you can encode are printed.
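
A small sketch of that try/except approach, assuming Python 3. To keep it reproducible on any machine, a cp1252 text stream over an io.BytesIO stands in for the asker's console; with a real console the file=out argument would simply be dropped.

```python
import io

# Stand-in for a cp1252 console: a text stream that encodes to cp1252.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="cp1252")

skipped = 0
for i in range(0x20, 0x100):
    try:
        print(chr(i), end="", file=out)
    except UnicodeEncodeError:
        # This character has no cp1252 mapping; ignore it and carry on.
        skipped += 1

# Push any buffered bytes through to the underlying binary stream.
out.flush()
```

This only hides the unencodable characters rather than displaying them, so for Japanese text on a cp1252 stream it would print nothing at all; it is mainly useful for seeing which characters a given codec can handle.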

Answer 5

Answered by moeabdol

for i in range(0x3040, 0x30a0): print(chr(i), end=' ')  # Python 3 form; Python 2 would use unichr(i)

The line above prints the Hiragana block. The same approach, with the UTF-8 encoding mentioned above, works for Katakana and Kanji as well.

Keep in mind that the average Japanese reader uses around 2000-2500 Kanji characters. For Chinese, the figure is probably around 5000-6000.
