Original question: http://stackoverflow.com/questions/2262879/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Converting from ascii to utf-8 with Python
Asked by colriot
I have an XMPP bot written in Python. One of its plugins can execute OS commands and send the output to the user. As far as I know, the output has to be a unicode string to be sent over the XMPP protocol. So I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output, 'utf-8', 'ignore')
bot.send(xmpp.Message(mess.getFrom(), output))
But when Russian characters appear in the output, they are not converted correctly.
sys.getdefaultencoding() says that the default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in the Python console, I get
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1: ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
Accepted answer by John Machin
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
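To illustrate the point, here is a minimal sketch, written for Python 3 (an assumption; the question uses Python 2.5). It simulates command output as cp866 bytes, shows that decoding them as ASCII fails much like the traceback in the question, and then decodes with the code page before re-encoding as UTF-8. The sample string and the fallback value for console_encoding are made up for illustration.

```python
import sys

# Encoding Python associates with standard output. It can be None when
# stdout is not attached to a console, so fall back to 'ascii' (assumption).
console_encoding = sys.stdout.encoding or "ascii"

# Simulated raw output of a console command on a Russian Windows console.
raw = "Привет".encode("cp866")

# Decoding as ASCII fails because every byte here is above 0x7F.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the code page that produced the bytes gives real text,
# which can then be encoded as UTF-8 if bytes are needed.
text = raw.decode("cp866")
utf8_bytes = text.encode("utf-8")
```

In a real bot, the hard-coded "cp866" would be replaced by whatever encoding the console actually reports.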
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
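The startup check described above could look roughly like the sketch below, written for Python 3 (an assumption; the original uses os.popen on Python 2). The helper names parse_chcp_output and console_encoding are hypothetical, and stripping a trailing period is an extra guard for locales whose chcp message might end with one.

```python
import subprocess
import sys


def parse_chcp_output(result):
    # Hypothetical helper: take the last whitespace-separated token of the
    # chcp message (the code page number) and build a codec name from it.
    number = result.split()[-1].rstrip(".")
    return "cp" + number


def console_encoding():
    # Hypothetical helper: ask chcp on Windows, trust sys.stdout elsewhere.
    if sys.platform == "win32":
        completed = subprocess.run("chcp", shell=True, capture_output=True)
        # The text before the number is locale-dependent; the digits are
        # plain ASCII, so undecodable bytes can safely be ignored here.
        message = completed.stdout.decode("ascii", "ignore")
        return parse_chcp_output(message)
    # 'utf-8' is an assumed fallback when stdout has no encoding set.
    return sys.stdout.encoding or "utf-8"
```

The returned name can then be passed to bytes.decode() when reading command output.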
Answer by Douglas Leeder
sys.getdefaultencoding() returns Python's default encoding, which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is in, either manually or by using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
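As a rough sketch of how that value might be used, the Python 3 snippet below (an assumption; the question targets Python 2.5) decodes the raw output of a child process with the locale's preferred encoding. The child command is a made-up stand-in for a real OS command. On Windows the preferred encoding is the application code page, which may differ from the console code page, so treat this as a starting point only.

```python
import locale
import subprocess
import sys

# Preferred encoding according to the user's locale settings.
encoding = locale.getpreferredencoding(False)

# Stand-in for a real OS command: a child Python process printing text.
completed = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,
)

# Decode the raw bytes; errors='replace' avoids an exception if the guess
# about the encoding turns out to be wrong.
text = completed.stdout.decode(encoding, errors="replace")
```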
Answer by John Knoeller
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
Edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian characters were encoded, you will get the wrong result.
Answer by Mark Tolonen
In Python, 'cp855', 'cp866', 'cp1251', 'iso8859_5' and 'koi8_r' are different Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command shows the code page used by console commands. That won't necessarily be the same code page that Windows applications use. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.
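A small Python 3 sketch (an assumption; the question uses Python 2.5) can show why the choice matters. It encodes a made-up sample string with cp866, decodes the same bytes with each of the codecs listed above, and records which ones reproduce the original text.

```python
# Codec names taken from the answer above.
candidates = ["cp855", "cp866", "cp1251", "iso8859_5", "koi8_r"]

sample = "Привет"
raw = sample.encode("cp866")  # pretend this came from a cp866 console

matches = []
for name in candidates:
    # errors='replace' keeps the loop going even if a byte is undefined
    # in one of the code pages.
    decoded = raw.decode(name, errors="replace")
    if decoded == sample:
        matches.append(name)
```

Only the codec that matches the encoding of the original bytes should end up in matches; the others yield unrelated characters.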