Note: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/846850/

Time: 2020-09-15 12:27:01  Source: igfitidea

Read Unicode characters from command-line arguments in Python 2.x on Windows

Tags: python, windows, command-line, unicode, python-2.x

Asked by Craig McQueen

I want my Python script to be able to read Unicode command line arguments in Windows. But it appears that sys.argv is a string encoded in some local encoding, rather than Unicode. How can I read the command line in full Unicode?

Example code: argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC set up for Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

That's Shift-JIS encoded I believe, and it "works" for that filename. But it breaks for filenames with characters that aren't in the Shift-JIS character set—the final "open" call fails:

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'

Note: I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 gives sys.argv as proper Unicode. But it's a bit early yet to transition to Python 3.0 (due to lack of third-party library support).

Update:

A few answers have said I should decode according to whatever encoding sys.argv is in. The problem with that is that it's not full Unicode, so some characters are not representable.
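
To make that concrete, here is a minimal sketch (written to run on Python 2.7 or 3; the helper name survives_code_page is made up for illustration). It uses Python's cp932 codec as a stand-in for the Japanese code page. Note that Python's "replace" handler substitutes '?', whereas the Windows conversion in the session above produced a plain 'o'; either way the original character is gone before the script ever sees it, so no amount of decoding brings it back.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def survives_code_page(name, codec="cp932"):
    """Return True if `name` round-trips through `codec` unchanged."""
    try:
        return name.encode(codec).decode(codec) == name
    except UnicodeEncodeError:
        # At least one character has no representation in this code page.
        return False


ok_name = u"PC・ソフト申請書08.09.24.doc"
bad_name = u"Jörgen.txt"

print(survives_code_page(ok_name))   # True: every character exists in cp932
print(survives_code_page(bad_name))  # False: cp932 has no 'ö'

# A lossy conversion: the 'ö' is replaced and cannot be recovered afterwards.
lossy = bad_name.encode("cp932", "replace")
print(repr(lossy))
```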

Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some not in the system default code page. My Python script doesn't get the right Unicode filenames passed to it via sys.argv in all cases, when the characters aren't representable in the current code page encoding.

There is certainly some Windows API to read the command line with full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.

Accepted answer by Craig McQueen

Here is a solution that is just what I'm looking for, making a call to the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
    strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """

    from ctypes import POINTER, byref, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    # GetCommandLineW is a stdcall (WINAPI) function, so load it via windll.
    GetCommandLineW = windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply to do:

import sys
import win32_unicode_argv

and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
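
As a small illustration of the optparse point, the sketch below feeds optparse a list of Unicode strings directly, standing in for the patched sys.argv. The --output option and the file names are invented for the example; on Windows you would import win32_unicode_argv first and let parse_args() read sys.argv[1:] by itself.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import optparse

# Stand-in for sys.argv[1:] after win32_unicode_argv has replaced sys.argv.
unicode_args = [u"--output", u"Jörgen.txt", u"申請書.doc"]

parser = optparse.OptionParser()
parser.add_option("-o", "--output", dest="output", help="file to write")

# Passing the list explicitly keeps the example independent of the real argv.
options, positional = parser.parse_args(unicode_args)

print(repr(options.output))
print(repr(positional))
```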

Answer by monkut

Dealing with encodings is very confusing.

I believe that if you're inputting data via the command line, it will be encoded in whatever your system encoding is, not as unicode. (Even copy/paste should do this.)

So it should be correct to decode into unicode using the system encoding:

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running the following:

Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

will output:

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

Here the file "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I encoded the file as UTF-8 using Windows Notepad. I'm a little stumped as to why there's a '?' at the beginning when printing. Something to do with how Notepad saves UTF-8?)
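
The leading '?' is most likely the UTF-8 byte order mark (U+FEFF) that Windows Notepad writes at the start of files it saves as UTF-8: the plain utf-8 codec passes it through as a character, and that character has no equivalent in the console encoding. A minimal sketch of the difference, using Python's standard utf-8-sig codec, which strips the mark if it is present. The temporary file only simulates what Notepad writes, and io.open is used here so the snippet runs on both Python 2.7 and 3; the same codec name can be passed to codecs.open.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import codecs
import io
import os
import tempfile

# Simulate a file saved by Notepad as UTF-8: a BOM followed by the text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"日本語".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the decoded text.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig drops a leading BOM and otherwise behaves like utf-8.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```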

The string's decode method or the unicode() builtin can be used to convert an encoded string into unicode.

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

Also, if you're dealing with encoded files you may want to use the codecs.open() function in place of the built-in open(). It allows you to define the encoding of the file, and will then use the given encoding to transparently decode the content to unicode.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be in unicode.

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

If I'm misunderstanding something, please let me know.

If you haven't already, I recommend reading Joel's article on unicode and encoding: http://www.joelonsoftware.com/articles/Unicode.html

Answer by pts

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

Maybe you have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
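
Rather than reading the registry by hand, the active code page can also be queried at run time. A hedged sketch: GetACP is a real kernel32 function, but the helper names here are invented, and the non-Windows fallback exists only so the snippet runs anywhere. As far as I know, Python 2 receives its arguments in the ANSI code page (the one GetACP reports) rather than the OEM one, so that is the value tried here. Whichever code page it is, decoding still cannot recover characters that the code page lacks.

```python
from __future__ import print_function

import codecs
import locale
import sys


def codec_for_code_page(code_page):
    """Map a Windows code page number (e.g. 932) to a Python codec name."""
    name = "cp%d" % code_page
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


def guess_argv_encoding():
    """Best guess at the encoding of byte-string sys.argv entries."""
    if sys.platform == "win32":
        import ctypes
        # Assumption: arguments arrive in the ANSI code page, not the OEM one.
        return codec_for_code_page(ctypes.windll.kernel32.GetACP())
    # Fallback for other platforms, only so the sketch is runnable there.
    return locale.getpreferredencoding(False)


print(codec_for_code_page(932))
print(guess_argv_encoding())
```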

Answer by a paid nerd

The command line might be in Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]