Python: UnicodeEncodeError on joining file name

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2004137/

Date: 2020-11-03 23:32:22. Source: igfitidea

Tags: python, unicode, filenames

Asked by Hyman

Executing the following code raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)":

filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)

But the file is valid and exists on disk. The filename was extracted from the output of the "unzip -l" command. How can I join filenames like this?

OS and filesystem

Filesystem: ext3    relatime,errors=remount-ro 0       0
Locale: en_US.UTF-8

With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk with the filename it joined.

filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print os.path.isfile(filepath)
>> False

new_filepath = filepath.encode('Latin-1').encode('utf-8')
print new_filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print type(filepath)
>> <type 'unicode'>
print os.path.isfile(new_filepath)
>> False

valid_filepath = glob.glob('/dirname/*.ttf')[0]
print valid_filepath
>> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
print type(valid_filepath)
>> <type 'str'>
print os.path.isfile(valid_filepath)
>> True

Accepted answer by Alex Martelli

In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 would be a capital A with a circumflex accent... which doesn't seem to be anywhere in the code you show! Can you please add a

print repr(filename)

before the os.path.join call (and also put the '/dirname' in a variable and print its repr for completeness)? I'm thinking that maybe that stray character is there but you're not seeing it for some reason -- the repr will reveal it.

If you do have a Latin-1 (or Win-1252) non-ASCII character in your filename, you have to use Unicode -- and/or, depending on your OS and filesystem, some specific encoding thereof.

Edit: the OP confirms, thanks to repr, that there are actually two bytes that can't possibly be ASCII -- 0xc2 then 0x88, corresponding to what the OP thinks is one lowercase L. Well, in the justly popular UTF-8 encoding that sequence would be the single Unicode code point U+0088, which is a C1 control character rather than a printable letter (a capital A with circumflex, U+00C2, would be encoded as 0xc3 0x82). How that could look like a lowercase L to the OP beggars explanation, but I imagine some fonts could be graphically crazy enough to allow such confusion.
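
To see what the two bytes stand for, one can decode them directly. This is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, where the literal would be a plain str); the variable names are illustrative only.

```python
import unicodedata

# The byte string reported by repr() in the question.
raw = b'Sp\xc2\x88ywaj.ttf'

# Decoding as UTF-8 turns the two bytes 0xc2 0x88 into one code point.
text = raw.decode('utf-8')
stray = text[2]

print(len(raw), len(text))          # 12 11
print(hex(ord(stray)))              # 0x88
print(unicodedata.category(stray))  # Cc (a control character)

# For comparison, a capital A with circumflex is U+00C2.
print('\u00c2'.encode('utf-8'))     # b'\xc3\x82'
```

A control character has no glyph of its own, which may be why the name displays as if the character were missing.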

So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the filesystem and OS), the next attempt is to try using that Unicode object's .encode('Latin-1') and .encode('utf-8'). If none of the encodings work, information on the OS and filesystem in use, which the OP, I believe, hasn't given yet, becomes crucial.
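
The "try each encoding in turn" idea can be sketched as a small helper. This is only an illustration in Python 3 syntax, not code from the answer: first_existing_path and its exists parameter are hypothetical names, and the fake directory listing in the usage part is an assumption made so the example runs without touching the disk.

```python
import os


def first_existing_path(dirname, name, encodings=('utf-8', 'latin-1'),
                        exists=os.path.isfile):
    """Return (encoding, bytes_path) for the first encoding whose
    encoded file name exists, or None if none of them does."""
    raw_dir = os.fsencode(dirname)
    for encoding in encodings:
        try:
            raw_name = name.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the name
        candidate = os.path.join(raw_dir, raw_name)
        if exists(candidate):
            return encoding, candidate
    return None


# Usage with a pretend directory listing (assumption: the name on disk
# holds the single byte 0x88).
decoded = b'Sp\xc2\x88ywaj.ttf'.decode('utf-8')
on_disk = {os.path.join(b'/dirname', b'Sp\x88ywaj.ttf')}
result = first_existing_path('/dirname', decoded,
                             exists=lambda path: path in on_disk)
print(result[0] if result else None)
```

Each encoding is applied separately to the same decoded name; the encodings are never chained.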

Answer by Don Grem

I fixed the UnicodeDecodeError by adding these lines to /etc/apache2/envvars and restarting Apache.

export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'

as described here: https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/modwsgi/#if-you-get-a-unicodeencodeerror
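
Whether the locale settings actually reached the Python process can be checked from inside it. The sketch below uses Python 3 syntax; describe_encodings is a hypothetical helper name, and under a web server the values would more likely be written to a log than printed.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings the running interpreter has picked up."""
    return {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(describe_encodings().items()):
    print(name, value)
```

If these report an ASCII-only encoding, the environment variables probably did not take effect for that process.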

I have spent some time debugging this.

Answer by YOU

filename = filename.decode('utf-8').encode("latin-1")

works for me with the file from Splywaj.zip:

>>> os.path.isfile(filename.decode("utf8").encode("latin-1"))
True
>>>
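
One possible reason this works, offered as an assumption that the thread does not confirm: the name on disk contains the single byte 0x88, and the listing the name was taken from had treated that byte as Latin-1 and written it out as UTF-8. The sketch below (Python 3 syntax, illustrative variable names) shows the round trip. The last line is further speculation: in code page 852 the byte 0x88 is a lowercase L with stroke.

```python
# Assumption: the name on disk contains the single byte 0x88.
on_disk = b'Sp\x88ywaj.ttf'

# Reading that byte as Latin-1 and writing it out as UTF-8 produces the
# two-byte sequence shown in the question.
listed = on_disk.decode('latin-1').encode('utf-8')
print(listed)                    # b'Sp\xc2\x88ywaj.ttf'

# The one-liner from this answer reverses that step.
recovered = listed.decode('utf-8').encode('latin-1')
print(recovered == on_disk)      # True

# Speculation: interpreted as code page 852, the byte 0x88 is U+0142.
print(ascii(on_disk.decode('cp852')))   # 'Sp\u0142ywaj.ttf'
```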

Answer by John Machin

Evidence problem 1

Executing the following code raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)":

filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)

I can't see how it is possible to get that exception -- both args of os.path.join are str objects. There is no reason to try converting anything to unicode. Are you sure that the above code is exactly what you ran?
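
For context, Python 2 decodes a str implicitly with the ASCII codec whenever it is combined with a unicode object, so an exception like the quoted one would need a unicode value somewhere in the operation. The sketch below uses Python 3 syntax, where that decoding step has to be written out explicitly; raw_name is an illustrative name.

```python
import os

raw_name = b'Sp\xc2\x88ywaj.ttf'

# Decoding the byte string as ASCII fails at index 2, the first
# non-ASCII byte, with the same message as in the question.
try:
    raw_name.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc
    print(exc)

# Python 3 refuses to mix text and bytes in one path instead of
# decoding implicitly.
try:
    os.path.join('/dirname', raw_name)
except TypeError as exc:
    mix_error = exc
    print(type(exc).__name__)   # TypeError
```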

Evidence problem 2

With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk with the filename it joined.

filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'

Sorry, assuming that filename has not changed from the previous snippet, that's definitely impossible. It looks like the result of os.path.join('/dirname', repr(filename))... please ensure that you publish the code that you actually ran, together with actual output (and actual traceback, if any).

Confusion

new_filepath = filepath.encode('Latin-1').encode('utf-8')

Alex meant to try twice, each time with one of those encodings -- not try once with both encodings! As all the characters in filepath were in the ASCII range (see evidence problem 2), the effect was simply filepath.encode('ascii').
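
The effect of the chained call can be imitated in Python 3, where bytes objects have no encode method. In Python 2, calling encode on a str first decodes it with the ASCII codec; the hypothetical helper py2_style_encode below spells that step out. It is an emulation for illustration, not the code that was run.

```python
def py2_style_encode(byte_string, encoding):
    """Mimic Python 2's str.encode(): decode as ASCII, then encode."""
    return byte_string.decode('ascii').encode(encoding)


# The path as printed in the question: every character is ASCII.
filepath = "/dirname/u'Sp\\xc2\\x88ywaj.ttf'"

chained = py2_style_encode(filepath.encode('latin-1'), 'utf-8')
print(chained == filepath.encode('ascii'))   # True

# With a genuinely non-ASCII name the hidden ASCII decode fails instead.
try:
    py2_style_encode('Sp\x88ywaj.ttf'.encode('latin-1'), 'utf-8')
except UnicodeDecodeError as exc:
    chained_error = exc
    print(type(exc).__name__)                # UnicodeDecodeError
```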

Simple solution

You know how to find the name of the file that you are interested in:

valid_filepath = glob.glob('/dirname/*.ttf')[0]

If you must hard-code that name in your script, you can use the repr() function to get the representation that you can type into your script without worrying about utf8, unicode, encode, decode and all that noise:

print repr(valid_filepath)

Let's suppose that it prints '/dirname/Sp\xc2\x88ywaj.ttf'... then all you need to do is carefully copy that and paste it into your script:

file_path = '/dirname/Sp\xc2\x88ywaj.ttf'
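
In Python 3 a comparable approach is to stay with byte strings end to end: glob returns bytes paths when it is given a bytes pattern. The sketch below creates a stand-in file in a temporary directory so that it can run anywhere; the file name is taken from the question and everything else is an assumption for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    raw_dir = os.fsencode(dirname)

    # Stand-in for the real font file, created under a bytes name.
    created = os.path.join(raw_dir, b'Sp\xc2\x88ywaj.ttf')
    open(created, 'wb').close()

    # A bytes pattern makes glob return bytes paths, untouched by any
    # decoding step.
    matches = glob.glob(os.path.join(glob.escape(raw_dir), b'*.ttf'))
    found = matches[0]
    print(repr(os.path.basename(found)))   # b'Sp\xc2\x88ywaj.ttf'

    found_is_file = os.path.isfile(found)
    print(found_is_file)                   # True
```

The printed repr can be pasted into a script as a bytes literal in the same way as the str literal above.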