UnicodeEncodeError on joining file name
Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/2004137/
Asked by Hyman
It throws "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)" when executing the following code:
filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)
But the file is valid and exists on disk. The filename was extracted from the output of the "unzip -l" command. How can I join filenames like this?
OS and filesystem
Filesystem: ext3 relatime,errors=remount-ro 0 0
Locale: en_US.UTF-8
With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk with the filename it joined.
filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print os.path.isfile(filepath)
>> False
new_filepath = filepath.encode('Latin-1').encode('utf-8')
print new_filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print type(filepath)
>> <type 'unicode'>
print os.path.isfile(new_filepath)
>> False
valid_filepath = glob.glob('/dirname/*.ttf')[0]
print valid_filepath
>> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
print type(valid_filepath)
>> <type 'str'>
print os.path.isfile(valid_filepath)
>> True
Accepted answer by Alex Martelli
In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 would be a capital A with a circumflex accent... which doesn't seem to be anywhere in the code you show! Can you please add a print repr(filename) before the os.path.join call (and also put the '/dirname' in a variable and print its repr, for completeness)? I'm thinking that maybe that stray character is there but you're not seeing it for some reason -- the repr will reveal it.
If you do have a Latin-1 (or Win-1252) non-ASCII character in your filename, you have to use Unicode -- and/or, depending on your OS and filesystem, some specific encoding thereof.
Edit: the OP confirms, thanks to repr, that there are actually two bytes that can't possibly be ASCII -- 0xc2 then 0x88, corresponding to what the OP thinks is one lowercase L.
Well, in the justly popular UTF-8 encoding that sequence would be the single Unicode code point 0x88, which is a C1 control character rather than a printable letter (the capital A with circumflex only appears if the bytes are read as Latin-1). How that could look like a lowercase L to the OP beggars explanation, but I imagine some fonts could be graphically crazy enough to afford such confusion.
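A quick way to check how those two bytes decode is to inspect them directly. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the variable names are chosen only for illustration.

```python
import unicodedata

# The bytes shown by repr(filename) in the question.
raw = b'Sp\xc2\x88ywaj.ttf'

# Read as UTF-8, the pair 0xc2 0x88 collapses into one code point.
text = raw.decode('utf-8')
stray = text[2]
code_point = ord(stray)                    # 0x88
category = unicodedata.category(stray)     # 'Cc' means a control character

# Read as Latin-1, every byte becomes its own character instead.
as_latin1 = raw.decode('latin-1')
first_name = unicodedata.name(as_latin1[2])

print(hex(code_point), category, first_name)
```

Under the UTF-8 reading the name has 11 characters; under the Latin-1 reading it has 12, starting the stray pair with the capital A with circumflex.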
So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the filesystem and OS), the next attempt is to try that Unicode object's .encode('Latin-1') and then, separately, its .encode('utf-8'). If none of the encodings work, information on the OS and filesystem in use, which the OP, I believe, hasn't given yet, becomes crucial.
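The "decode once, then try each encoding on its own" approach can be sketched as a small helper. This is only an illustration in Python 3 syntax, not code from the answer: the function name find_existing_path and its parameters are hypothetical, and it assumes the raw name really is valid UTF-8.

```python
import os


def find_existing_path(dirname, raw_name, encodings=('utf-8', 'latin-1')):
    """Return the first candidate path (as bytes) that exists, else None.

    raw_name is the filename as bytes; it is decoded once as UTF-8 and
    then re-encoded with each candidate encoding separately.
    """
    text = raw_name.decode('utf-8')
    for encoding in encodings:
        try:
            candidate = text.encode(encoding)
        except UnicodeEncodeError:
            # This encoding cannot represent the name at all; skip it.
            continue
        path = os.path.join(os.fsencode(dirname), candidate)
        if os.path.isfile(path):
            return path
    return None
```

Each encoding is tried independently, so a name that exists on disk under either byte form is found, and None signals that more information about the OS and filesystem is needed.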
Answer by Don Grem
I have fixed the UnicodeDecodeError by adding these lines to /etc/apache2/envvars and restarting Apache.
export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'
as described here: https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/modwsgi/#if-you-get-a-unicodeencodeerror
I have spent some time debugging this.
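To confirm what the running process actually sees after such a change, one could print the relevant settings from inside it. This is a minimal sketch in Python 3 syntax; the variable names are illustrative, and which of these values a given Python version consults for filenames is an assumption worth checking for your setup.

```python
import locale
import os
import sys

# Encoding Python uses to convert between str filenames and OS bytes.
fs_encoding = sys.getfilesystemencoding()

# Encoding derived from the locale, without changing the current locale.
preferred = locale.getpreferredencoding(False)

# The environment variables set in the envvars file, if they are visible.
lang = os.environ.get('LANG')
lc_all = os.environ.get('LC_ALL')

print('filesystem encoding:', fs_encoding)
print('preferred encoding:', preferred)
print('LANG:', lang, 'LC_ALL:', lc_all)
```

Running this inside the web application rather than in an interactive shell matters, since the two may not share the same environment.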
Answer by YOU
filename = filename.decode('utf-8').encode("latin-1")
works for me with the file from Splywaj.zip
>>> os.path.isfile(filename.decode("utf8").encode("latin-1"))
True
>>>
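What that conversion does to the bytes can be seen without touching the disk. A small sketch in Python 3 syntax, with illustrative variable names:

```python
# Bytes as reported by repr(filename) in the question.
raw = b'Sp\xc2\x88ywaj.ttf'

# Decode the UTF-8 pair into one code point, then store it as one Latin-1 byte.
latin1_name = raw.decode('utf-8').encode('latin-1')

print(repr(latin1_name))
```

The two-byte sequence 0xc2 0x88 becomes the single byte 0x88, which suggests that, on that answerer's system, the file extracted from the zip was stored under the one-byte form of the name.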
Answer by John Machin
Evidence problem 1
It throws "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)" when executing the following code:
filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)
I can't see how it is possible to get that exception -- both args of os.path.join are str objects. There is no reason to try converting anything to unicode. Are you sure that the above code is exactly what you ran?
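For reference, the reported message is exactly what an ASCII decode of those bytes produces. In Python 2 such a decode happens implicitly when a byte str is combined with a unicode object, so one possibility (not confirmed by the question) is that one of the arguments was actually unicode. The sketch below makes the decode explicit in Python 3 syntax; the variable names are illustrative.

```python
# Bytes as reported by repr(filename) in the question.
raw = b'Sp\xc2\x88ywaj.ttf'

try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    # Byte 0xc2 sits at index 2, after 'S' and 'p'.
    message = str(exc)

print(message)
```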
Evidence problem 2
With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk with the filename it joined.
filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
Sorry, assuming that filename has not changed from the previous snippet, that's definitely impossible. It looks like the result of os.path.join('/dirname', repr(filename)) ... please ensure that you publish the code that you actually ran, together with the actual output (and actual traceback, if any).
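The difference between joining the name and joining its repr can be reproduced in a few lines. This is a sketch in Python 3 syntax, where ascii() plays roughly the role of Python 2's repr() (without the u prefix); posixpath is used so the result does not depend on the platform, and the variable names are illustrative.

```python
import posixpath

# A text string holding the two characters 0xc2 and 0x88, as in the question's output.
filename = 'Sp\xc2\x88ywaj.ttf'

# Joining the name itself keeps the real characters in the path.
correct = posixpath.join('/dirname', filename)

# Joining its escaped representation embeds quotes and backslash escapes.
suspicious = posixpath.join('/dirname', ascii(filename))

print(suspicious)
```

The second path contains literal quote characters and the text \xc2\x88, which matches the shape of the output shown in the question and names no real file.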
Confusion
new_filepath = filepath.encode('Latin-1').encode('utf-8')
Alex meant to try twice, each time with one of those encodings -- not try once with both encodings! As all the characters in filepath were in the ASCII range (see evidence problem 2), the effect was simply that of filepath.encode('ascii'): in Python 2 the second .encode runs on a str, which is first implicitly decoded as ASCII.
Simple solution
You know how to find the name of the file that you are interested in:
valid_filepath = glob.glob('/dirname/*.ttf')[0]
If you must hard-code that name in your script, you can use the repr() function to get the representation that you can type into your script without worrying about utf8, unicode, encode, decode and all that noise:
print repr(valid_filepath)
Let's suppose that it prints '/dirname/Sp\xc2\x88ywaj.ttf' ... then all you need to do is carefully copy that and paste it into your script:
file_path = '/dirname/Sp\xc2\x88ywaj.ttf'
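The same glob-then-repr workflow can be tried end to end in a scratch directory. This is a sketch in Python 3 syntax, where a bytes pattern makes glob return the names exactly as stored on disk; the temporary directory stands in for /dirname, and the file created here is only a stand-in for the real font.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    dir_bytes = os.fsencode(tmpdir)

    # Create an empty stand-in file with the byte name from the question.
    open(os.path.join(dir_bytes, b'Sp\xc2\x88ywaj.ttf'), 'wb').close()

    # A bytes pattern yields bytes results, with no decoding involved.
    pattern = os.path.join(glob.escape(dir_bytes), b'*.ttf')
    found = glob.glob(pattern)

    # repr() gives a literal that can be pasted into a script.
    literal = repr(found[0])
    exists = os.path.isfile(found[0])

print(literal, exists)
```

Because the path comes straight from the filesystem, it passes os.path.isfile without any guessing about encodings.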