Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/27366479/
Asked by Collin Anderson
This code:
import os

for root, dirs, files in os.walk('.'):
    print(root)
Gives me this error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed
How do I walk through a file tree without getting toxic strings like this?
Accepted answer by Mark Tolonen
On Linux, filenames are 'just a bunch of bytes' and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings. To do so, the developers came up with a scheme (the surrogateescape error handler, PEP 383) that translates byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each 'bad' byte is represented by a lone surrogate code point in the range U+DC80 to U+DCFF, but the normal UTF-8 encoder refuses to encode lone surrogates, so printing such a string to the terminal fails.
For example, here is a byte string that is not valid UTF-8, decoded with the surrogateescape error handler:
>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'
It can be converted to and from Unicode without loss:
>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'
But it can't be printed:
>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
You'll have to decide what you want to do with file names that are not in the default encoding. One option is to encode them back to the original bytes and then decode those bytes with the 'replace' error handler, which substitutes the Unicode replacement character (U+FFFD) for each undecodable byte. Use the result for display only, and keep the original name to access the file.
>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
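A minimal sketch of that display-only approach, assuming the names came from os.walk() or os.listdir() called with a str argument, so that undecodable bytes arrive as lone surrogates. The helper name display_name is hypothetical and not part of the os module, and UTF-8 is assumed as the display encoding.

```python
def display_name(name: str, encoding: str = "utf-8") -> str:
    """Return a printable version of a surrogate-escaped file name.

    Hypothetical helper for illustration: keep using the original `name`
    to open the file; the returned string is meant for display only.
    """
    # Recover the raw bytes exactly as the OS reported them.
    raw = name.encode(encoding, "surrogateescape")
    # Decode again, substituting U+FFFD for every byte that is not valid.
    return raw.decode(encoding, "replace")


# 'C\udcc3N' is how Python represents the bytes b'C\xc3N' as a str.
toxic = b"C\xc3N".decode("utf-8", "surrogateescape")
safe = display_name(toxic)
print(ascii(safe))  # 'C\ufffdN'
```

The round trip through bytes is what makes this safe: the lone surrogate never reaches the strict UTF-8 encoder used by print().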
os.walk can also take a byte string and will return byte strings instead of Unicode strings:
for p, d, f in os.walk(b'.'):
    print(p)  # p, and every name in d and f, is a bytes object
Then you can decode as you like.
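As a rough sketch of "decode as you like", the generator below walks a tree with byte paths and pairs each raw path with a printable string. The name walk_printable is hypothetical, and decoding as UTF-8 with 'replace' is an assumption; any other encoding or error handler could be used instead.

```python
import os


def walk_printable(top: bytes = b"."):
    """Yield (raw_path, printable_path) for every file under `top`.

    Hypothetical helper: `raw_path` is bytes and can be passed to open();
    `printable_path` is a str meant only for display.
    """
    for root, dirs, files in os.walk(top):
        for name in files:
            raw_path = os.path.join(root, name)  # bytes in, bytes out
            # Assumption: show names as UTF-8, marking bad bytes with U+FFFD.
            yield raw_path, raw_path.decode("utf-8", "replace")


for raw_path, printable_path in walk_printable(b"."):
    print(printable_path)
```

Keeping the bytes path next to the decoded one means the display string never has to be converted back to reach the file.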
Answer by Collin Anderson
I ended up passing a byte string to os.walk(), which apparently returns byte strings instead of incorrect Unicode strings:
for root, dirs, files in os.walk(b'.'):
    print(root)
Answer by Walker Hale IV
Filter with sed or grep:
set | sed -n '/^[a-zA-Z0-9_]*=/p'
# ... or ...
set | grep '^[a-zA-Z0-9_]*='
# ... or ...
set | egrep '^[_[:alnum:]]+='
This is sensitive to how crazy your variable names are. The last version should handle most crazy things.
Answer by Rohan Goel
Try using this line of code:
"bad string".encode('utf-8', 'replace').decode()
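To see what this one-liner does to a surrogate-escaped name, here is a small sketch that reuses the byte string from the accepted answer. With the 'replace' handler, the encoder writes an ASCII question mark for each lone surrogate, so the result is printable but the original bytes can no longer be recovered from it.

```python
# A name containing the undecodable byte 0xC3, as os.walk('.') would return it.
bad = b"C\xc3N".decode("utf-8", "surrogateescape")

# Encoding with 'replace' turns each lone surrogate into b'?'.
cleaned = bad.encode("utf-8", "replace").decode()
print(cleaned)  # C?N
```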