Python 3: os.walk() file path UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

Question

Asked by Collin Anderson

This code:

import os

for root, dirs, files in os.walk('.'):
    print(root)

Gives me this error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed

How do I walk through a file tree without getting toxic strings like this?

Answer 1

Accepted answer by Mark Tolonen

On Linux, filenames are 'just a bunch of bytes' and are not necessarily in any particular encoding. Python 3 tries to turn everything into Unicode strings, so its developers came up with a scheme (the 'surrogateescape' error handler) to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each 'bad' byte is mapped to a lone surrogate code point (for example, byte 0xC3 becomes U+DCC3), but the normal UTF-8 encoder refuses to encode lone surrogates, so printing such a string to the terminal fails.

For example, here's a byte string that is not valid UTF-8, decoded with surrogateescape:

>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'

It can be converted to and from Unicode without loss:

>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'

But it can't be printed:

>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed

You'll have to decide what to do with file names that are not in the default encoding. One option is to encode them back to the original bytes and decode those with the 'replace' error handler, which substitutes a replacement character for each undecodable byte. Use the result for display only, and keep the original name to access the file.

>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
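
The "display but keep the original name" idea can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper names display_name and walk_for_display are hypothetical, and it assumes that UTF-8 with replacement characters is an acceptable display form.

```python
import os

def display_name(name: str) -> str:
    """Return a printable form of a name returned by os.walk (hypothetical helper)."""
    raw = os.fsencode(name)                # recover the original bytes
    return raw.decode('utf-8', 'replace')  # undecodable bytes become U+FFFD

def walk_for_display(top: str = '.'):
    """Yield (real_path, printable_path) pairs for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            real = os.path.join(root, name)   # use this to open the file
            yield real, display_name(real)    # use this only for output
```

Only the second element of each pair is safe to print; the first may still contain surrogates and should be passed unchanged to open() and friends.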

os.walk can also take a byte string, and will then return byte strings instead of Unicode strings:

for p, d, f in os.walk(b'.'):
    print(p)  # p, and every name in d and f, is a bytes object

Then you can decode as you like.
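
One possible way to "decode as you like" is to try a list of candidate encodings in order. The sketch below is only an assumption about what a reasonable policy might look like: decode_name and walk_decoded are hypothetical names, and the candidate list ('utf-8', then 'latin-1') is an arbitrary example, not something the answer prescribes.

```python
import os

def decode_name(raw: bytes, encodings=('utf-8', 'latin-1')) -> str:
    """Decode raw with the first encoding that succeeds (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to a lossy but printable form.
    return raw.decode('utf-8', 'replace')

def walk_decoded(top: bytes = b'.'):
    """Yield (bytes_path, decoded_file_name) for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name), decode_name(name)
```

Because latin-1 maps every byte to a character, the final fallback is reached only when a stricter candidate list is passed in. The bytes path is the one to hand back to the operating system.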

Answer 2

Answered by Collin Anderson

I ended up passing a byte string to os.walk(), which will apparently return byte strings instead of incorrect Unicode strings:

for root, dirs, files in os.walk(b'.'):
    print(root)

Answer 3

Answered by Walker Hale IV

Filter with sed or grep:

set | sed -n '/^[a-zA-Z0-9_]*=/p'
# ... or ...
set | grep '^[a-zA-Z0-9_]*='
# ... or ...
set | egrep '^[_[:alnum:]]+='

This is sensitive to how crazy your variable names are. The last version should handle most crazy things.

Answer 4

Answered by Rohan Goel

Try using this line of code:

"bad string".encode('utf-8', 'replace').decode()
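
To make the effect of that line concrete, here is a short sketch applied to the surrogate-containing string from the accepted answer. The variable names are illustrative, and the 'backslashreplace' variant is an addition for comparison, not part of this answer.

```python
name = 'C\udcc3N'  # what os.walk returns for the bytes b'C\xc3N'

# 'replace' turns each unencodable surrogate into '?', so the result prints
# safely but no longer identifies the original file.
lossy = name.encode('utf-8', 'replace').decode()

# 'backslashreplace' keeps the information visible as an escape sequence.
escaped = name.encode('utf-8', 'backslashreplace').decode()

print(lossy)    # C?N
print(escaped)  # C\udcc3N
```

Either result is for display only; the original string is still needed to open the file.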
