Python: remove every non-UTF-8 symbol from a string

Question

Asked by Darth Kotik

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data into MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't get it. Is there a simple way to do this?

UPD: It seems that Python and Mongo don't agree on the definition of a valid UTF-8 string.

Answer 1

Accepted answer by Irshad Bhat

Try the line below instead of the last two lines. Hope it helps:

line=line.decode('utf-8','ignore').encode("utf-8")
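
The line above relies on Python 2, where a str read from a file is a byte string and has a decode method. The following is a minimal Python 3 sketch of the same idea, assuming the file is opened in binary mode so that each line arrives as bytes. The helper names clean_utf8 and read_clean_lines are hypothetical and are not part of the original answer.

```python
def clean_utf8(raw):
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


def read_clean_lines(path):
    """Yield each line of the file at `path` as a cleaned str (hypothetical helper)."""
    # Binary mode: Python hands us raw bytes and does not decode on our behalf.
    with open(path, "rb") as fp:
        for raw_line in fp:
            yield clean_utf8(raw_line.strip())


# The stray byte 0xff is dropped; the valid two-byte sequence for 'é' is kept.
print(clean_utf8(b"abc\xffdef"))
print(clean_utf8(b"caf\xc3\xa9"))
```

An alternative with similar effect is to open the file in text mode with open(path, encoding="utf-8", errors="ignore"), which lets Python apply the same error handling while reading.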

Answer 2

Answer by Shafiq

An example of handling non-UTF-8 characters:

import string

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"

# Keep only characters found in string.printable (printable ASCII and ASCII whitespace).
print(''.join(x for x in test if x in string.printable))
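
Note that this filter is stricter than removing invalid UTF-8: string.printable contains only ASCII characters, so valid non-ASCII text such as '£' or 'é' is removed as well. A short Python 3 sketch of that behaviour follows; the function name keep_printable_ascii and the sample string are illustrative assumptions.

```python
import string


def keep_printable_ascii(text):
    """Return `text` with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in string.printable)


# '\xa3' (pound sign), '\xa0' (no-break space), '\x03' (control character)
# and 'é' are all outside string.printable, so all four are dropped.
sample = "price \xa325\xa0ok\x03 caf\u00e9"
print(keep_printable_ascii(sample))  # price 25ok caf
```

This may be acceptable when the data is expected to be plain ASCII, but it loses information for any text in other languages.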

Answer 3

Answer by AlexG

For Python 3, as mentioned in a comment in this thread, you can do:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

The 'ignore' parameter prevents an error from being raised if any bytes cannot be decoded; those bytes are dropped instead.

If your line is already a bytes object (e.g. b'my string'), then you just need to decode it with decode('utf-8', 'ignore').
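
The two cases can be sketched as follows. This is an illustration under stated assumptions, not the answer's own code: the sample values are made up, and the second case passes 'ignore' to the encode step because encoding a str that contains a lone surrogate with the default error handling raises UnicodeEncodeError.

```python
# Case 1: the data is already bytes, so a single decode is enough.
raw = b"valid \xe2\x86\x90 arrow, stray byte \xff here"
text_from_bytes = raw.decode("utf-8", "ignore")
# The valid three-byte sequence decodes to U+2190; the stray 0xff is dropped.
print(text_from_bytes)

# Case 2: the data is a str holding a lone surrogate (for example one produced
# by the surrogateescape error handler). Such a code point cannot be encoded
# as UTF-8, so 'ignore' on the encode step drops it.
text = "abc\udcffdef"
cleaned = text.encode("utf-8", "ignore").decode("utf-8")
print(cleaned)
```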

Answer 4

Answer by Willem

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        # Treat the input as Windows-1252 and re-encode it as UTF-8 (Python 2).
        line = line.decode('cp1252').encode('utf-8')
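
This answer gives no explanation; it appears to assume that the files are actually encoded as Windows-1252 rather than UTF-8, in which case the bytes are converted instead of being discarded. A minimal Python 3 sketch of that conversion, with made-up sample data:

```python
# Byte 0xa3 is the pound sign in Windows-1252 but is not valid UTF-8 on its own.
raw = b"filler text \xa325 more"

text = raw.decode("cp1252")        # interpret the bytes as Windows-1252
utf8_bytes = text.encode("utf-8")  # re-encode the same text as UTF-8

print(text)
print(utf8_bytes)
```

A few byte values (such as 0x81) are undefined in Python's cp1252 codec and still raise UnicodeDecodeError, so an error handler such as 'replace' or 'ignore' may be needed for arbitrary input.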
