Python: delete every non-UTF-8 symbol from a string

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26541968/

Date: 2020-08-19 00:37:47  Source: igfitidea

Delete every non-UTF-8 symbol from a string

Tags: python, mongodb, encode

Asked by Darth Kotik

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't get it. Is there some simple way to do it?


UPD: it seems that Python and Mongo don't agree on the definition of a valid UTF-8 string.

Accepted answer by Irshad Bhat

Try the line below in place of the last two lines. Hope it helps:

line=line.decode('utf-8','ignore').encode("utf-8")
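
On Python 3, where `str` has no `decode` method, the same round trip has to start from bytes. The following is a minimal sketch of that idea, assuming the data is available as raw bytes (for example from a file opened in binary mode); `clean_utf8` is a hypothetical helper name, not part of the original answer.

```python
def clean_utf8(raw: bytes) -> bytes:
    """Drop byte sequences that are not valid UTF-8 and return clean UTF-8 bytes."""
    # errors="ignore" silently skips undecodable bytes instead of raising.
    text = raw.decode("utf-8", errors="ignore")
    return text.encode("utf-8")


# b"\xff" is never valid in UTF-8; b"\xe2\x86\x90" is the valid encoding of U+2190.
sample = b"abc\xffdef\xe2\x86\x90"
cleaned = clean_utf8(sample)
print(cleaned)  # b'abcdef\xe2\x86\x90'
```

Note that the valid three-byte sequence survives unchanged; only the stray `\xff` byte is removed.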

Answer by Shafiq

An example of handling non-UTF-8 characters:

import string

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"

# Keep only printable ASCII characters; this also drops non-ASCII text such as \xa3 and \xa0.
print(''.join(x for x in test if x in string.printable))
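
Filtering on `string.printable` keeps ASCII only, so legitimate characters such as the pound sign are lost as well. If the goal is only to remove control characters like `\x03`, one possible alternative is sketched below. It assumes Python 3 and uses the standard `unicodedata` module; `strip_control_chars` is a hypothetical name introduced for illustration.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Remove control characters (Unicode category Cc) except newline and tab."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


sample = "filler \xa325 more\x03 filler.\n"
result = strip_control_chars(sample)
print(repr(result))
```

Here the pound sign (`\xa3`) and the trailing newline are kept, while `\x03` is removed.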

Answer by AlexG

For Python 3, as mentioned in a comment in this thread, you can do:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

The 'ignore' argument prevents an error from being raised when some bytes cannot be decoded; those bytes are dropped silently.

If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
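
When the lines come from a file, the same error handling can be requested from `open` itself on Python 3, so each line arrives already decoded. This is a small sketch under that assumption; the temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Write a small file containing one invalid byte (0xff) for demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good line\nbad \xff byte\n")
    path = tmp.name

# errors="ignore" makes the text layer drop undecodable bytes while reading.
with open(path, encoding="utf-8", errors="ignore") as fp:
    lines = [line.strip() for line in fp]

os.remove(path)
print(lines)  # ['good line', 'bad  byte']
```

The second line keeps both spaces that surrounded the dropped byte.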


Answer by Willem

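# Binary mode so that each line is bytes (needed for .decode on Python 3).
with open(fname, "rb") as fp:
    for line in fp:
        line = line.strip()
        # Treat the input as Windows-1252 and transcode it to UTF-8.
        line = line.decode('cp1252').encode('utf-8')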
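
This answer transcodes rather than deletes: it assumes the files are really Windows-1252. If the input is a mix, one possible approach is to try strict UTF-8 first and fall back to cp1252 only on failure. The sketch below assumes Python 3 and bytes input; `to_utf8` is a hypothetical helper, and the fallback encoding is an assumption about the data.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return UTF-8 bytes, treating input that is not valid UTF-8 as cp1252."""
    try:
        text = raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in cp1252 become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


converted = to_utf8(b"caf\xe9")  # 0xe9 is "é" in cp1252 but invalid UTF-8 here
print(converted)  # b'caf\xc3\xa9'
```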