Source: http://stackoverflow.com/questions/1911548/
This page reproduces a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must attribute it to the original authors on StackOverflow.
How to completely sanitize a string of illegal characters in python?
Asked by priestc
My program has a feature where the user can upload a CSV file, which the program parses and uses as input. One user reports that his input throws an error. The error is caused by an illegal character that is encoded incorrectly. The character is below:
?
Sometimes it appears as a diamond with a "?" in the middle, sometimes it appears as a double diamond with "?" in the middle, sometimes it appears as "\xa0", and sometimes it appears as "\xa0\xa0".
In my program if I do:
print str_with_weird_char
The string will show up in my terminal with the diamond "?" in place of the weird character. If I copy+paste that string into ipython, it will exit with this message:
In [1]: g="blah??blah"
WARNING:
********
You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!
Notice how the diamond "?" is doubled now. For some reason, copy+paste doubles it.
In the django traceback page, it looks like this:
UnicodeDecodeError at /chris/import.html
('ascii', 'blah \xa0 BLAH', 14, 15, 'ordinal not in range(128)')
What trips me up is that I can't do anything with this string without it throwing an exception. I tried unicode(), str(), .encode(), and .encode("utf-8"); every one of them throws an error.
What can I do to turn this into a working string?
Answered by YOU
You can pass "ignore" as the error handler to .encode/.decode to skip invalid characters, for example "ILLEGAL".decode("utf8", "ignore"):
>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte
>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>
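The session above is Python 2. As a minimal sketch of the same idea, assuming Python 3 (where the raw input is a bytes object and the error handler is passed via the errors argument), the snippet below compares "ignore" with "replace". The sample bytes are made up for illustration.

```python
# Python 3 sketch: raw bytes as they might arrive from an uploaded file.
# The sample value is illustrative, not taken from the asker's data.
raw = b"ILLEGA\xa0L"

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# errors="ignore" silently drops bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

"ignore" discards data without any trace, so "replace" may be the safer choice when you want to notice that an upload was damaged.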
Answered by user2588660
Declare the encoding at the top of your script. The declaration must appear on the first or second line; when the script starts with a shebang line, it goes on the second. Like
#!/usr/bin/python
# coding=utf-8
This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').
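Note that the encoding declaration only tells the interpreter how to read string literals in the source file itself; bytes read from an uploaded file still have to be decoded explicitly. The following is a minimal sketch under stated assumptions: Python 3, UTF-8 as the expected encoding, and a hypothetical helper name read_csv_rows that does not come from the original post.

```python
import csv
import io


def read_csv_rows(raw_bytes, encoding="utf-8"):
    """Decode uploaded CSV bytes and return the parsed rows.

    Hypothetical helper for illustration. Undecodable bytes become
    U+FFFD instead of raising UnicodeDecodeError.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    return list(csv.reader(io.StringIO(text)))


# Illustrative upload containing the stray 0xa0 byte from the question.
rows = read_csv_rows(b"name,note\nblah,\xa0 BLAH\n")
for row in rows:
    print(row)
```

If the file was actually written in another encoding, passing that encoding name instead of the default would preserve the original character rather than replacing it.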