Python: How do I thoroughly clean illegal characters out of a string in Python?

Question

Asked by priestc

I have a feature in my program where the user can upload a CSV file, which my program reads and uses as input. One user is complaining that his input throws an error. The error is caused by an illegal character that is encoded incorrectly. The character is below:

Sometimes it appears as a diamond with a "?" in the middle, sometimes it appears as a double diamond with "?" in the middle, sometimes it appears as "\xa0", and sometimes it appears as "\xa0\xa0".

In my program if I do:

print str_with_weird_char

The string will show up in my terminal with the diamond "?" in place of the weird character. If I copy+paste that string into ipython, it will exit with this message:

In [1]: g="blah??blah"
WARNING: 
********
You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!

Notice how the diamond "?" is doubled now. For some reason copy+paste doubles it...

In the django traceback page, it looks like this:

UnicodeDecodeError at /chris/import.html
('ascii', 'blah \xa0 BLAH', 14, 15, 'ordinal not in range(128)')

The thing that messes me up is that I can't do anything with this string without it throwing an exception. I tried unicode(), str(), .encode(), and .encode("utf-8"); no matter what, it throws an error.

What can I do to get this thing to be a working string?

Answer 1

Answered by YOU

You can pass "ignore" as the second argument to .encode/.decode to skip invalid characters, e.g. "ILLEGAL".decode("utf8", "ignore"):

>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte

>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>
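
For reference, here is a minimal sketch of the same idea in Python 3. That is an assumption: the session above is Python 2, where str is a byte string. It compares the "ignore" and "replace" error handlers, and also tries decoding as Latin-1, on the guess that the stray byte 0xa0 is a non-breaking space from a Latin-1 or Windows-1252 file. The variable names are illustrative only.

```python
# Python 3 sketch: raw bytes as they might arrive from an uploaded file.
raw = b"ILLEGA\xa0L"

# Strict UTF-8 decoding fails, because a lone 0xa0 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "ignore" silently drops the undecodable byte.
dropped = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD, the "diamond with a question mark".
replaced = raw.decode("utf-8", "replace")

# Assumption: the file is really Latin-1, where 0xa0 is a non-breaking space.
as_latin1 = raw.decode("latin-1")

print(strict_ok, repr(dropped), repr(replaced), repr(as_latin1))
```

"ignore" loses data without telling you, so "replace" or decoding with the file's real encoding may be preferable when the character matters.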

Answer 2

Answered by user2588660

Declare the coding on the first or second line of your script. If the script starts with a shebang line, the declaration has to be on the second line. Like

#!/usr/bin/python
# coding=utf-8

This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').
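
Note that the coding line only tells Python how to read string literals in the source file itself. Bytes read from an uploaded file at run time still need an explicit decode. Below is a rough sketch in Python 3 (an assumption, since the question uses Python 2). The helper name decode_upload and the fallback to cp1252 are hypothetical choices, not something from the original answer.

```python
import csv
import io

def decode_upload(raw, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in turn, then fall back to replacement."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the data, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")

# Bytes as they might come from the uploaded CSV (the lone 0xa0 is not valid UTF-8).
raw = b"name,note\nblah,\xa0 BLAH\n"
text = decode_upload(raw)
rows = list(csv.reader(io.StringIO(text)))

# Re-encode cleanly as UTF-8 for storage or output.
clean_bytes = text.encode("utf-8")
print(rows)
```

Decoding once at the boundary and working with text everywhere else avoids the implicit ASCII conversions that produce errors like the one in the question.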
