Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/36340627/

Time: 2020-09-14 00:58:09  Source: igfitidea

Remove non-ASCII characters from pandas column

Tags: python, string, pandas, character-encoding

Asked by red_devil

I have been trying to work on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:

+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?ù????D?JL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
   for i in range(0,len(string)):
      if (ord(string[i])< 32 or ord(string[i])>126):
           break
      return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:

'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by using the loop in the filter_func function, I was dealing with this by passing a single character to 'ord'. So the moment it hits a non-ASCII character, that character should be replaced by a space.

Could somebody help me out?

Thanks!

Accepted answer by Padraic Cunningham

Your code fails because you are not applying it to each character; you are applying it to a whole string, and ord raises an error because it takes a single character. You would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
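
To make the difference concrete, here is a minimal sketch on a made-up three-column frame (the sample values are assumptions for illustration, not the asker's data). It shows that DataFrame.apply with axis=1 hands the function a whole row, while Series.apply hands it one cell at a time, which is what the per-character join needs:

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question
df = pd.DataFrame({
    "DB_user": ["ab\x00c", "ò|Z"],
    "source": ["A", "B"],
    "count": [10, 3],
})

# DataFrame.apply(axis=1) passes each row to the function as a Series
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Series.apply passes each individual cell value (a str here)
cell_types = df["DB_user"].apply(lambda value: type(value).__name__).tolist()

# Per-character replacement applied to the single column
cleaned = df["DB_user"].apply(
    lambda x: "".join(" " if ord(i) < 32 or ord(i) > 126 else i for i in x)
).tolist()

print(row_types, cell_types, cleaned)
```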

You can also simplify the join using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
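
If the column may also hold missing values, the lambda can be wrapped in a small named helper. The following is only a sketch: the name to_printable_ascii and its fill parameter are made up for illustration, and passing non-strings through unchanged is an assumption about what is wanted.

```python
import pandas as pd

def to_printable_ascii(value, fill=" "):
    """Replace every character outside code points 32..126 with `fill`."""
    if not isinstance(value, str):
        return value  # leave None / NaN and other non-strings untouched
    return "".join(ch if 32 <= ord(ch) <= 126 else fill for ch in value)

s = pd.Series(["Déjà vu~", None, "tab\there"])
cleaned = s.apply(to_printable_ascii)
print(cleaned.tolist())
```

The inclusive bounds keep both the space (32) and the tilde (126).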

You could also use string.printable to filter the characters:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
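
One caveat with this variant: string.printable also contains the whitespace control characters tab, newline, carriage return, vertical tab and form feed, so those are kept. A possible adjustment, sketched here under the assumption that only the plain space should survive, is to subtract them from the allowed set:

```python
import string
import pandas as pd

# string.whitespace is ' \t\n\r\x0b\x0c'; keep only the plain space
control_whitespace = set(string.whitespace) - {" "}
allowed = set(string.printable) - control_whitespace

s = pd.Series(["a\tb\x0bc", "ò|zz"])
cleaned = s.apply(lambda x: "".join(ch if ch in allowed else " " for ch in x))
print(cleaned.tolist())
```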

The fastest approach is to use translate:

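# Python 3: str.maketrans replaces the Python 2 string.maketrans function
# Characters to blank out: control characters (0-31) and code points 127-255
del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))

# Code points above 255 are not in the table and pass through unchanged
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))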

Interestingly, that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)
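
On Python 3, strings can contain code points far beyond 255, which a fixed 256-entry table does not cover. One way to handle that, offered as a sketch rather than as part of the original answer, is a dict subclass whose __missing__ method decides the replacement lazily; str.translate looks characters up by code point, so this works for any input. The class name PrintableAsciiTable is made up for illustration.

```python
import pandas as pd

class PrintableAsciiTable(dict):
    """Translation table: keep code points 32..126, map everything else to a space."""

    def __missing__(self, codepoint):
        # Returning the code point itself keeps the character unchanged
        result = codepoint if 32 <= codepoint <= 126 else " "
        self[codepoint] = result  # cache so each code point is decided once
        return result

table = PrintableAsciiTable()

s = pd.Series(["Déjà vu\x00", "price: 5€", ";test 123"])
cleaned = s.apply(lambda text: text.translate(table))
print(cleaned.tolist())
```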

Answer by MaxU

You may try this:

df['DB_user'] = df['DB_user'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
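
Note that this pattern deletes characters above 0x7F but leaves ASCII control characters such as '\x00' in place, and it removes rather than replaces. A variant that substitutes a space for anything outside the printable range 0x20-0x7E could look like this (a sketch on made-up sample values):

```python
import pandas as pd

s = pd.Series(["Déjà vu", "a\x00b", ";test 123"])

# Pattern from the answer: drops non-ASCII, keeps control characters
removed_non_ascii = s.replace({r"[^\x00-\x7F]+": ""}, regex=True)

# Variant: replace every character outside printable ASCII with a space
spaced = s.str.replace(r"[^\x20-\x7E]", " ", regex=True)

print(removed_non_ascii.tolist())
print(spaced.tolist())
```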

Answer by cs95

A common trick is to encode to ASCII with the errors="ignore" flag, then decode the result back from ASCII:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

For Python 3.x and above, this is my recommended solution.

Minimal Code Sample

s = pd.Series(['Déjà vu', 'ò|zz', ';test 123'])
s

0      Déjà vu
1         ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that do not belong to some other character encoding scheme (not just ASCII).
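
As a sketch of that extension, the same encode/decode round trip with 'latin-1' (chosen here purely as an example encoding) keeps accented Latin letters and drops everything the encoding cannot represent:

```python
import pandas as pd

s = pd.Series(["Déjà vu €5", "naïve 日本語"])

# Characters that latin-1 cannot encode are silently dropped by errors="ignore"
latin1_only = s.str.encode("latin-1", "ignore").str.decode("latin-1")
print(latin1_only.tolist())
```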

Answer by Josh Friedlander

A couple of the answers given here aren't correct. Simple validation:

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> '\x0b'  # FAIL
del_chars = "".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. The translate example in that answer originally contained some Python errors/typos, which are amended here; otherwise it should be the fastest.
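
The spot checks above can be generalized into a small harness that runs each approach over all 256 code points and tests whether only printable ASCII (32 to 126) remains. This is a sketch: the helper is_clean and the dictionary keys are names made up for illustration, and only three of the approaches are included.

```python
import pandas as pd

def is_clean(text):
    """True if every character is printable ASCII (code points 32..126)."""
    return all(32 <= ord(ch) <= 126 for ch in text)

s = pd.Series([chr(x) for x in range(256)])

candidates = {
    "regex_strip_non_ascii": s.replace({r"[^\x00-\x7F]+": ""}, regex=True),
    "encode_decode": s.str.encode("ascii", "ignore").str.decode("ascii"),
    "ord_range": s.apply(
        lambda x: "".join(c if 32 <= ord(c) <= 126 else " " for c in x)
    ),
}

# For each approach, check that every resulting string is clean
results = {name: bool(out.apply(is_clean).all()) for name, out in candidates.items()}
print(results)
```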

Answer by Justin Malinchak

This worked for me:

import re
def replace_foreign_characters(s):
    return re.sub(r'[^\x00-\x7f]',r'', s)

df['column_name'] = df['column_name'].apply(replace_foreign_characters)