Remove non-ASCII characters from a pandas column

Question

Asked by red_devil

I have been working on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:

+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?ù????D?JL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
    for i in range(0, len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
        return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:

'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by using the loop in filter_func I was handling this by passing a single character to ord, so that the moment it hits a non-ASCII character, that character would be replaced by a space.

Could somebody help me out?

Thanks!

Answer 1

Accepted answer by Padraic Cunningham

Your code fails because you are not applying it to each character; you are applying it to whole strings, and ord raises an error because it takes a single character. You would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
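
To see why the original call failed, it may help to look at what each form of apply hands to the function. The sketch below uses a small made-up frame (the values are illustrative, not the asker's data). DataFrame.apply with axis=1 passes a whole row as a Series, so indexing it yields an entire cell rather than one character, while Series.apply passes each cell's string. The helper name to_printable_ascii is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, standing in for the frame in the question.
df = pd.DataFrame(
    {"DB_user": ["abc\x01\u00f2", "x~y"], "source": ["A", "B"], "count": [10, 3]}
)

# axis=1 hands the function one row at a time, as a Series.
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Series.apply hands the function one cell value at a time.
cell_types = df["DB_user"].apply(lambda value: type(value).__name__).tolist()


def to_printable_ascii(text):
    """Replace every character outside the range 32..126 with a space."""
    return "".join(ch if 32 <= ord(ch) <= 126 else " " for ch in text)


cleaned = df["DB_user"].apply(to_printable_ascii).tolist()
print(row_types, cell_types, cleaned)
```

Calling ord on an element of the row Series is what produces the "string of length 66" error, since that element is a full cell value.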

You can also simplify the condition using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

The fastest is to use translate:

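# Python 3: str.maketrans replaces Python 2's string.maketrans.
# Only control characters and code points 127-255 are mapped to spaces.
del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))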

Interestingly, that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)
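
Timings depend on the data and the pandas version, so the speed claim is worth checking locally. A rough sketch, assuming Python 3 (hence str.maketrans) and a made-up test column; it first confirms that both routes give the same output and then times them with timeit:

```python
import timeit

import pandas as pd

# Hypothetical test column: 10,000 copies of a string that mixes
# control characters, non-ASCII characters and printable ASCII.
s = pd.Series(["abc\x01\x7f\u00f2\u00f9~ def"] * 10_000)

del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))

via_apply = s.apply(lambda text: text.translate(trans))
via_str = s.str.translate(trans)

# Both routes should produce identical output; only the speed may differ.
same_result = via_apply.tolist() == via_str.tolist()

t_apply = timeit.timeit(lambda: s.apply(lambda text: text.translate(trans)), number=5)
t_str = timeit.timeit(lambda: s.str.translate(trans), number=5)
print(f"same result: {same_result}, apply: {t_apply:.4f}s, str.translate: {t_str:.4f}s")
```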

Answer 2

Answer by MaxU

You may try this:

df['DB_user'] = df['DB_user'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
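
The question asked for spaces rather than deletion. A sketch of the same regex idea that substitutes one space per non-ASCII character, using Series.str.replace with regex=True; the sample values are made up. Note that this pattern leaves ASCII control characters such as \x00 untouched.

```python
import pandas as pd

# Hypothetical sample values; \u00e9, \u00e0 and \u00f2 are é, à and ò.
s = pd.Series(["D\u00e9j\u00e0 vu", "\u00f2|zz", ";test 123"])

# Without a trailing "+", each non-ASCII character becomes exactly one space.
spaced = s.str.replace(r"[^\x00-\x7F]", " ", regex=True)
print(spaced.tolist())
```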

Answer 3

Answer by cs95

A common trick is to encode to ASCII with the errors="ignore" flag, then decode the result back from ASCII:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

For Python 3.x and above, this is my recommended solution.

Minimal Code Sample

s = pd.Series(['Déjà vu', 'ò|zz', ';test 123'])
s

0      Déjà vu
1         ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that cannot be represented in some other character encoding (not just ASCII).
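
As an illustration of that extension, here is a sketch that keeps only characters representable in Latin-1. The choice of Latin-1 and the sample strings are assumptions for the example; any codec name Python knows could be substituted.

```python
import pandas as pd

# "\u20ac" is the euro sign, which Latin-1 cannot represent;
# "\u00e9" (e with acute accent) it can.
s = pd.Series(["caf\u00e9 5\u20ac", "plain"])

# Characters the codec cannot encode are dropped, the rest round-trip.
latin1_only = s.str.encode("latin-1", "ignore").str.decode("latin-1")
print(latin1_only.tolist())
```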

Answer 4

Answer by Josh Friedlander

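A couple of the answers given here aren't correct if, as in the question, the goal is to replace everything outside the printable range 32 to 126 (control characters included). Simple validation: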

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> '\x0b'  # FAIL
del_chars =  " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. The translate snippet in that answer had some odd Python errors/typos, amended in the version above, but otherwise it should be the fastest.
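
The same kind of check can be run against any candidate over all 256 code points at once. The sketch below does that for a regex variant restricted to the printable range; keep_printable_ascii is a hypothetical helper, not taken from any of the answers, and it swaps in the pattern [^\x20-\x7E] in place of the per-character loop.

```python
import pandas as pd


def keep_printable_ascii(series, fill=" "):
    """Replace every character outside U+0020..U+007E with `fill`.

    Hypothetical helper, not taken from any of the answers.
    """
    return series.str.replace(r"[^\x20-\x7E]", fill, regex=True)


# One character per row, covering code points 0..255 as in the validation above.
s = pd.Series([chr(x) for x in range(256)])
out = keep_printable_ascii(s)

# Code points whose character came through unchanged.
unchanged = [i for i in range(256) if out.loc[i] == chr(i)]
print(unchanged[0], unchanged[-1], len(unchanged))
```

If the helper behaves as intended, the unchanged code points are exactly 32 through 126.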

Answer 5

Answer by Justin Malinchak

This worked for me:

import re
def replace_foreign_characters(s):
    return re.sub(r'[^\x00-\x7f]',r'', s)

df['column_name'] = df['column_name'].apply(lambda x: replace_foreign_characters(x))
