Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/51287850/

Date: 2020-09-14 05:47:41  Source: igfitidea

How to find special characters in a Python data frame

python-3.x pandas dataframe special-characters

Asked by SPy

I need to find special characters in an entire data frame.

In the data frame below, some columns contain special characters. How can I find which columns contain them?

[image: example data frame]

I want to display a message for each column that contains special characters.

Accepted answer by rafaelc

You can set up an alphabet of valid characters, for example:

import string
alphabet = string.ascii_letters+string.punctuation

which is:

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'

Then just use:

df.col.str.strip(alphabet).astype(bool).any()

For example:

import pandas as pd
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['?éG', '?']})


    col1    col2
0   abc     ?éG
1   hello?  ?

Then, with the above alphabet:

df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True
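
To answer the original question for the whole data frame rather than one column at a time, the same strip-and-test idea can be applied to every column in a loop. The following is a minimal sketch, assuming that only actual string values should be checked and that numbers and missing values can be skipped. The helper name `columns_with_special_chars` is hypothetical and not part of pandas, and it uses plain Python `str.strip` on each value instead of the `.str` accessor.

```python
import string

import pandas as pd

# Valid characters, as in the answer above (note: no digits and no spaces)
alphabet = string.ascii_letters + string.punctuation


def columns_with_special_chars(df, valid=alphabet):
    """Return the names of columns holding a string with characters outside `valid`.

    Hypothetical helper for illustration; not a pandas function.
    """
    flagged = []
    for name in df.columns:
        # Keep only real strings; numbers and missing values are skipped (assumption)
        values = [v for v in df[name] if isinstance(v, str)]
        # A non-empty remainder after stripping means an invalid character is present
        if any(v.strip(valid) for v in values):
            flagged.append(name)
    return flagged


df = pd.DataFrame({"col1": ["abc", "hello?"], "col2": ["?éG", "?"]})
for name in columns_with_special_chars(df):
    print(f"Column {name!r} contains special characters")
```

Stripping from both ends is enough here: the strip stops at the first invalid character on each side, so the remainder is empty only when every character belongs to the alphabet.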

The term "special characters" can be very tricky, because it depends on your interpretation. For example, you might or might not consider # to be a special character. Also, some languages (such as Portuguese) may have characters like ç and é, but others (such as English) will not.
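
How much the result depends on the chosen alphabet can be seen by running the same check with two different alphabets. This is only a sketch: the `relaxed` alphabet and the sample strings are assumptions made for illustration, and the test compares the stripped remainder with the empty string rather than casting to `bool`.

```python
import string

import pandas as pd

# Alphabet from the answer above: letters and punctuation only
strict = string.ascii_letters + string.punctuation
# Hypothetical wider alphabet: also accept digits, spaces and a few accented letters
relaxed = strict + string.digits + " " + "çéãõ"

s = pd.Series(["room 101", "coração", "plain"])  # made-up sample values

# A non-empty remainder means the value holds a character outside the alphabet
strict_flags = s.str.strip(strict).ne("")
relaxed_flags = s.str.strip(relaxed).ne("")

print(strict_flags.tolist())   # digits, spaces and accents count as special
print(relaxed_flags.tolist())  # nothing is flagged with the wider alphabet
```

Digits and spaces are not part of `string.ascii_letters + string.punctuation`, so the strict alphabet flags them just as it flags accented letters.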

Answer by Plinus

To remove unwanted characters from data frame columns, use a regex:

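import re

# Anything outside this allowed set is removed; '-' and '[' are escaped so they are literals.
UNWANTED = re.compile(r'[^a-zA-Z0-9 !@#$%&*_+\-=|:";<>,./()\[\]{}\']')

def strip_character(value):
    return UNWANTED.sub('', value)

# resultCol and dataCol are the names of the output and input columns
df[resultCol] = df[dataCol].apply(strip_character)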
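
The same substitution can also be used to detect unwanted characters, not only remove them: a value that changes when cleaned must have contained at least one. Below is a small usage sketch, assuming a made-up column named `raw`. The character class accepts the same characters as the pattern above, with the hyphen and opening bracket escaped and the digits written out explicitly.

```python
import re

import pandas as pd

# Allowed characters: letters, digits, space and common punctuation
UNWANTED = re.compile(r'[^a-zA-Z0-9 !@#$%&*_+\-=|:";<>,./()\[\]{}\']')


def strip_character(value):
    # Remove every character that is not in the allowed set
    return UNWANTED.sub("", value)


df = pd.DataFrame({"raw": ["café", "naïve text", "ok!"]})  # hypothetical sample data
df["clean"] = df["raw"].apply(strip_character)
# A row changed by the cleaning step contained at least one unwanted character
df["had_unwanted"] = df["raw"] != df["clean"]
print(df)
```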