Simple way to remove special characters and alphanumeric characters from a pandas dataframe

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/44009113/

Time: 2020-09-14 03:36:58  Source: igfitidea

Tags: python, regex, pandas, dataframe, data-cleaning

Asked by Sitz Blogz

I have a large dataset with x rows and y columns. One of the columns contains words mixed with some unwanted data. The unwanted data has no specific pattern, so I am finding it difficult to remove it from the dataframe.

nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', '\xd9\x85\xd8\xb7\xd8\xa7\xd8\xb1', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', '\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', '\xd1\x81\xd0\xbf\xd0\xbe\xd0\xba\xd0\xbe\xd0\xb9\xd0\xbd\xd0\xbe\xd0\xb9', '\xd0\xbd\xd0\xbe\xd1\x87\xd0\xb8', '\xd9\x84\xd9\x8a\xd9\x84\xd8\xa9', '\xd8\xb3\xd8\xb9\xd9\x8a\xd8\xaf\xd8\xa9!', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', 'Viceroy', 'Hotel,', 'Yas\xe2\x80\xa6']

Every token that is not a word should be removed. This is only one column in the large dataset; the column name is nonhashtag.

What is a simple way to clean the column: either remove the unwanted values outright or replace them with NaN?

Expected output

nonhashtag
    ['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
    ['Just', 'posted', 'photo', 'Rasim', 'Villa']
    ['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi']
    ['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
    ['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
    ['Buona', 'notte', 'Viceroy', 'Hotel,']

Every [] is one row in that particular column, so only the \x escape sequences and the characters attached to them need to be removed. An empty [] should be left in the row. Keeping the row is important because the other columns in that row hold needed information.

I could not get past reading the input to write proper code, because I cannot find a pattern in the dataset to write a regex for.

Thanks in advance for the help

Answered by MaxU

Is that what you want?

In [71]: df.nonhashtag.apply(' '.join).str.replace(r'[^A-Za-z\s]+', '', regex=True) \
           .str.split(expand=False)
Out[71]:
0    [want, better, than, Dhabi, United, Arab, Emir...
1                  [Just, posted, photo, Rasim, Villa]
2          [Dhabi, International, Airport, AUH, Dhabi]
3       [just, shrug, off, Dubai, Mall, Burj, Khalifa]
4    [out, Cowboy, steppin, Notorious, going, sleep...
5                  [Buona, notte, Viceroy, Hotel, Yas]
Name: nonhashtag, dtype: object
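
For readers who want to run the session above, here is a minimal self-contained sketch. The sample rows are a hypothetical subset modeled on the question's data, and the non-ASCII token is assumed to have been decoded to real Arabic text rather than raw \x byte escapes. It passes regex=True explicitly, which pandas 2.0 and later need for the pattern to be treated as a regular expression.

```python
import pandas as pd

# Hypothetical sample modeled on the question's column (not the real dataset).
df = pd.DataFrame({
    "nonhashtag": [
        ["just", "shrug", "off!", "Dubai", "Mall", "Burj", "Khalifa"],
        ["Dhabi", "Airport", "(AUH)", "\u0645\u0637\u0627\u0631", "Dhabi"],
        ["\u0645\u0637\u0627\u0631"],  # row with no ASCII words at all
    ]
})

cleaned = (
    df["nonhashtag"]
    .apply(" ".join)                                # list of tokens -> one string
    .str.replace(r"[^A-Za-z\s]+", "", regex=True)   # drop non-letter, non-whitespace chars
    .str.split()                                    # back to a list of tokens
)
print(cleaned.tolist())
```

Note that the last row becomes an empty list instead of disappearing, so the row count is preserved.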

'[^A-Za-z\s]+' is a regex that matches every character except:

  • ASCII letters from A to Z
  • ASCII letters from a to z
  • whitespace (spaces, tabs, newlines)

So .str.replace(r'[^A-Za-z\s]+', '', regex=True) removes every character except letters of the English alphabet and whitespace. In pandas 2.0 and later, regex=True must be passed explicitly, because str.replace treats the pattern as a literal string by default.
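
The regex approach also strips punctuation, turning 'off!' into 'off' and '(AUH)' into 'AUH', and it keeps 'Yas' from the last row, whereas the expected output in the question keeps punctuation and drops that token entirely. A token-level filter is one possible alternative. The sketch below is an assumption-laden illustration, not part of the original answer: keep_ascii_tokens is a hypothetical helper, the sample rows are made up to resemble the question's data, and str.isascii() requires Python 3.7 or later.

```python
import pandas as pd

def keep_ascii_tokens(tokens):
    """Return only the tokens made up entirely of ASCII characters."""
    return [tok for tok in tokens if tok.isascii()]

# Hypothetical sample; the "other" column stands in for the rest of the dataset.
df = pd.DataFrame({
    "nonhashtag": [
        ["Dhabi", "International", "Airport", "(AUH)", "\u0645\u0637\u0627\u0631", "Dhabi"],
        ["Buona", "notte", "Viceroy", "Hotel,", "Yas\u2026"],
        ["\u0645\u0637\u0627\u0631"],
    ],
    "other": [1, 2, 3],
})

# Filter each list in place; rows that lose every token keep an empty list.
df["nonhashtag"] = df["nonhashtag"].apply(keep_ascii_tokens)
print(df["nonhashtag"].tolist())
```

Because whole tokens are either kept or dropped, punctuation attached to ASCII words survives, and no row is removed from the dataframe.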

Answered by rishi jain

I import a lot of files, and the column names are often dirty: they contain unwanted special characters, and I don't know in advance which characters might appear. I want only letters, digits, and underscores in column names, with no spaces.

df.columns = df.columns.str.strip()                                     # trim leading/trailing whitespace
df.columns = df.columns.str.replace(' ', '_')                           # spaces -> underscores
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)   # drop everything except letters, digits, underscores
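
When the same cleanup is applied to many imported files, the three steps can be wrapped in a small function. This is a minimal sketch: clean_column_names is a hypothetical helper name, and the column names in the example are invented for illustration.

```python
import pandas as pd

def clean_column_names(frame):
    """Return a copy of `frame` whose column names hold only letters, digits, and underscores."""
    cleaned = (
        frame.columns.str.strip()                        # trim surrounding whitespace
        .str.replace(" ", "_", regex=False)              # inner spaces -> underscores
        .str.replace(r"[^a-zA-Z\d_]+", "", regex=True)   # drop remaining special characters
    )
    return frame.set_axis(cleaned, axis=1)

# Invented column names for illustration.
raw = pd.DataFrame(columns=[" first name ", "price ($)", "id#"])
print(list(clean_column_names(raw).columns))
```

Different dirty names can collapse to the same cleaned name (for example 'id#' and 'id!'), so it may be worth checking the result for duplicates before relying on it.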