pandas: a simple way to remove special characters and alphanumerics from a DataFrame

Question

Asked by Sitz Blogz

I have a large dataset with x rows and y columns. One of the columns contains words mixed with some unwanted data. The unwanted data has no specific pattern, so I am finding it difficult to remove it from the dataframe.

nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', '\xd9\x85\xd8\xb7\xd8\xa7\xd8\xb1', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', '\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', '\xd1\x81\xd0\xbf\xd0\xbe\xd0\xba\xd0\xbe\xd0\xb9\xd0\xbd\xd0\xbe\xd0\xb9', '\xd0\xbd\xd0\xbe\xd1\x87\xd0\xb8', '\xd9\x84\xd9\x8a\xd9\x84\xd8\xa9', '\xd8\xb3\xd8\xb9\xd9\x8a\xd8\xaf\xd8\xa9!', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', 'Viceroy', 'Hotel,', 'Yas\xe2\x80\xa6']

Every entry that is not a word should be removed. This is only one column in the large dataset; the column name is nonhashtag.

What is a simple way to clean the column: either remove these entries outright or replace them with NaN?

Expected output:

nonhashtag
    ['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
    ['Just', 'posted', 'photo', 'Rasim', 'Villa']
    ['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi']
    ['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
    ['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
    ['Buona', 'notte', 'Viceroy', 'Hotel,']

Each [] is one row of that column, so only the \x escape sequences and the characters that go with them need to be removed. If a list ends up empty, the empty [] should stay in the row. Keeping the row matters because the other columns in that row hold needed information.

I could not get past reading the input to write proper code, because I cannot find a pattern in the dataset on which to base a regex.

Thanks in advance for the help.

Answer 1

Answered by MaxU

Is that what you want?

In [71]: df.nonhashtag.apply(' '.join).str.replace(r'[^A-Za-z\s]+', '', regex=True) \
           .str.split(expand=False)
Out[71]:
0    [want, better, than, Dhabi, United, Arab, Emir...
1                  [Just, posted, photo, Rasim, Villa]
2          [Dhabi, International, Airport, AUH, Dhabi]
3       [just, shrug, off, Dubai, Mall, Burj, Khalifa]
4    [out, Cowboy, steppin, Notorious, going, sleep...
5                  [Buona, notte, Viceroy, Hotel, Yas]
Name: nonhashtag, dtype: object

'[^A-Za-z\s]+' is a regular expression that matches every run of characters except:

letters with ASCII codes from A to Z
letters with ASCII codes from a to z
whitespace such as spaces and tabs
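To make the character class concrete, here is a minimal sketch using Python's standard re module on a single made-up string. The sample text is illustrative and not taken from the dataset.

```python
import re

# Same pattern as in the answer: one or more characters that are
# neither ASCII letters nor whitespace.
pattern = r"[^A-Za-z\s]+"

# Illustrative input (an assumption, not from the question's data):
# punctuation, digits, a tab, and a trailing non-ASCII ellipsis.
sample = "Dhabi, (AUH) off! 2016\tYas\u2026"

cleaned = re.sub(pattern, "", sample)
print(repr(cleaned))  # 'Dhabi AUH off \tYas'
```

Note that digits and punctuation are removed along with the non-ASCII characters, while the spaces and the tab survive.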

So .str.replace(r'[^A-Za-z\s]+', '', regex=True) removes all characters except letters of the English alphabet and whitespace. In pandas 2.0 and later the pattern is treated as a literal string unless regex=True is passed, so the flag has to be given explicitly.
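The expected output in the question keeps punctuation such as 'Dhabi,' and '(AUH)', which the regex above strips. If that matters, one alternative is to drop whole tokens that contain any non-ASCII character. This is only a sketch and not part of the original answer: it assumes Python 3.7+ (for str.isascii), that every cell holds a list of str, and the helper name keep_ascii_tokens and the small sample frame are made up for illustration.

```python
import pandas as pd

# Small illustrative frame; the non-ASCII tokens are written as
# Unicode escapes (Arabic words and a trailing ellipsis).
df = pd.DataFrame({
    "nonhashtag": [
        ["Dhabi", "International", "Airport", "(AUH)",
         "\u0645\u0637\u0627\u0631", "\u0623\u0628\u0648", "Dhabi"],
        ["Buona", "notte", "Viceroy", "Hotel,", "Yas\u2026"],
        ["\u0638\u0628\u064a"],
    ]
})

def keep_ascii_tokens(tokens):
    """Hypothetical helper: keep only tokens made entirely of ASCII characters."""
    return [t for t in tokens if t.isascii()]

# The row count is unchanged; a row with no ASCII tokens becomes [].
df["nonhashtag"] = df["nonhashtag"].apply(keep_ascii_tokens)
print(df["nonhashtag"].tolist())
# [['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi'],
#  ['Buona', 'notte', 'Viceroy', 'Hotel,'], []]
```

Because the filter works per token, punctuation attached to English words is preserved, and rows that end up empty stay in place as [].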

Answer 2

Answered by rishi jain

I import a lot of files, and the column names are often dirty: they contain unwanted special characters, and I cannot know in advance which characters will appear. I want only underscores as separators in column names, with no spaces.

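# Trim leading and trailing whitespace, then turn inner spaces into underscores
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')
# Drop everything that is not an ASCII letter, a digit, or an underscore
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)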
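When the same cleanup is applied to many imported files, the three steps can be wrapped in a function. The following is a sketch under assumptions: the function name clean_columns and the sample column names are invented for illustration, and all column labels are assumed to be strings.

```python
import pandas as pd

def clean_columns(df):
    """Hypothetical helper: return a copy of df with sanitized column names."""
    cols = df.columns.str.strip()                    # trim outer whitespace
    cols = cols.str.replace(" ", "_", regex=False)   # spaces -> underscores
    cols = cols.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)  # drop the rest
    out = df.copy()
    out.columns = cols
    return out

# Illustrative frame with messy headers (made-up names).
raw = pd.DataFrame([[1, 2, 3]], columns=[" first name ", "price ($)", "e-mail"])
cleaned = clean_columns(raw)
print(list(cleaned.columns))  # ['first_name', 'price_', 'email']
```

Two different raw names can collapse to the same cleaned name (for example "a-b" and "ab"), so it may be worth checking cleaned.columns.is_unique afterwards.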
