pandas: applying a regex to an entire DataFrame column

Question

Asked by hello kee

I have a Dataframe with 3 columns:

id,name,team 
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n

I am trying to apply a regex so that I can remove the unnecessary whitespace (stray spaces and trailing newlines). I have code that removes these characters from a single string, but I am unable to apply it to the entire DataFrame column.

This is what I have tried thus far:

df['team'] = re.sub(r'[\n\r]*','',df['team'])

But this throws an error AttributeError: 'Series' object has no attribute 're'

Could anyone advise how I could apply this regex to the entire df['team'] column?

Answer 1

Answered by YOLO

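You are almost there. re.sub works on a single string, not on a Series, so the substitution has to be applied to each value. There are two simple ways of doing this: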

import re

# Option 1: list comprehension (usually the faster of the two)
df['team'] = [re.sub(r'[\n\r]*', '', str(x)) for x in df['team']]

# Option 2: Series.apply with the same substitution
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*', '', str(x)))
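
For anyone who wants to try both options end to end, here is a minimal runnable sketch. The sample DataFrame is rebuilt by hand from the question (an assumption about what the real data looks like), and the pattern is precompiled purely for convenience.

```python
import re
import pandas as pd

# Sample data mirroring the question (hand-built, not the asker's real file).
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["kevin", "scott", "peter"],
    "team": [" marketing", "admin\n", "finance\n"],
})

pattern = re.compile(r"[\n\r]+")

# Option 1: list comprehension over the column values.
via_list = [pattern.sub("", str(x)) for x in df["team"]]

# Option 2: Series.apply with the same substitution.
via_apply = df["team"].apply(lambda x: pattern.sub("", str(x)))

df["team"] = via_apply
print(df["team"].tolist())  # [' marketing', 'admin', 'finance']
```

Two things to keep in mind: this pattern only removes newlines and carriage returns, so the leading space in ' marketing' is left alone, and str(x) turns missing values (NaN) into the literal string 'nan'.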

Answer 2

Answered by josem8f

Since it's a DataFrame, have a look at replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

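# Assign the result back instead of using inplace=True on a column selection,
# which may not update df in newer pandas versions.
df['team'] = df['team'].replace({r"[\n\r]+": ''}, regex=True)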

Regarding the regex: '*' means zero or more, so it also matches the empty string. What you want is '+', which means one or more.
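
To make the difference between the two quantifiers visible, here is a small sketch that substitutes a visible '-' instead of an empty string. The test string and the two-row Series are made up for illustration.

```python
import re
import pandas as pd

text = "ab\n"

# '+' only matches where at least one newline character is present.
plus_result = re.sub(r"[\n\r]+", "-", text)

# '*' also matches the empty string, so replacements are made
# between ordinary characters as well.
star_result = re.sub(r"[\n\r]*", "-", text)

print(repr(plus_result))  # 'ab-'
print(repr(star_result))  # contains extra '-' characters

# The same '+' pattern applied to a whole column with Series.replace.
team = pd.Series(["admin\n", "finance\r\n"])
cleaned = team.replace({r"[\n\r]+": ""}, regex=True)
print(cleaned.tolist())  # ['admin', 'finance']
```

When the replacement is an empty string, both patterns give the same final text, so '+' is mainly about expressing the intent and avoiding pointless empty matches.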

Answer 3

Answered by ShadyMBA

Here's a technique for replacing multiple words in a pandas column in one step, without an explicit loop. In my code I wanted to eliminate things like 'CORPORATION', 'LLC', etc. from my column (all of the words are listed in the RemoveDB.csv file). In this scenario I'm removing 40 words from the entire column in one step by joining them into a single alternation pattern.

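import re
import pandas as pd

# Load the words to remove from the 'REMOVE' column of the CSV file
RemoveDB = pd.read_csv('RemoveDB.csv')
RemoveDB = RemoveDB['REMOVE'].tolist()
# Join the words into one alternation pattern: WORD1|WORD2|...
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)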
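
Here is a self-contained variation of the same idea that runs without the CSV file. The word list and company names are hypothetical stand-ins, and the use of re.escape and \b word boundaries is my own addition, not part of the answer above: it keeps special characters literal and avoids clipping words such as 'INCOME'.

```python
import re
import pandas as pd

# Hypothetical stand-in for the 'REMOVE' column of RemoveDB.csv.
remove_words = ["CORPORATION", "LLC", "INC"]

# Escape each word, join into one alternation, and require whole-word matches.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b")

# Hypothetical company names for illustration.
names = pd.Series(["ACME CORPORATION", "GLOBEX LLC", "INCOME PARTNERS INC"])

# Remove every listed word in one pass, then trim the leftover whitespace.
cleaned = names.str.replace(pattern, "", regex=True).str.strip()
print(cleaned.tolist())  # ['ACME', 'GLOBEX', 'INCOME PARTNERS']
```

Without the word boundaries, the 'INC' entry would also eat the first three letters of 'INCOME'.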

Answer 4

Answered by user1966723

Another example (without regex), which may still be useful to someone: str.strip removes leading and trailing whitespace, including newlines.

import pandas as pd

id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series(['     marketing','admin\n', 'finance\n'])

testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
# Strip leading/trailing whitespace (spaces, \n, \r) from every value
testsO['team'] = testsO['team'].str.strip()
print(testsO)
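
If more than one text column needs the same cleanup, the strip call can be repeated per column. A minimal sketch, assuming the column names from the question and that both columns hold strings:

```python
import pandas as pd

# Sample data rebuilt from the question; the padding in 'name' is made up.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["kevin ", "scott", " peter"],
    "team": ["     marketing", "admin\n", "finance\n"],
})

# Strip leading/trailing whitespace in each text column.
for col in ["name", "team"]:
    df[col] = df[col].str.strip()

print(df["name"].tolist())  # ['kevin', 'scott', 'peter']
print(df["team"].tolist())  # ['marketing', 'admin', 'finance']
```

Note that str.strip only touches the ends of each value; whitespace in the middle of a string still needs a regex replacement.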
