Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/53962844/

Time: 2020-09-14 06:14:23  Source: igfitidea

Applying Regex across entire column of a Dataframe

Tags: python, python-3.x, pandas

Asked by hello kee

I have a Dataframe with 3 columns:

id,name,team 
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n

I am trying to apply a regex function to remove the unnecessary whitespace. I have code that removes it from a single string; however, I am unable to apply it across the entire Dataframe.

This is what I have tried thus far:

df['team'] = re.sub(r'[\n\r]*','',df['team'])

But this throws an error: AttributeError: 'Series' object has no attribute 're'

Could anyone advise how I can apply this regex to the entire df['team'] column of the Dataframe?

Answered by YOLO

You are almost there. There are two simple ways of doing this:

import re

# option 1 - faster way: list comprehension over the column values
df['team'] = [re.sub(r'[\n\r]*', '', str(x)) for x in df['team']]

# option 2 - Series.apply with a lambda
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*', '', str(x)))
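
As a rough, self-contained sketch of the two options above, the following builds a small frame modeled on the data in the question (the frame itself and the names `via_list`, `via_apply`, and `via_str` are assumptions made for illustration). It also shows the vectorized `.str.replace` accessor, which is not part of the original answer but should give the same result here.

```python
import re
import pandas as pd

# Hypothetical sample frame mirroring the data in the question.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["kevin", "scott", "peter"],
    "team": [" marketing", "admin\n", "finance\n"],
})

pattern = re.compile(r"[\n\r]+")  # compile once, reuse for every value

# Option 1: list comprehension over the column values
via_list = [pattern.sub("", str(x)) for x in df["team"]]

# Option 2: Series.apply with the same substitution
via_apply = df["team"].apply(lambda x: pattern.sub("", str(x)))

# Alternative (not in the original answer): the vectorized .str accessor
via_str = df["team"].str.replace(r"[\n\r]+", "", regex=True)

df["team"] = via_str
print(df["team"].tolist())
```

Note that this pattern only removes newline and carriage-return characters, so the leading space in ' marketing' is left in place.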

Answered by josem8f

Since it's a dataframe, check out replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

# assign the result back; inplace=True on a selected column may not update df
df['team'] = df['team'].replace({r"[\n\r]+": ''}, regex=True)

Regarding the regex: '*' matches zero or more occurrences, whereas what you need is '+', which matches one or more.
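
To illustrate the difference between the two quantifiers, here is a small sketch (the sample strings and variable names are made up for illustration). With '*', the pattern also matches the empty string at positions where there is no newline at all; with '+', it only matches where at least one newline character is present. The last part applies the '+' pattern through `Series.replace` with `regex=True`.

```python
import re
import pandas as pd

# '*' allows zero-length matches, so findall reports empty strings too.
star_matches = re.findall(r"[\n\r]*", "ab\n")
# '+' requires at least one newline/carriage-return character.
plus_matches = re.findall(r"[\n\r]+", "ab\n")
print(star_matches)
print(plus_matches)

# Hypothetical column values, including a Windows-style line ending.
team = pd.Series([" marketing", "admin\n", "finance\r\n"])

# Series.replace with a {pattern: replacement} dict and regex=True
cleaned = team.replace({r"[\n\r]+": ""}, regex=True)
print(cleaned.tolist())
```

When the replacement is an empty string, both quantifiers end up producing the same cleaned text; '+' simply avoids the pointless zero-length matches.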

Answered by ShadyMBA

Here's a technique to replace multiple words in a pandas column in one step, without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC', etc. (all of them are listed in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.

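import re
import pandas as pd

RemoveDB = pd.read_csv('RemoveDB.csv')    # file with a 'REMOVE' column of words to drop
RemoveDB = RemoveDB['REMOVE'].tolist()
RemoveDB = '|'.join(RemoveDB)             # one alternation, e.g. 'CORPORATION|LLC|...'
pattern = re.compile(RemoveDB)
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)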
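
A self-contained variant of the same idea might look like the sketch below. The word list and company names are invented stand-ins for the contents of RemoveDB.csv and the NAME column. Two additions that are not in the original answer: each word is passed through `re.escape` so that characters such as '.' are matched literally, and lookarounds keep the pattern from removing a word that is only part of a longer word.

```python
import re
import pandas as pd

# Hypothetical stand-in for the 'REMOVE' column of RemoveDB.csv.
remove_words = ["CORPORATION", "LLC", "INC."]

# Hypothetical company names to clean.
df = pd.DataFrame(
    {"NAME": ["ACME CORPORATION", "GLOBEX LLC", "INITECH INC.", "LLCOOL RECORDS"]}
)

# Escape every word, join them into one alternation, and require that the
# match is not directly preceded or followed by another word character.
alternation = "|".join(re.escape(word) for word in remove_words)
pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

# A compiled pattern must be used with regex=True in Series.str.replace.
df["NAME"] = df["NAME"].str.replace(pattern, "", regex=True).str.strip()
print(df["NAME"].tolist())
```

Without the lookarounds, 'LLC' would also be stripped from the start of 'LLCOOL RECORDS'; whether that matters depends on the data being cleaned.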

Answered by user1966723

Another example (without regex) that may still be useful to someone.

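import pandas as pd

ids = pd.Series(['101', '102', '103'])    # renamed from 'id' to avoid shadowing the built-in
name = pd.Series(['kevin', 'scott', 'peter'])
team = pd.Series(['     marketing', 'admin\n', 'finance\n'])

testsO = pd.DataFrame({'id': ids, 'name': name, 'team': team})
print(testsO)
# str.strip() removes leading and trailing whitespace, including newlines
testsO['team'] = testsO['team'].str.strip()
print(testsO)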
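
For comparison, a regex that behaves like `str.strip()` on this data can be sketched as follows. The frame and the names `stripped`, `via_regex`, and `cleaned` are assumptions for illustration, and the last step assumes every column holds strings.

```python
import pandas as pd

# Hypothetical frame modeled on the example above, with stray whitespace.
tests = pd.DataFrame({
    "id": ["101", "102", "103"],
    "name": ["kevin", "scott ", "peter"],
    "team": ["     marketing", "admin\n", "finance\n"],
})

# Plain string method: trims leading/trailing whitespace, including '\n'.
stripped = tests["team"].str.strip()

# Regex counterpart: remove whitespace anchored at the start or the end.
via_regex = tests["team"].str.replace(r"^\s+|\s+$", "", regex=True)

# Apply the same trimming to every column (all columns are strings here).
cleaned = tests.apply(lambda col: col.str.strip())
print(cleaned)
```

For simple trimming, `str.strip()` is the more readable choice; the regex form is mainly useful when the whitespace rule needs to be combined with other patterns.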