Original URL: http://stackoverflow.com/questions/53962844/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Applying Regex across entire column of a Dataframe
Asked by hello kee
I have a Dataframe with 3 columns:
id,name,team
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n
I am trying to apply a regex so that I can remove the unnecessary whitespace. I have code that removes it, but I am unable to apply it across the entire Dataframe.
This is what I have tried thus far:
df['team'] = re.sub(r'[\n\r]*','',df['team'])
But this throws an error AttributeError: 'Series' object has no attribute 're'
Could anyone advise how I could apply this regex across the entire df['team'] column?
Answered by YOLO
You are almost there. There are two simple ways of doing this:
import re

# option 1 - faster way
df['team'] = [re.sub(r'[\n\r]*','', str(x)) for x in df['team']]
# option 2
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*','', str(x)))
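A minimal sketch of both options run on the sample rows from the question, plus pandas' vectorized `Series.str.replace` as a third route that this answer does not mention. The `df` built here is an assumption reconstructed from the question's data, not the asker's actual frame.

```python
import re
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real df)
df = pd.DataFrame({
    'id': [101, 102, 103],
    'name': ['kevin', 'scott', 'peter'],
    'team': [' marketing', 'admin\n', 'finance\n'],
})

# Option 1: list comprehension over the column values
opt1 = [re.sub(r'[\n\r]*', '', str(x)) for x in df['team']]

# Option 2: Series.apply with the same substitution
opt2 = df['team'].apply(lambda x: re.sub(r'[\n\r]*', '', str(x)))

# Not from the answer: the vectorized string accessor
opt3 = df['team'].str.replace(r'[\n\r]+', '', regex=True)

print(opt1)  # [' marketing', 'admin', 'finance']
```

All three give the same values here. One thing to keep in mind with the first two is that `str(x)` turns a missing value into the literal string 'nan'.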
Answered by josem8f
Since it's a DataFrame, have a look at replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
# assign the result back rather than using inplace=True on a column selection
df['team'] = df['team'].replace({r"[\n\r]+": ''}, regex=True)
Regarding the regex: '*' means 0 or more, so it also matches the empty string; you want '+', which means 1 or more.
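The difference between the two quantifiers is easiest to see with a visible replacement string. A small sketch using only the standard `re` module; the sample string is made up for illustration.

```python
import re

s = 'admin\n'

# With an empty replacement, both patterns end up with the same result
star_removed = re.sub(r'[\n\r]*', '', s)
plus_removed = re.sub(r'[\n\r]+', '', s)

# With a visible replacement, '*' also replaces the empty matches between characters
star_marked = re.sub(r'[\n\r]*', '-', s)
plus_marked = re.sub(r'[\n\r]+', '-', s)

print(repr(plus_marked))  # 'admin-'
print(repr(star_marked))  # contains extra '-' markers from the empty matches
```

So for pure removal either pattern works, but '+' states the intent and avoids doing a substitution at every position in the string.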
Answered by ShadyMBA
Here's a powerful technique to replace multiple words in a pandas column in one step without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC', etc. (all of them are in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.
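import re
import pandas as pd

# Load the words to remove (the 'REMOVE' column of RemoveDB.csv)
RemoveDB = pd.read_csv('RemoveDB.csv')
RemoveDB = RemoveDB['REMOVE'].tolist()
# Join the words into one alternation pattern: 'WORD1|WORD2|...'
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)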
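A self-contained sketch of the same idea, with the word list held in memory instead of read from RemoveDB.csv. The sample names and words are made up. The `re.escape` call and the trailing `str.strip()` are additions not in the original answer, included on the assumption that the words should be matched literally and that leftover whitespace is unwanted.

```python
import re
import pandas as pd

# Hypothetical stand-ins for RemoveDB['REMOVE'] and df['NAME']
remove_words = ['CORPORATION', 'LLC', 'INC.']
df = pd.DataFrame({'NAME': ['ACME CORPORATION', 'GLOBEX LLC', 'INITECH INC.']})

# Escape each word so characters like '.' are matched literally,
# then join them into a single alternation pattern
pattern = '|'.join(re.escape(w) for w in remove_words)

# One vectorized pass removes every listed word; strip() tidies leftover spaces
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True).str.strip()

print(df['NAME'].tolist())  # ['ACME', 'GLOBEX', 'INITECH']
```

Without escaping, an entry such as 'INC.' would be read as a regex in which '.' matches any character, which may remove more than intended.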
Answered by user1966723
Another example (without regex) that may still be useful to someone.
import pandas as pd

id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series([' marketing','admin\n', 'finance\n'])
testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
testsO['team'] = testsO['team'].str.strip()
print(testsO)
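For comparison, a short sketch of what `str.strip()` removes that a newline-only regex such as `[\n\r]+` does not. The Series mirrors the team column from the question.

```python
import pandas as pd

team = pd.Series([' marketing', 'admin\n', 'finance\n'])

# str.strip() drops leading and trailing whitespace, spaces and newlines alike
stripped = team.str.strip()

# The newline-only regex leaves the leading space in ' marketing'
regex_only = team.str.replace(r'[\n\r]+', '', regex=True)

print(stripped.tolist())    # ['marketing', 'admin', 'finance']
print(regex_only.tolist())  # [' marketing', 'admin', 'finance']
```

If the stray characters only ever appear at the start or end of a value, `str.strip()` is probably the simpler choice; the regex is needed only when line breaks can occur in the middle of a string.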