Original URL: http://stackoverflow.com/questions/41472951/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Using regex matched groups in pandas dataframe replace function
Asked by Peter D
I'm just learning python/pandas and like how powerful and concise it is.
During data cleaning I want to use replace on a column in a dataframe with regex but I want to reinsert parts of the match (groups).
Simple Example: lastname, firstname -> firstname lastname
I tried something like the following (actual case is more complex so excuse the simple regex):
df['Col1'].replace({'([A-Za-z])+, ([A-Za-z]+)' : '\2 \1'}, inplace=True, regex=True)
However, this results in empty values. The match part works as expected, but the value part doesn't. I guess this could be achieved by some splitting and merging, but I am looking for a general answer as to whether the regex group can be used in replace.
Answered by MaxU
I think you have a few issues with the RegEx's.
As @Abdou just said, use either '\\2 \\1' or, better, r'\2 \1', because '\1' in a regular (non-raw) string is the character with ASCII code 1.
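To make the escaping point concrete, here is a small sketch using only the standard `re` module. The variable names and the sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Non-raw string: Python reads '\2' and '\1' as octal escapes,
# so the regex engine never sees any backreferences.
plain = '\2 \1'
# Raw string: the backslashes are kept, so the regex engine
# receives the backreferences \2 and \1.
raw = r'\2 \1'

pattern = r'(\w+),\s+(\w+)'

broken = re.sub(pattern, plain, 'John, Doe')
fixed = re.sub(pattern, raw, 'John, Doe')

print([ord(c) for c in plain])  # code points of the three characters
print(repr(broken))             # control characters instead of the names
print(fixed)                    # the two groups, swapped
```

The control characters are usually invisible when printed, which is likely why the result looked like empty values in the question.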
Your solution should work if you use correct RegExes:
In [193]: df
Out[193]:
              name
0        John, Doe
1  Max, Mustermann
In [194]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1'}, regex=True)
Out[194]:
0          Doe John
1    Mustermann Max
Name: name, dtype: object
In [195]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1', 'Max':'Fritz'}, regex=True)
Out[195]:
0            Doe John
1    Mustermann Fritz
Name: name, dtype: object
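For reference, the same idea as a self-contained script rather than an interactive session. This is a minimal sketch that assumes pandas is installed; it assigns the result back to the column instead of using inplace=True, which is a stylistic choice here and not something the answer prescribes.

```python
import pandas as pd

# Sample data mirroring the session above.
df = pd.DataFrame({'name': ['John, Doe', 'Max, Mustermann']})

# Two capture groups: the text before the comma and the text after it.
pattern = r'(\w+),\s+(\w+)'

# The raw replacement string keeps \2 and \1 as regex backreferences.
df['name'] = df['name'].replace({pattern: r'\2 \1'}, regex=True)

print(df['name'].tolist())
```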
Answered by piRSquared
setup
import pandas as pd
df = pd.DataFrame(dict(name=['Smith, Sean']))
print(df)
          name
0  Smith, Sean
using replace
df.name.str.replace(r'(\w+),\s*(\w+)', r'\2 \1', regex=True)  # regex=True is required on pandas 2.0+, where the default is False
0 Sean Smith
Name: name, dtype: object
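When the replacement needs more logic than a backreference string can express, str.replace also accepts a callable that receives the match object. The sketch below is an illustration only: the helper name `swap` and the upper-casing of the last name are assumptions added for the example, not part of the original answer.

```python
import pandas as pd

names = pd.Series(['Smith, Sean'])

def swap(match):
    # Hypothetical helper: group(1) is the last name, group(2) the first name.
    return f"{match.group(2)} {match.group(1).upper()}"

# A callable replacement requires regex=True.
result = names.str.replace(r'(\w+),\s*(\w+)', swap, regex=True)

print(result.tolist())
```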
using extract
split to two columns
df.name.str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)
    Last First
0  Smith  Sean
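Once the groups are in separate columns, they can be recombined in any order. The following is a minimal sketch under the same setup; the variable names `parts` and `full` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(name=['Smith, Sean']))

# Named groups become the column names of the extracted frame.
parts = df['name'].str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)

# Rebuild "firstname lastname" from the two extracted columns.
full = parts['First'] + ' ' + parts['Last']

print(full.tolist())
```

Compared with a single replace call, this route keeps the individual parts available for further cleaning.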