Extract a substring between 2 special characters from one column of a Pandas DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original StackOverflow address: http://stackoverflow.com/questions/44000278/

Time: 2020-09-14 03:36:34  Source: igfitidea

Tags: python, regex, pandas

Asked by raja

I have a Python Pandas DataFrame like this:

Name  
Jim, Mr. Jones
Sara, Miss. Baker
Leila, Mrs. Jacob
Ramu, Master. Kuttan 

I would like to extract only the title from the Name column and copy it into a new column named Title. The output DataFrame should look like this:

Name                    Title
Jim, Mr. Jones          Mr
Sara, Miss. Baker       Miss
Leila, Mrs. Jacob       Mrs
Ramu, Master. Kuttan    Master

I tried to find a solution with regex but could not get the right result.

Accepted answer by MaxU

In [157]: df['Title'] = df.Name.str.extract(r',\s*([^\.]*)\s*\.', expand=False)

In [158]: df
Out[158]:
                   Name   Title
0        Jim, Mr. Jones      Mr
1     Sara, Miss. Baker    Miss
2     Leila, Mrs. Jacob     Mrs
3  Ramu, Master. Kuttan  Master

or

In [163]: df['Title'] = df.Name.str.split(r'\s*,\s*|\s*\.\s*').str[1]

In [164]: df
Out[164]:
                   Name   Title
0        Jim, Mr. Jones      Mr
1     Sara, Miss. Baker    Miss
2     Leila, Mrs. Jacob     Mrs
3  Ramu, Master. Kuttan  Master
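
The two approaches above can be tried outside an IPython session with a minimal, self-contained sketch. The DataFrame is rebuilt from the sample rows in the question; the variable names `title_extract` and `title_split` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": [
        "Jim, Mr. Jones",
        "Sara, Miss. Baker",
        "Leila, Mrs. Jacob",
        "Ramu, Master. Kuttan",
    ]
})

# Approach 1: capture the text between the comma and the first period.
title_extract = df["Name"].str.extract(r",\s*([^.]*)\s*\.", expand=False)

# Approach 2: split on commas or periods (with optional surrounding
# whitespace) and take the second piece.
title_split = df["Name"].str.split(r"\s*,\s*|\s*\.\s*").str[1]

# Either result can be assigned to the new column.
df["Title"] = title_extract
print(df)
```

Both variants rely on every row containing a comma followed later by a period; the split variant additionally assumes the title is always the second piece.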

Answer by svdc

Have a look at str.extract.

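The regexp you are looking for is (?<=, )\w+(?=\.). In words: take the substring that is preceded by a comma and a space (not included in the match), consists of at least one word character, and is followed by a . (also not included). The dot must be escaped, since an unescaped . matches any character. Note that str.extract requires a capture group, so wrap \w+ in parentheses when passing the pattern to it. In future, use an online regexp tester such as regex101; regexps become much easier to work out that way.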

This assumes each entry in the Name column is formatted the same way.
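
As a rough illustration of the lookaround pattern described above, the sketch below passes it to str.extract with the dot escaped and \w+ wrapped in a capture group. The third entry, "Cher", is a hypothetical row added here (it is not in the question) to show what happens when an entry does not follow the expected format.

```python
import pandas as pd

names = pd.Series([
    "Jim, Mr. Jones",
    "Sara, Miss. Baker",
    "Cher",  # hypothetical entry without a title, not from the question
])

# The lookbehind and lookahead keep ", " and "." out of the match; the
# parentheses around \w+ give str.extract the capture group it requires.
titles = names.str.extract(r"(?<=, )(\w+)(?=\.)", expand=False)

print(titles)
```

Rows that do not match the pattern come back as missing values rather than raising an error, so it may be worth checking the result with `titles.isna()` before relying on the new column.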
