Python: How to extract specific content in a pandas dataframe with a regex?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/36028932/

Date: 2020-08-19 17:19:08  Source: igfitidea

How to extract specific content in a pandas dataframe with a regex?

Tags: python, regex, string, python-2.7, pandas

Asked by tumbleweed

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()

Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I would like to extract just the movie titles with a regular expression, so let's use the following regex: \b([^\d\W]+)\b. I tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the movie titles into a completely new dataframe? For instance, the desired output would be:

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

Answered by jezrael

You can try str.extract and strip, but str.split is the better choice, because movie titles can contain numbers too. Another solution is to replace the content of the parentheses with a regex and strip the leading and trailing whitespace:

# convert the column to strings
df['movie_title'] = df['movie_title'].astype(str)

# extract letters and spaces only; note that this also drops numbers in movie titles
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# keep everything before the first '('
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesized part; regex=True is needed on pandas 2.0 and later
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
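
A side note on why the attempt in the question returns only NaN, offered as a likely explanation rather than something stated in the thread: in a normal (non-raw) Python string literal, '\b' is the backspace character, so the pattern never reaches the regex engine as a word boundary. The sketch below uses only the standard re module, and the variable names are illustrative.

```python
import re

title = "Toy Story (1995)"

# In a normal string literal '\b' is the single backspace character (0x08);
# in a raw string r'\b' stays as backslash + 'b', which the regex engine
# reads as a word boundary.
backspace = '\b'
word_boundary = r'\b'

# The pattern as the regex engine receives it from a non-raw literal:
# literal backspaces around the group, so nothing in the title can match.
broken = re.search(backspace + r'([^\d\W]+)' + backspace, title)

# The same pattern written as a raw string matches, but only the first word.
first_word = re.search(r'\b([^\d\W]+)\b', title).group(1)

# Capturing everything before " (" returns the whole title.
full_title = re.search(r'(.+?) \(', title).group(1)

print(broken, first_word, full_title)
```

So a raw string alone would stop the NaN results, but the pattern itself would still need to change to capture multi-word titles.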

Answered by su79eu7k

You should wrap the part of the text you want in a capturing group, written with (), as below:

# new_df is assumed to be an existing DataFrame with the same index as df
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']

pandas.core.strings.StringMethods.extract

StringMethods.extract(pat, flags=0, **kwargs)

Find groups in each string using passed regular expression
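
To relate this to the request for a completely new dataframe, here is a minimal sketch, assuming a recent pandas version in which str.extract accepts the expand parameter (the signature quoted above is from an older release). The sample data and the names df, pattern, titles and new_df are illustrative.

```python
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({'movie_title': ['Toy Story (1995)',
                                   'GoldenEye (1995)',
                                   'Four Rooms (1995)']})

# Raw string with one capturing group: everything before " ("
pattern = r'(.+?) \('

# expand=False with a single group returns a Series
titles = df['movie_title'].str.extract(pattern, expand=False)

# expand=True returns a DataFrame with one column per group;
# the column is labelled 0 because the group is unnamed, so rename it
new_df = df['movie_title'].str.extract(pattern, expand=True)
new_df.columns = ['just_movie_titles']

print(new_df)
```

Choosing expand=True gives a separate dataframe directly, while expand=False is convenient when the result is assigned to a single column.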
