How to extract specific content in a pandas dataframe with a regex?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/36028932/
Asked by tumbleweed
Consider the following pandas dataframe:
In [114]:
df['movie_title'].head()
Out[114]:
0 Toy Story (1995)
1 GoldenEye (1995)
2 Four Rooms (1995)
3 Get Shorty (1995)
4 Copycat (1995)
...
Name: movie_title, dtype: object
Update: I would like to extract just the movie titles with a regular expression. So, let's use the following regex: \b([^\d\W]+)\b. So I tried the following:
df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']
However, I get the following:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the movie titles into a completely new dataframe? For instance, the desired output should be:
Out[114]:
0 Toy Story
1 GoldenEye
2 Four Rooms
3 Get Shorty
4 Copycat
...
Name: movie_title, dtype: object
Answered by jezrael
You can try str.extract and strip, but it is better to use str.split, because movie titles can contain numbers too. Another solution is to replace the content of the parentheses using a regex and then strip leading and trailing whitespace:
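# convert the column to strings
df['movie_title'] = df['movie_title'].astype(str)
# note: this pattern also drops numbers that are part of a title
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split once on the first "(" and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesised part; regex=True is needed on pandas versions where the default is a literal match
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)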
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
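Since the question asks for the titles in a completely new dataframe, here is a minimal sketch of one way to get that with str.extract and named groups. It is not part of the original answer: the sample data is rebuilt by hand from the output above, and the anchored pattern, the group names title and year, and the assumption that every title ends in a four-digit year in parentheses are illustrative choices only.

```python
import pandas as pd

# Sample data rebuilt by hand from the answer's printed output (an assumption).
df = pd.DataFrame({
    "movie_title": [
        "Toy Story 2 (1995)",
        "GoldenEye (1995)",
        "Four Rooms (1995)",
        "Get Shorty (1995)",
        "Copycat (1995)",
    ]
})

# Hypothetical pattern: everything before a trailing "(YYYY)" is the title.
# Named groups become the column names of the DataFrame that extract returns.
pattern = r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$"

# With the default expand=True, str.extract returns a new DataFrame,
# one column per capture group; rows that do not match would hold NaN.
parts = df["movie_title"].str.extract(pattern)
print(parts)
```

Because the pattern is anchored to the trailing year, digits inside a title such as "Toy Story 2" are kept. Titles that do not end in a parenthesised four-digit year would need a different pattern.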
Answered by su79eu7k
You should mark the part of the text you want to capture as a group with (), as shown below.
# assumes new_df already exists, e.g. new_df = pd.DataFrame()
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']
pandas.core.strings.StringMethods.extract
StringMethods.extract(pat, flags=0, **kwargs)
Find groups in each string using passed regular expression
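To see how the patterns in this thread behave on a single title, the sketch below uses only Python's standard re module instead of pandas. It assumes str.extract takes the groups of the first match, much as re.search does. In a non-raw string literal, '\b' is the backspace character rather than a word boundary, which would be consistent with the NaN results in the question. The variable names are illustrative.

```python
import re

title = "Toy Story (1995)"

# In a normal string literal '\b' is a single backspace character (0x08);
# in a raw string r'\b' stays as backslash + 'b', the regex word boundary.
print('\b' == '\x08')   # True
print(len(r'\b'))       # 2

# Same characters as the question's non-raw literal '\b([^\d\W]+)\b'
# (backslashes doubled only to avoid invalid-escape warnings).
backspace_pattern = '\b([^\\d\\W]+)\b'
word_pattern = r'\b([^\d\W]+)\b'    # raw-string version of the question's regex
title_pattern = r'(.+?) \('         # pattern from the answer above

# The title contains no backspace characters, so nothing matches.
no_match = re.search(backspace_pattern, title)

# With a real word boundary the first match is only the first word.
first_word = re.search(word_pattern, title).group(1)

# Lazily capture everything before the first " (".
full_title = re.search(title_pattern, title).group(1)

print(no_match, first_word, full_title)
```

So a raw string alone would not fix the question's regex: the group stops at the first space and captures only "Toy", while the pattern from the answer captures the whole title.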