Python: How to extract specific content from a pandas DataFrame using a regular expression?

Question

Asked by tumbleweed

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()

Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I would like to extract just the titles of the movies with a regular expression. So, let's use the following regex: \b([^\d\W]+)\b. I tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

Answer 1

Answered by jezrael

You can try str.extract and strip, but it is better to use str.split, because movie names can contain numbers too. Another solution is to replace the content of the parentheses using a regex and strip the leading and trailing whitespace:

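# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)

# note: this pattern also drops digits that are part of a movie name
df['titles'] = df['movie_title'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
# keep everything before the first '('
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesised part; pandas 2.0 and later treat the pattern literally unless regex=True
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)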
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
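
The approaches above sidestep the pattern from the question, but it may help to see why that attempt produced only NaN. A likely cause is that the pattern was written as an ordinary string literal: in Python, "\b" in a non-raw string is the backspace character, so the regex engine never sees a word boundary. The sketch below is an illustration on made-up sample data, not part of the original answer. It also shows that the raw-string version matches, but only captures the first word of each title.

```python
import pandas as pd

# Sample data mirroring the question (illustrative, not the asker's full dataset).
titles = pd.Series(
    ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"],
    name="movie_title",
)

# Same characters as the question's non-raw literal: "\b" here is a backspace (\x08).
plain_pattern = "\b([^\\d\\W]+)\b"
no_match = titles.str.extract(plain_pattern, expand=False)

# Raw string: "\b" now reaches the regex engine as a word boundary.
raw_pattern = r"\b([^\d\W]+)\b"
first_word = titles.str.extract(raw_pattern, expand=False)

print(no_match)    # every row is NaN because no title contains a backspace
print(first_word)  # only the first word of each title is captured
```

So a raw string fixes the NaN result, but the pattern itself still stops at the first space, which is why splitting on '(' or capturing everything before the parenthesis is a better fit here.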

Answer 2

Answered by su79eu7k

You should define capture group(s) with () as shown below to capture the specific part of the text you want.

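# raw string keeps the backslash for the regex engine; expand=False returns a Series
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']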
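
The snippet above assumes that df and new_df already exist. As a minimal self-contained sketch (the sample rows and the way new_df is built are assumptions for illustration), the same capture group can be run like this:

```python
import pandas as pd

# Illustrative sample data shaped like the question's column.
df = pd.DataFrame(
    {"movie_title": ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"]}
)

# (.+?) lazily captures everything before the first " (".
titles = df["movie_title"].str.extract(r"(.+?) \(", expand=False)

# Put the result in a separate DataFrame, as the question asks.
new_df = titles.to_frame(name="just_movie_titles")
print(new_df)
```

Rows whose title has no " (" do not match the pattern and come back as NaN, so it may be worth checking for missing values after the extraction.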

pandas.core.strings.StringMethods.extract
StringMethods.extract(pat, flags=0, **kwargs)
Find groups in each string using passed regular expression
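
To illustrate what "find groups" means when a pattern has more than one group, here is a small sketch with two named groups. Capturing the year is an addition for illustration only and is not something the question asks for. With the default expand behaviour in current pandas, each group becomes a column of the returned DataFrame.

```python
import pandas as pd

# Illustrative sample data shaped like the question's column.
titles = pd.Series(["Toy Story (1995)", "GoldenEye (1995)"], name="movie_title")

# Two named groups: the title text, then a four-digit year inside parentheses.
parts = titles.str.extract(r"(?P<title>.+?) \((?P<year>\d{4})\)")

# Each named group becomes a column; captured values are strings.
print(parts)
```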
