pandas: substring on a pandas dataframe column

Question

Asked by Mike

I want to extract a substring (the titles: Mr., Mrs., Miss, etc.) from a column (Name) in a pandas dataframe and then write the new column (Title) back into the dataframe.

In the Name column of the dataframe I have a name such as "Braund, Mr. Owen Harris". The two delimiters are the comma (,) and the period (.).

I have attempted to use a split method, but this only splits the original string in two within a list, so I still end up with ['Braund', ' Mr. Owen Harris'] in the list.

import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
a = df_Train['Name'].str.split(',')
for i in a:
    print(i[1])

I think this might be a situation where regex comes into play. My reading suggests that a lookahead (?=,) and lookbehind (?<='.') approach should do the trick. For example:

import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])

But I am running into errors (EOL while scanning string literal). Can someone point me in the right direction?

Cheers, Mike

Answer 1

Answered by Scott Boston

You can do it like this:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()

Output of head(5):

0       Mr
1      Mrs
2     Miss
3      Mrs
4       Mr

Summary of results:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
             .value_counts()

Output:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Mme               1
Sir               1
Ms                1
the Countess      1
Jonkheer          1
Don               1
Capt              1
Name: Name, dtype: int64
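
To write the result back as the Title column the question asks for, the same chain can be assigned to a new column. This is a minimal sketch, not part of the original answer: the three sample rows are assumed stand-ins for the CSV, and `df` is a hypothetical name used in place of `df_Train`.

```python
import pandas as pd

# Assumed sample rows in the "Surname, Title. Given names" layout from the question.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Keep the text after the first comma, then the text before the first period,
# and strip the surrounding whitespace.
df["Title"] = (
    df["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

print(df["Title"].tolist())  # ['Mr', 'Mrs', 'Miss']
```

Wrapping the chain in parentheses avoids the trailing backslash used for line continuation above.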

Answer 2

Answered by maxymoo

The error comes from the single quotes around the period inside your single-quoted regex string literal: the inner quote ends the string early, so this isn't valid syntax. I think you meant to use an escaped period, i.e. r'(?=,)*(?<=\.)'. However, you don't need lookahead/lookbehind here; it's more usual and simpler to use a capture group to describe what you want to extract. In this case the regex would be:

df_Train['Name'].str.extract(r", (\w*)\.")
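
As a rough illustration of the capture-group approach, the sketch below assigns the extracted title to a new column. The sample rows and the name `df` are assumptions standing in for `df_Train`; `expand=False` is used so that a single capture group yields a Series. The second pattern is a variant that is not from the original answer, shown because `\w*` cannot match a title containing a space.

```python
import pandas as pd

# Assumed sample rows; the last one has a two-word title.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    ]
})

# The answer's pattern: comma, space, a run of word characters, then a literal period.
# Rows where the pattern does not match (here the two-word title) become NaN.
df["Title"] = df["Name"].str.extract(r", (\w*)\.", expand=False)

# Assumed variant: capture everything between the comma and the first period,
# so titles that contain a space are kept as well.
df["TitleLoose"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

print(df[["Title", "TitleLoose"]])
```

Which pattern is preferable depends on whether multi-word titles such as "the Countess" should be kept or left as missing values.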
