Original question: http://stackoverflow.com/questions/47274888/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Substring on pandas dataframe column
Asked by Mike
I want to extract a substring (the title: Mr., Mrs., Miss., etc.) from a column (Name) in a pandas dataframe and then write it back into the dataframe as a new column (Title).
In the Name column of the dataframe I have names such as "Braund, Mr. Owen Harris". The two delimiters are the comma (,) and the period (.).
I have attempted to use a split method, but this only splits the original string in two within a list, so I still end up with ['Braund', ' Mr. Owen Harris'] in the list.
import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
# Split each name at the comma; every element becomes a two-item list
a = df_Train['Name'].str.split(',')
for i in a:
    print(i[1])
I am thinking this might be a situation where regex comes into play. My reading suggests that a lookahead (?=,) and lookbehind (?<='.') approach should do the trick. For example:
import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])
But I am running into errors (EOL while scanning string literal). Can someone point me in the right direction?
Cheers, Mike
Answered by Scott Boston
You can do it like this:
df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
Output of head(5):
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
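To make the chained call easier to follow, here is a minimal sketch that runs each step separately. The three sample rows and the variable names are assumptions for illustration; they only mimic the "Surname, Title. Given names" layout described in the question.

```python
import pandas as pd

# Hypothetical sample rows (not the asker's CSV), same layout as the Name column
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
]})

# Step 1: split at the comma and keep the second piece, e.g. ' Mr. Owen Harris'
after_comma = df["Name"].str.split(",").str[1]

# Step 2: split at the period and keep the first piece, e.g. ' Mr'
before_period = after_comma.str.split(".").str[0]

# Step 3: strip the leading space left over from ', '
titles = before_period.str.strip()

print(titles.tolist())
```

Each `.str[...]` indexes into the list produced by the preceding `.str.split(...)`, so the chain never needs an explicit Python loop.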
Summary of results:
df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
.value_counts()
Output:
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Col 2
Major 2
Lady 1
Mme 1
Sir 1
Ms 1
the Countess 1
Jonkheer 1
Don 1
Capt 1
Name: Name, dtype: int64
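The question also asks to write the result back as a new Title column, which the snippets above do not show. A possible sketch, assuming a small stand-in dataframe (`df_sample` and its rows are hypothetical, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-in for df_Train
df_sample = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Allen, Mr. William Henry",
]})

# Assign the extracted titles to a new column on the same dataframe
df_sample["Title"] = (
    df_sample["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

# Count how often each title occurs
counts = df_sample["Title"].value_counts()
print(counts)
```

Wrapping the chain in parentheses avoids the trailing backslash for line continuation.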
Answered by maxymoo
The error comes from the single quotes around the period inside your single-quoted regex string literal: the inner quote ends the string early, so what follows is no longer valid syntax. I think you meant to use an escaped period, i.e. r'(?=,)*(?<=\.)'. However, you don't need lookahead/lookbehind here; it is more usual and simpler to use a capture group to describe what you want to extract. In this case the call would be:
df_Train['Name'].str.extract(r", (\w*)\.")
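As a rough illustration of how the capture-group approach behaves, the sketch below runs it on a few sample names. The sample rows, the variable names, and the two alternative patterns are assumptions added for comparison, not part of the original answer. Note that `\w*` matches a single word only, so a multi-word title such as "the Countess" yields a missing value with that pattern.

```python
import pandas as pd

# Hypothetical sample names in the same layout as the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture group from the answer; expand=False returns a Series instead of a DataFrame
single_word = names.str.extract(r", (\w*)\.", expand=False)

# Assumed variant: capture everything between ', ' and the first period
any_title = names.str.extract(r", ([^.]*)\.", expand=False)

# Assumed lookaround variant, closer to the idea in the question;
# str.extract still requires a capture group around the match
lookaround = names.str.extract(r"((?<=, )[^.]+(?=\.))", expand=False)

print(single_word.tolist())
print(any_title.tolist())
print(lookaround.tolist())
```

The result of `str.extract` can be assigned directly to a new column, for example `df_Train['Title'] = ...`, in the same way as the split-based chain.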