pandas split list into columns with regex
Original question: http://stackoverflow.com/questions/46928636/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverFlow.
Asked by sheldonzy
I have a string list:
content
01/09/15, 10:07 - message1
01/09/15, 10:32 - message2
01/09/15, 10:44 - message3
I want a data frame, like:
date message
01/09/15, 10:07 message1
01/09/15, 10:32 message2
01/09/15, 10:44 message3
Since every string in the list starts in that format, I can just split on " - ", but I would rather find a smarter way to do it.
history = pd.DataFrame([line.split(" - ", 1) for line in content], columns=['date', 'message'])
(I'll convert the date column to datetime afterwards.)
Any help would be appreciated.
Answered by Zero
You can use str.extract, where named groups become the column names:
In [5827]: df['content'].str.extract(r'(?P<date>[\s\S]+) - (?P<message>[\s\S]+)',
                                     expand=True)
Out[5827]:
date message
0 01/09/15, 10:07 message1
1 01/09/15, 10:32 message2
2 01/09/15, 10:44 message3
Details
In [5828]: df
Out[5828]:
content
0 01/09/15, 10:07 - message1
1 01/09/15, 10:32 - message2
2 01/09/15, 10:44 - message3
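For reference, here is a self-contained sketch of the str.extract approach, using the sample rows from the question. One deliberate difference from the pattern above: the date group is lazy ([\s\S]+?), so the match stops at the first " - ". This is an assumption about the desired behavior. With the greedy pattern, a message that itself contains " - " would be split at the last separator instead.

```python
import pandas as pd

# Sample rows copied from the question.
content = [
    "01/09/15, 10:07 - message1",
    "01/09/15, 10:32 - message2",
    "01/09/15, 10:44 - message3",
]
df = pd.DataFrame({"content": content})

# Named groups become the column names of the result.
# The lazy date group stops at the first " - " (a variation on the answer's pattern).
pattern = r"(?P<date>[\s\S]+?) - (?P<message>[\s\S]+)"
history = df["content"].str.extract(pattern, expand=True)
print(history)
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed lines easy to spot.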
Answered by jezrael
Use str.split with the regex \s+-\s+, where \s+ matches one or more whitespace characters:
df[['date','message']] = df['content'].str.split(r'\s+-\s+', expand=True)
print(df)
content date message
0 01/09/15, 10:07 - message1 01/09/15, 10:07 message1
1 01/09/15, 10:32 - message2 01/09/15, 10:32 message2
2 01/09/15, 10:44 - message3 01/09/15, 10:44 message3
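A runnable sketch of the same split, again on the question's sample rows. The n=1 argument is an addition, not part of the answer: it limits the split to the first separator, so a message containing its own " - " is not broken into extra columns.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "content": [
        "01/09/15, 10:07 - message1",
        "01/09/15, 10:32 - message2",
        "01/09/15, 10:44 - message3",
    ]
})

# Split on a dash surrounded by whitespace; n=1 keeps everything after
# the first separator together in the message column.
df[["date", "message"]] = df["content"].str.split(r"\s+-\s+", n=1, expand=True)
print(df)
```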
If you need to remove the content column, use DataFrame.pop:
df[['date','message']] = df.pop('content').str.split(r'\s+-\s+', expand=True)
print(df)
date message
0 01/09/15, 10:07 message1
1 01/09/15, 10:32 message2
2 01/09/15, 10:44 message3
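The question mentions converting the date column to datetime afterwards. A minimal sketch of that step follows, assuming the strings are day/month/two-digit year. The sample data does not settle whether 01/09/15 is 1 September or 9 January, so swap %d and %m if the month comes first.

```python
import pandas as pd

# Result of the split above, rebuilt here so the example runs on its own.
history = pd.DataFrame({
    "date": ["01/09/15, 10:07", "01/09/15, 10:32", "01/09/15, 10:44"],
    "message": ["message1", "message2", "message3"],
})

# Assumption: day/month/year order. Use "%m/%d/%y, %H:%M" for month-first data.
history["date"] = pd.to_datetime(history["date"], format="%d/%m/%y, %H:%M")
print(history.dtypes)
```

An explicit format makes the parsing unambiguous and raises an error on rows that do not fit, instead of guessing.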