Original question: http://stackoverflow.com/questions/48919003/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Pandas split on regex
Asked by Parseltongue
I have a pandas DataFrame with a column containing comma-delimited characteristics, like so:
Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect
I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start this process. I am trying to split on commas like so:
df['incident_characteristics'].str.split(',', expand=True)
This doesn't work, however, because there are commas in the middle of descriptions. Instead, I need to split based on a regex match of a comma followed by a space and a capital letter. Can str.split take regex? If so, how is this done?
I think this Regex will do what I need:
,\s[A-Z]
Answered by Wiktor Stribiżew
Yes, split supports regex. According to your requirements
split based on a regex match of a comma followed by a space and a capital letter
you may use
df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)
See the regex demo.
Details
\s*,\s* - a comma enclosed with 0+ whitespaces
(?=[A-Z]) - only if followed with an uppercase ASCII letter
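As a minimal sketch of the pattern above in action, assuming a small made-up sample (the two rows below are illustrative, not the asker's data). The pattern is passed without an explicit regex argument, relying on pandas treating patterns longer than one character as regular expressions by default:

```python
import pandas as pd

# Hypothetical sample rows; only the column name comes from the question.
df = pd.DataFrame({
    "incident_characteristics": [
        "Shot - Wounded/Injured, Suicide - Attempt , Murder/Suicide",
        "Institution/Group/Business, Mass Shooting",
    ]
})

# Split on a comma (with optional surrounding whitespace) only when an
# uppercase ASCII letter follows. The lookahead does not consume the letter.
parts = df["incident_characteristics"].str.split(r"\s*,\s*(?=[A-Z])", expand=True)

print(parts)
```

With expand=True, rows that have fewer items than the widest row are padded with missing values, so the second row here ends with an empty third column.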
However, it seems you also don't want to match commas inside parentheses. To handle that, add a (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) followed by a ):
r'\s*,\s*(?=[A-Z])(?![^()]*\))'
It will prevent matching commas before capitalized words inside parentheses (as long as those parentheses contain no nested parentheses).
See another regex demo.
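To see what the extra lookahead changes, here is a small sketch comparing both patterns. The sample string is made up for illustration: it deliberately puts capitalized words after commas inside parentheses, which is the case the second pattern is meant to protect.

```python
import pandas as pd

# Hypothetical sample: "Accidental" and "One" are capitalized on purpose.
s = pd.Series([
    "Shot - Dead (murder, Accidental, suicide), Suicide - Attempt, "
    "Mass Murder (4+ deceased victims , One location), Murder/Suicide"
])

simple = r"\s*,\s*(?=[A-Z])"
paren_aware = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# The simple pattern also splits inside the parentheses.
simple_items = s.str.split(simple).iloc[0]
# The parentheses-aware pattern keeps each parenthesized group intact.
paren_items = s.str.split(paren_aware).iloc[0]

print(len(simple_items))  # 6
print(paren_items)
```

The second split returns four items, with "Shot - Dead (murder, Accidental, suicide)" and "Mass Murder (4+ deceased victims , One location)" each kept whole.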
Answered by pe-perry
You can try .str.extractall (but I think there are better patterns than mine).
import pandas as pd
txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]
Output:
# match
# 0 0 Shot - Wounded/Injured
# 1 Shot - Dead (murder, accidental, suicide)
# 2 Suicide - Attempt
# 3 Murder/Suicide
# 4 Attempted Murder/Suicide (one variable unsucc...
# 5 Institution/Group/Business
# 6 Mass Murder (4+ deceased victims excluding th...
# 7 Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object
If you use .str.split with the pattern from the question, the first letter of each item will be removed, because it is consumed as part of the delimiter.
df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)
Output:
# 0 1 \
# 0 Shot - Wounded/Injured hot - Dead (murder, accidental, suicide)
# 2 3 \
# 0 uicide - Attempt urder/Suicide
# 4 \
# 0 ttempted Murder/Suicide (one variable unsucces...
# 5 \
# 0 nstitution/Group/Business
# 6 \
# 0 ass Murder (4+ deceased victims excluding the ...
# 7
# 0 ass Shooting (4+ victims injured or killed exc...
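The same effect can be reproduced with the standard re module alone. This is a sketch on a short made-up string, contrasting a delimiter that consumes the capital letter with one that only looks ahead at it:

```python
import re

# Hypothetical short sample without parentheses.
text = "Shot - Wounded/Injured, Suicide - Attempt, Murder/Suicide"

# The capital letter is part of the match, so it is discarded with the delimiter.
consuming = re.split(r",\s[A-Z]", text)

# The lookahead only checks for the capital letter and leaves it in the next item.
lookahead = re.split(r",\s(?=[A-Z])", text)

print(consuming)  # ['Shot - Wounded/Injured', 'uicide - Attempt', 'urder/Suicide']
print(lookahead)  # ['Shot - Wounded/Injured', 'Suicide - Attempt', 'Murder/Suicide']
```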
Answered by Jan
I would first create the data and then feed it into a dataframe, like so:
import pandas as pd, re
junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""
rx = re.compile(r'\([^()]+\)|,(\s+)')
data = [x
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]
df = pd.DataFrame({'incident_characteristics': data})
print(df)
This yields
incident_characteristics
0 Shot - Wounded/Injured
1 Shot - Dead
2 Suicide - Attempt
3 Murder/Suicide
4 Attempted Murder/Suicide
5 Institution/Group/Business
6 Mass Murder
7 Mass Shooting (4+ victims injured or killed ex...
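Additionally, this assumes that commas in parentheses should be ignored when splitting. Note that the pattern also splits on each closed parenthesized group, so those descriptions are dropped from the result.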
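For the dummy-variable columns the question ultimately asks for, one possible route is to split each row into a list, explode the lists into one item per row, and build indicator columns with pd.get_dummies. This is only a sketch under assumptions: the two sample rows are made up, and the split pattern is the parentheses-aware one quoted earlier in the thread rather than anything from this answer.

```python
import pandas as pd

# Hypothetical sample rows; only the column name comes from the question.
df = pd.DataFrame({
    "incident_characteristics": [
        "Shot - Dead (murder, accidental, suicide), Murder/Suicide",
        "Murder/Suicide, Institution/Group/Business",
    ]
})

pattern = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# One list of characteristics per row, then one characteristic per row.
# explode() repeats the original row index for every item.
items = df["incident_characteristics"].str.split(pattern).explode()

# One indicator column per distinct characteristic, collapsed back to one
# row per original record by taking the max within each index value.
dummies = pd.get_dummies(items).groupby(level=0).max().astype(int)

print(dummies)
```

Each original row ends up with a 1 in the column of every characteristic it contains and a 0 elsewhere; "Murder/Suicide" is 1 for both sample rows.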