pandas: splitting on a regular expression

Question

Asked by Parseltongue

I have a pandas df with a column containing comma-delimited characteristics, like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect

I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start. I tried splitting the column like so:

df['incident_characteristics'].str.split(',', expand=True)

This doesn't work, however, because there are commas in the middle of descriptions. Instead, I need to split based on a regex match of a comma followed by a space and a capital letter. Can str.split take regex? If so, how is this done?

I think this Regex will do what I need:

,\s[A-Z]

Answer 1

Answered by Wiktor Stribiżew

Yes, str.split supports regex. According to your requirement to

split based on a regex match of a comma followed by a space and a capital letter

you may use

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)

See the regex demo.

Details

\s*,\s* - a comma surrounded by zero or more whitespace characters
(?=[A-Z]) - only if followed by an uppercase ASCII letter (a lookahead, so the letter itself is not consumed)
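The difference between the lookahead and a pattern that consumes the capital letter can be seen on a small sample. This is a minimal sketch; the three-item string below is a hypothetical, shortened version of the data in the question.

```python
import pandas as pd

# Hypothetical sample: a shortened version of the column from the question.
s = pd.Series(["Shot - Wounded/Injured, Suicide - Attempt, Murder/Suicide"])

# The lookahead (?=[A-Z]) only checks for the capital letter,
# so the letter stays in the next item.
kept = s.str.split(r'\s*,\s*(?=[A-Z])', expand=True)

# Matching [A-Z] directly makes the letter part of the delimiter,
# so it is removed from the next item.
eaten = s.str.split(r',\s[A-Z]', expand=True)

print(kept.iloc[0].tolist())
print(eaten.iloc[0].tolist())
```

Multi-character patterns are treated as regular expressions by str.split by default, so no extra argument is passed here.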

However, it seems you also don't want to match commas inside parentheses. Add a (?![^()]*\)) lookahead, which fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) followed by a ):

r'\s*,\s*(?=[A-Z])(?![^()]*\))'

This prevents matching commas before capitalized words inside parentheses, as long as those parentheses contain no nested parentheses.
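As a rough sketch of how the parenthesis-aware pattern could feed the dummy-variable goal stated in the question: the two sample rows are hypothetical, and the explode/get_dummies step is one possible approach rather than something taken from the answer.

```python
import pandas as pd

pattern = r'\s*,\s*(?=[A-Z])(?![^()]*\))'

# Hypothetical rows; the first has a capitalized word after a comma
# inside parentheses, which the second lookahead protects.
df = pd.DataFrame({'incident_characteristics': [
    'Shot - Dead (murder, Accidental, suicide), Suicide - Attempt',
    'Suicide - Attempt, Murder/Suicide',
]})

# Without expand=True, each row becomes a list of items.
items = df['incident_characteristics'].str.split(pattern)

# One row per item, one indicator column per distinct item,
# then collapse back to one row per original record.
dummies = (
    pd.get_dummies(items.explode())
    .astype(int)
    .groupby(level=0)
    .max()
)
print(dummies)
```

This assumes every characteristic is spelled identically wherever it appears; differences in spacing or capitalization would produce separate columns.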

See another regex demo.

Answer 2

Answered by pe-perry

You can try .str.extractall (but I think there are better patterns than mine).

import pandas as pd

txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]

Output:

#    match
# 0  0                                   Shot - Wounded/Injured
#    1                Shot - Dead (murder, accidental, suicide)
#    2                                        Suicide - Attempt
#    3                                           Murder/Suicide
#    4         Attempted Murder/Suicide (one variable unsucc...
#    5                               Institution/Group/Business
#    6         Mass Murder (4+ deceased victims excluding th...
#    7         Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object

If you use .str.split with the pattern from the question, the first letter of each following item is removed, because the pattern consumes it as part of the delimiter.

df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)

Output:

#                         0                                         1  \
# 0  Shot - Wounded/Injured  hot - Dead (murder, accidental, suicide)
#                   2              3  \
# 0  uicide - Attempt  urder/Suicide
#                                                    4  \
# 0  ttempted Murder/Suicide (one variable unsucces...
#                            5  \
# 0  nstitution/Group/Business
#                                                    6  \
# 0  ass Murder (4+ deceased victims excluding the ...
#                                                    7
# 0  ass Shooting (4+ victims injured or killed exc...

Answer 3

Answered by Jan

I would first create the data and then feed it into a dataframe, like so:

import pandas as pd, re

junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""

rx = re.compile(r'\([^()]+\)|,(\s+)')

data = [x 
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]

df = pd.DataFrame({'incident_characteristics': data})
print(df)

This yields

                            incident_characteristics
0                             Shot - Wounded/Injured
1                                        Shot - Dead
2                                  Suicide - Attempt
3                                     Murder/Suicide
4                           Attempted Murder/Suicide
5                         Institution/Group/Business
6                                        Mass Murder
7  Mass Shooting (4+ victims injured or killed ex...

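Note that this treats each closed parenthesized group as a delimiter, so those descriptions are dropped from the result (the last item keeps its text only because its parenthesis is never closed). It also assumes that commas in parentheses should be ignored when splitting.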
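If the parenthesized descriptions should be kept instead of dropped, one variant is to match the items directly rather than splitting on delimiters. This is a sketch of a different technique from the code above, assuming balanced, non-nested parentheses; the sample string and the name item_rx are illustrative.

```python
import re

# Hypothetical sample with balanced, non-nested parentheses.
text = ("Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), "
        "Mass Murder (4+ victims , one location)")

# An item is a run of parenthesized groups and of characters that are
# neither commas nor parentheses, so commas inside (...) do not end it.
item_rx = re.compile(r'(?:\([^()]*\)|[^,()])+')

# Strip surrounding whitespace and skip empty matches.
items = [m.strip() for m in item_rx.findall(text) if m.strip()]
print(items)
```

The resulting list can be passed to pd.DataFrame in the same way as the data list above.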
