Original source: http://stackoverflow.com/questions/49164935/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to use .loc with groupby and two conditions in pandas
Asked by Martin Bobak
I asked a similar question here, but I want to expand on it because I've been asked to do something a little different, where I cannot use .duplicated().
I have a df that's grouped by 'Key'. I want to flag any row within a group where the discharge date matches the admit date AND, between those rows, the row with the discharge date has a Num1 value in the range of 5-12.
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520]})
# Parse the YYYYMMDD integers into datetimes
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')
initial df
     Key  Num1  Num2      admit  discharge
0  10003    12   121 2012-05-06 2012-05-08
1  10003    13   122 2012-05-08 2012-05-10
2  10003    13   122 2012-10-10 2012-10-12
3  10003    13   124 2012-10-10 2012-10-16
4  10003    12   125 2012-10-10 2012-10-23
5  10003    13   126 2012-11-10 2012-11-11
6  10034    15   127 2012-05-20 2012-05-20
7  10034    12   128 2012-05-20 2012-05-20
final df
     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1
I was trying to use filter() but I can't quite figure out how to apply any() to the discharge date. My logic was to pick the first admit date in a group and then check that date among each discharge date and once there is a match then check if the row that has the same discharge date has a value in Num1 in the range of 5-12.
num1_range = [5,6,7,8,9,10,11,12]
df.loc[df.groupby(['Key']).filter(lambda x : (x['admit'] == x['discharge'].any())&(x['Num1'].isin(num1_range).any())),'flag']=1
I'm getting an error
ValueError: cannot set a Timestamp with a non-timestamp
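For context, groupby(...).filter expects its function to reduce each group to a single True/False, and it returns the rows of the groups that pass rather than a boolean mask. The lambda in the attempt above appears to compare a whole column against the result of .any(), so it does not reduce to one boolean per group. The sketch below only illustrates how filter behaves; the toy dataframe and the names toy and kept are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical toy data, not the dataframe from the question
toy = pd.DataFrame({
    'Key': ['a', 'a', 'b', 'b'],
    'Num1': [7, 20, 30, 40],
})

# The function passed to filter must return one boolean per group:
# here, "does any row of the group have Num1 between 5 and 12?"
kept = toy.groupby('Key').filter(lambda g: g['Num1'].between(5, 12).any())

# filter returns a DataFrame with all rows of the passing groups,
# so its index (not the DataFrame itself) is what .loc can use as a row selector
toy['flag'] = 0
toy.loc[kept.index, 'flag'] = 1
print(toy)
```

Because filter keeps or drops whole groups, it cannot flag individual rows within a group by itself, which is why the answers below build row-level conditions instead.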
Accepted answer by jpp
I believe you are looking for either of two conditions to be satisfied for flag = True:
- Admit date is equal to any discharge date within the group (Key).
- Discharge date is equal to any admit date within the group, provided Num1 is in the range 5 to 12 inclusive.
The below logic produces the result in line with your desired output.
Solution
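# Map each Key to the set of admit dates and the set of discharge dates in its group
d1 = df.groupby('Key')['admit'].apply(set).to_dict()
d2 = df.groupby('Key')['discharge'].apply(set).to_dict()

def flagger(row):
    # match1: this row's discharge date is an admit date in the same group
    # match2: this row's admit date is a discharge date in the same group
    match1, match2 = row['discharge'] in d1[row['Key']], row['admit'] in d2[row['Key']]
    return match2 or (match1 and (row['Num1'] in range(5, 13)))

df['flag'] = df.apply(flagger, axis=1).astype(int)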
Result
     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1
Explanation
- Create two dictionaries mapping Key -> admit dates and Key -> discharge dates, respectively.
- Use these two dictionaries to apply the criteria row by row using pd.DataFrame.apply.
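The same two criteria can also be written with per-group isin checks instead of a row-wise apply. This is only a sketch of an equivalent formulation, not the answer's own code; the loop over groups and the names admit_seen and discharge_seen are assumptions made for illustration, and Num2 is omitted because the logic does not use it.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003'] * 6 + ['10034'] * 2,
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'admit': [20120506, 20120508, 20121010, 20121010,
              20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016,
                  20121023, 20121111, 20120520, 20120520],
})
for col in ('admit', 'discharge'):
    df[col] = pd.to_datetime(df[col], format='%Y%m%d')

flag = pd.Series(0, index=df.index)
for _, g in df.groupby('Key'):
    # Criterion 1: the admit date equals some discharge date in the same group
    admit_seen = g['admit'].isin(g['discharge'])
    # Criterion 2: the discharge date equals some admit date in the same group,
    # and Num1 lies in 5..12 inclusive
    discharge_seen = g['discharge'].isin(g['admit']) & g['Num1'].between(5, 12)
    flag.loc[g.index] = (admit_seen | discharge_seen).astype(int)

df['flag'] = flag
print(df)
```

On the sample data this yields the same flags as the result shown above (1, 1, 0, 0, 0, 0, 1, 1).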
Answer by DJK
Let's break the filtering down into a few steps. First, create the conditions to be filtered:
conditions = "(x['discharge'].isin(x['admit'])) & (x['Num1'] >= 5) & (x['Num1'] <= 12)"
I chose to store the conditions as a string because it looks cleaner for formatting the next step. The filter will check, by Key in the dataframe, whether any discharge times are equal to admit times, and whether Num1 at the discharge time is between 5 and 12. Now we run the groupby operation and evaluate conditions:
filter = df.groupby('Key').apply(lambda x: pd.eval(conditions))
filter.index = filter.index.droplevel(0)
filter will output this:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
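As a side note, the range part of the condition, (x['Num1'] >= 5) & (x['Num1'] <= 12), can also be expressed with Series.between, which is inclusive on both ends by default. A small sketch with made-up values (the series num1 is hypothetical):

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the 5..12 boundaries
num1 = pd.Series([4, 5, 12, 13])

explicit = (num1 >= 5) & (num1 <= 12)   # the form used in the conditions string
compact = num1.between(5, 12)           # same test, both bounds inclusive

print(compact.tolist())
```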
filter provides Boolean flags where the conditions hold true. The last step is to add the flags for admit times equivalent to discharge times. This can be done by merging the initial dataframe with the filtered rows; the result of the merge is used to extract the indices of the admit flags.
dex = df.merge(df[filter.values],left_on=['Key','admit'],right_on=['Key','discharge'],how='left').dropna().index
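One thing to keep in mind with this step: merge returns a result with a fresh 0..n-1 index, so the positions only line up with df when df has a default index and each row produces exactly one output row. The sketch below is a variant that carries the original index through the merge explicitly. It is an assumption-laden illustration rather than the answer's method: the per-group loop, the rename of discharge to admit, and the names mask, flagged and matched are all introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003'] * 6 + ['10034'] * 2,
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'admit': [20120506, 20120508, 20121010, 20121010,
              20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016,
                  20121023, 20121111, 20120520, 20120520],
})
for col in ('admit', 'discharge'):
    df[col] = pd.to_datetime(df[col], format='%Y%m%d')

# Step 1: discharge rows whose date is also an admit date in the same Key
# and whose Num1 is in 5..12 inclusive
mask = pd.Series(False, index=df.index)
for _, g in df.groupby('Key'):
    mask.loc[g.index] = g['discharge'].isin(g['admit']) & g['Num1'].between(5, 12)

# Step 2: find admit rows that match one of those (Key, discharge) pairs.
# reset_index() keeps the original row labels in an 'index' column.
flagged = df.loc[mask, ['Key', 'discharge']].drop_duplicates()
matched = df.reset_index().merge(
    flagged.rename(columns={'discharge': 'admit'}),
    on=['Key', 'admit'], how='inner',
)['index']

# Step 3: a row is flagged if either condition holds
df['flag'] = (mask | df.index.isin(matched)).astype(int)
print(df)
```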
Finally, set the flags where either condition is True:
df['flag'] = (filter | df.index.isin(dex)).astype(int)
Full code:
conditions = "(x['discharge'].isin(x['admit'])) & (x['Num1'] >= 5) & (x['Num1'] <= 12)"
filter = df.groupby('Key').apply(lambda x: pd.eval(conditions))
filter.index = filter.index.droplevel(0)
dex = df.merge(df[filter.values],left_on=['Key','admit'],right_on=['Key','discharge'],how='left').dropna().index
df['flag'] = (filter | df.index.isin(dex)).astype(int)
Output:
     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1
Answer by Victor Chubukov
If I understand the logic correctly, I would write it this way:
num1_range = [5,6,7,8,9,10,11,12]

def get_flags(group):
    # Discharge dates of rows in this group whose Num1 is in range
    flagged_discharge_dates = group.loc[group['Num1'].isin(num1_range), 'discharge']
    # Flag rows whose admit date equals one of those discharge dates
    flag = group['admit'].isin(flagged_discharge_dates)
    flag = flag.astype(int)
    return flag

df['flag'] = df.groupby('Key', group_keys=False).apply(get_flags)
df
This returns
     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     0
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1
which I believe is correct according to the logic specified (but differs from the desired outcome in the original post).
Answer by manoj
Edit: this solution does not use groupby; it uses only .loc.
I believe that what you want is to set flag to 1 when the admit and discharge dates are the same and Num1 is between 5 and 12 (inclusive).
Here is the code that works for the logic.
df.loc[(df['admit'] == df['discharge'] ) & (df['Num1'].isin(num1_range)), 'flag'] = 1
df.loc[~((df['admit'] == df['discharge'] ) & (df['Num1'].isin(num1_range))), 'flag'] = 0
print(df)
The output is:
     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08   0.0
1  10003    13   122 2012-05-08 2012-05-10   0.0
2  10003    13   122 2012-10-10 2012-10-12   0.0
3  10003    13   124 2012-10-10 2012-10-16   0.0
4  10003    12   125 2012-10-10 2012-10-23   0.0
5  10003    13   126 2012-11-10 2012-11-11   0.0
6  10034    15   127 2012-05-20 2012-05-20   0.0
7  10034    12   128 2012-05-20 2012-05-20   1.0
You can see that only the last row satisfies the condition and has flag set as '1'.
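The flag column comes out as floats (0.0 and 1.0) because the first .loc assignment creates the column with missing values in the rows it does not touch. A possible way to get integer flags with the same logic is to build the boolean mask once and cast it, as sketched below; the name same_day is an assumption, and Num2 is left out because the logic does not use it.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003'] * 6 + ['10034'] * 2,
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'admit': [20120506, 20120508, 20121010, 20121010,
              20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016,
                  20121023, 20121111, 20120520, 20120520],
})
for col in ('admit', 'discharge'):
    df[col] = pd.to_datetime(df[col], format='%Y%m%d')

num1_range = list(range(5, 13))  # 5..12 inclusive

# Same row-level condition as the two .loc calls, evaluated once
same_day = df['admit'] == df['discharge']
df['flag'] = (same_day & df['Num1'].isin(num1_range)).astype(int)
print(df)
```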
Hope this helps.