Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/49164935/

Time: 2020-09-14 05:17:07  Source: igfitidea

How to use .loc with groupby and two conditions in pandas

python pandas

Asked by Martin Bobak

I asked a similar question here, but I want to expand on it because I've been asked to do something a little different, where I cannot use .duplicates().

I have a df that's grouped by 'Key'. I want to flag any row within a group where the discharge date matches the admit date AND, between those rows, the row with the discharge date has a Num1 value in the range of 5-12.

import pandas as pd
df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
# Parse the integer dates (YYYYMMDD) into datetimes
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

initial df

    Key     Num1    Num2    admit       discharge
0   10003   12      121     2012-05-06  2012-05-08
1   10003   13      122     2012-05-08  2012-05-10
2   10003   13      122     2012-10-10  2012-10-12
3   10003   13      124     2012-10-10  2012-10-16
4   10003   12      125     2012-10-10  2012-10-23
5   10003   13      126     2012-11-10  2012-11-11
6   10034   15      127     2012-05-20  2012-05-20
7   10034   12      128     2012-05-20  2012-05-20

final df

    Key     Num1    Num2    admit       discharge   flag
0   10003   12      121     2012-05-06  2012-05-08  1
1   10003   13      122     2012-05-08  2012-05-10  1
2   10003   13      122     2012-10-10  2012-10-12  0
3   10003   13      124     2012-10-10  2012-10-16  0
4   10003   12      125     2012-10-10  2012-10-23  0
5   10003   13      126     2012-11-10  2012-11-11  0
6   10034   15      127     2012-05-20  2012-05-20  1
7   10034   12      128     2012-05-20  2012-05-20  1

I was trying to use filter() but I can't quite figure out how to apply any() to the discharge date. My logic was to pick the first admit date in a group and then check that date among each discharge date and once there is a match then check if the row that has the same discharge date has a value in Num1 in the range of 5-12.

num1_range = [5,6,7,8,9,10,11,12]
df.loc[df.groupby(['Key']).filter(lambda x : (x['admit'] == x['discharge'].any())&(x['Num1'].isin(num1_range).any())),'flag']=1

I'm getting an error

ValueError: cannot set a Timestamp with a non-timestamp
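
One thing that may help when reading the attempt above: groupby(...).filter() works at the group level. It expects the function to return a single True/False per group, and it returns a DataFrame containing the rows of the groups that passed, not a boolean row mask. The sketch below uses a made-up toy frame (the names toy and kept are only for illustration) to show that behaviour and how the result's index can then be passed to .loc. It flags whole groups, so it is not a solution to the question as stated.

```python
import pandas as pd

# Hypothetical toy data: group 'A' has an admit date that equals one of its
# discharge dates, group 'B' does not.
toy = pd.DataFrame({
    'Key': ['A', 'A', 'B'],
    'admit': pd.to_datetime(['2012-01-01', '2012-01-03', '2012-02-01']),
    'discharge': pd.to_datetime(['2012-01-03', '2012-01-05', '2012-02-02']),
})

# filter() calls the function once per group and expects one True/False back.
kept = toy.groupby('Key').filter(
    lambda g: bool(g['admit'].isin(g['discharge']).any())
)

# kept is a DataFrame of the surviving rows, so its index can address rows in .loc.
toy['flag'] = 0
toy.loc[kept.index, 'flag'] = 1
print(toy)
```

Row-level conditions, like the Num1 range check on one particular row, need a per-row mask instead, which is what the answers below build.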

Accepted answer by jpp

I believe you are looking for either of 2 conditions to be satisfied for flag = True:

  1. Admit date is equal to any discharge date within the group (Key).
  2. Discharge date is equal to any admit date within the group, provided Num1 is in the range 5 to 12 inclusive.

The below logic produces the result in line with your desired output.

Solution

d1 = df.groupby('Key')['admit'].apply(set).to_dict()
d2 = df.groupby('Key')['discharge'].apply(set).to_dict()

def flagger(row):
    match1, match2 = row['discharge'] in d1[row['Key']], row['admit'] in d2[row['Key']]
    return match2 or (match1 and (row['Num1'] in range(5, 13)))

df['flag'] = df.apply(flagger, axis=1).astype(int)

Result

     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1

Explanation

  • Create 2 dictionaries mapping Key -> admit dates and Key -> discharge dates respectively.
  • Use these 2 dictionaries to apply the criteria row by row using pd.DataFrame.apply.
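
For larger frames, the same two conditions can probably be checked without a row-wise apply. The following is only a sketch under that assumption, not part of the answer above: the helper matches_within_key and the temporary column name 'date' are made up for illustration, and the membership test is done with a left merge plus indicator=True.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')


def matches_within_key(frame, col, other):
    """Boolean array: frame[col] equals some frame[other] value in the same Key."""
    left = frame[['Key', col]].rename(columns={col: 'date'})
    # De-duplicate the lookup side so the left merge keeps one output row per input row.
    right = frame[['Key', other]].rename(columns={other: 'date'}).drop_duplicates()
    merged = left.merge(right, on=['Key', 'date'], how='left', indicator=True)
    return (merged['_merge'] == 'both').to_numpy()


admit_hits = matches_within_key(df, 'admit', 'discharge')      # condition 1
discharge_hits = matches_within_key(df, 'discharge', 'admit')  # condition 2, before the Num1 check
in_range = df['Num1'].between(5, 12).to_numpy()                # 5 to 12 inclusive

df['flag'] = (admit_hits | (discharge_hits & in_range)).astype(int)
print(df)
```

On the sample data this gives the same flag column as the result table above.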

Answer by DJK

Let's break down the filtering into a few steps. First, create the conditions to be filtered:

conditions = "(x['discharge'].isin(x['admit'])) & (x['Num1'] >= 5) & (x['Num1'] <= 12)"

I chose to store the conditions as a string because it looks cleaner when formatting the next step. The filter will be used to check, by Key in the dataframe, whether any discharge times are equal to admit times, and also whether Num1 at the discharge time is between 5 and 12. Now we run the groupby operation and evaluate conditions:

filter = df.groupby('Key').apply(lambda x: pd.eval(conditions))
filter.index = filter.index.droplevel(0)

filter will output this:

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
dtype: bool
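
The same per-group mask can also be built without pd.eval and without shadowing the built-in name filter. This is just an alternative sketch, not the code from this answer: the function name discharge_condition and the variable mask are made up, and the groups are looped over explicitly and re-assembled with pd.concat.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')


def discharge_condition(group):
    """True where the discharge date is also an admit date in this group
    and Num1 lies between 5 and 12 inclusive."""
    return group['discharge'].isin(group['admit']) & group['Num1'].between(5, 12)


# Evaluate the condition group by group, then restore the original row order.
mask = pd.concat(
    [discharge_condition(group) for _, group in df.groupby('Key')]
).sort_index()
print(mask)
```

Here mask keeps the original row index, so no droplevel step is needed.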

filter provides Boolean flags where the conditions hold true. The last step is to add the flags for admit times equivalent to discharge times, which can be done by merging the initial dataframe with the filtered rows; the merge result is used to extract the indices of the admit flags.

dex = df.merge(df[filter.values],left_on=['Key','admit'],right_on=['Key','discharge'],how='left').dropna().index

Finally, set the flags where either condition is True:

df['flag'] = (filter | df.index.isin(dex)).astype(int)


Full code:

conditions = "(x['discharge'].isin(x['admit'])) & (x['Num1'] >= 5) & (x['Num1'] <= 12)"
filter = df.groupby('Key').apply(lambda x: pd.eval(conditions))
filter.index = filter.index.droplevel(0)
dex = df.merge(df[filter.values],left_on=['Key','admit'],right_on=['Key','discharge'],how='left').dropna().index
df['flag'] = (filter | df.index.isin(dex)).astype(int)

Output:

     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1

Answer by Victor Chubukov

If I understand the logic correctly, I would write it this way:

num1_range = [5,6,7,8,9,10,11,12]

def get_flags(group):
    flagged_discharge_dates=group.loc[group['Num1'].isin(num1_range),'discharge']
    flag=group['admit'].isin(flagged_discharge_dates)
    flag=flag.astype(int)
    return flag

df['flag']=df.groupby('Key',group_keys=False).apply(get_flags)
df

This returns

     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     0
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1

which I believe is correct according to the logic specified (but differs from the desired outcome in the original post).
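
To see where this logic and the desired outcome part ways, the two flag columns can be compared directly. The sketch below is an illustration only: the name desired holds the flag column copied from the question's final df, and the groups are looped over explicitly rather than going through groupby.apply.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

# Flag column copied from the "final df" table in the question.
desired = pd.Series([1, 1, 0, 0, 0, 0, 1, 1], index=df.index)


def get_flags(group):
    # Discharge dates of rows whose Num1 is in 5..12, then admit dates matching them.
    flagged = group.loc[group['Num1'].between(5, 12), 'discharge']
    return group['admit'].isin(flagged).astype(int)


flags = pd.concat([get_flags(g) for _, g in df.groupby('Key')]).sort_index()

# Index labels where the two flag columns disagree.
differs = df.index[(flags != desired).to_numpy()].tolist()
print(differs)
```

The only disagreement is row 0, the row that holds the matching discharge date: the desired outcome flags it as well, while this logic flags only the row whose admit date matches.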

Answer by manoj

Edit: this solution does not use groupby, but it does use .loc.

I believe what you want is to set flag to '1' when the admit and discharge days are the same and Num1 is between 5 and 12 (inclusive).

Here is the code that works for the logic.

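num1_range = list(range(5, 13))  # 5 to 12 inclusive, as defined in the question
# Rows where admit equals discharge and Num1 is in range get 1; all others get 0
df.loc[(df['admit'] == df['discharge']) & (df['Num1'].isin(num1_range)), 'flag'] = 1
df.loc[~((df['admit'] == df['discharge']) & (df['Num1'].isin(num1_range))), 'flag'] = 0
print(df)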

The output is:

     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08   0.0
1  10003    13   122 2012-05-08 2012-05-10   0.0
2  10003    13   122 2012-10-10 2012-10-12   0.0
3  10003    13   124 2012-10-10 2012-10-16   0.0
4  10003    12   125 2012-10-10 2012-10-23   0.0
5  10003    13   126 2012-11-10 2012-11-11   0.0
6  10034    15   127 2012-05-20 2012-05-20   0.0
7  10034    12   128 2012-05-20 2012-05-20   1.0

You can see that only the last row satisfies the condition and has flag set as '1'.
Hope this helps.
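
The flag column above comes out as floats (0.0 and 1.0) because it is filled in two .loc steps. If integer flags are preferred, the same condition can be written as one boolean expression and cast once. This is a small sketch of that variant; the names same_day and in_range are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

same_day = df['admit'] == df['discharge']   # admitted and discharged on the same date
in_range = df['Num1'].between(5, 12)        # 5 to 12 inclusive

# One assignment, integer dtype, no intermediate NaN values.
df['flag'] = (same_day & in_range).astype(int)
print(df)
```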
