Pandas: remove group from the data when a value in the group meets a required condition

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/34690756/

Time: 2020-09-14 00:28:58  Source: igfitidea

Tags: python, pandas, dataframe, grouping

Asked by nrcjea001

My data contains groups of values, and within each group I would like to check whether any value is below 8. If this condition is met, the entire group should be removed from the data set.

Please note that the value I'm referring to lies in a different column from the grouping column.

Example Input:

Groups Count
  1      7
  1      11
  1      9 
  2      12
  2      15
  2      21 

Output:

Groups Count
  2      12
  2      15
  2      21 

Answered by 2342G456DI8

Based on what you described in the question, a group should be dropped as long as at least one value within it is below 8. Equivalently, a group should be dropped whenever its minimum value is below 8.
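
The equivalence between "at least one value below 8" and "minimum below 8" can be checked on a small frame. The sketch below rebuilds the question's sample data and adds a hypothetical group 3 (not in the original question) whose smallest value is exactly 8, to show where the boundary falls. The variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical group 3
# whose minimum Count is exactly 8 (an assumption added for illustration)
df = pd.DataFrame({'Groups': [1, 1, 1, 2, 2, 2, 3, 3],
                   'Count':  [7, 11, 9, 12, 15, 21, 8, 30]})

grouped = df.groupby('Groups')['Count']

# Per group: is there at least one value below 8?
has_low_value = grouped.apply(lambda s: (s < 8).any())

# Per group: is the minimum below 8?
min_is_low = grouped.min() < 8

# Both formulations flag the same groups for removal
print(has_low_value.tolist())
print(min_is_low.tolist())
```

Under this reading, only group 1 is flagged. Group 3 is kept because 8 is not below 8, so a "keep" condition written on the minimum needs `>= 8` rather than `> 8`.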

Using the filter feature (see Filtration in the pandas documentation), the actual code can be reduced to a single line. You may use the following code:

# "below 8" means < 8, so keep the groups whose minimum is >= 8
dfnew = df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
dfnew.reset_index(drop=True, inplace=True)  # reset index
dfnew = dfnew[['Groups', 'Count']]  # rearrange the column sequence
print(dfnew)

Output:
   Groups  Count
0       2     12
1       2     15
2       2     21

Answered by jezrael

You can use isin, loc and unique to select a subset with an inverted mask. Finally, you can reset_index:

print(df)

  Groups  Count
0       1      7
1       1     11
2       1      9
3       2     12
4       2     15
5       2     21

print(df.loc[df['Count'] < 8, 'Groups'].unique())
[1]

print(~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique()))

0    False
1    False
2    False
3     True
4     True
5     True
Name: Groups, dtype: bool

df1 = df[~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique())]
print(df1.reset_index(drop=True))

   Groups  Count
0       2     12
1       2     15
2       2     21

Answered by ALollz

Create a Boolean Series from your condition, then use groupby + transform('any') to form a mask for the original DataFrame. This lets you simply slice the original DataFrame.

df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
#   Groups  Count
#3       2     12
#4       2     15
#5       2     21
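
To make the one-liner easier to follow, here is a minimal sketch that splits it into its intermediate steps on the question's sample data. The names `below`, `drop_mask` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Groups': [1, 1, 1, 2, 2, 2],
                   'Count': [7, 11, 9, 12, 15, 21]})

# Step 1: row-level condition, True where Count is below 8
below = df['Count'].lt(8)

# Step 2: broadcast "any True in my group" back to every row of that group
drop_mask = below.groupby(df['Groups']).transform('any')

# Step 3: keep only the rows whose group contains no value below 8
result = df[~drop_mask]

print(pd.DataFrame({'below': below, 'drop_mask': drop_mask}))
print(result)
```

Because transform returns a Series aligned with the original index, the mask can be applied directly and the surviving rows keep their original index labels (3, 4, 5 here).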


While the syntax of groupby + filter is more straightforward, it performs much worse for a large number of groups, so creating the Boolean mask with transform is preferred. In this example, with 25,000 groups, there's over a 1000x improvement. The .isin method is extremely fast for a single column, but grouping on multiple columns would require switching to a merge.

import pandas as pd
import numpy as np

np.random.seed(123)
N = 50000
df = pd.DataFrame({'Groups': [*range(N//2)]*2,
                   'Count': np.random.randint(0, 1000, N)})

# Double check both are equivalent
(df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
  == df[~df.Count.lt(8).groupby(df.Groups).transform('any')]).all().all()
#True

%timeit df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
#8.15 s ± 80.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
#6.54 ms ± 143 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique())]
#2.88 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
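
The answer mentions that grouping on multiple columns would require a merge instead of .isin but does not show it. One possible sketch, assuming a hypothetical second grouping column named 'Site' (not in the original question), uses a left merge with indicator=True as an anti-join:

```python
import pandas as pd

# Hypothetical data with two grouping columns; 'Site' is an assumption
df = pd.DataFrame({'Site':   ['a', 'a', 'a', 'a', 'b', 'b'],
                   'Groups': [1, 1, 2, 2, 1, 1],
                   'Count':  [7, 11, 12, 15, 9, 21]})
keys = ['Site', 'Groups']

# Unique key combinations that contain at least one Count below 8
bad = df.loc[df['Count'] < 8, keys].drop_duplicates()

# Left merge with indicator=True labels rows whose keys appear in `bad` as 'both'
merged = df.merge(bad, on=keys, how='left', indicator=True)

# Keep rows found only on the left side, then drop the helper column
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```

Note that merge returns a fresh integer index, so if the original index labels matter they would need to be carried along as a column. The groupby + transform('any') mask also accepts a list of grouping keys and avoids that issue.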