Python pandas - filter rows after groupby

Disclaimer: the content below is taken from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27488080/

Python pandas - filter rows after groupby

python, pandas, filter, lambda, group-by

Asked by jirinovo

For example, I have the following table:

index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3

After grouping by A:

0:
index,A,B
0,0,0
1,0,8
2,0,8

1:
index,A,B
3,1,5
4,1,3

What I need is to drop the rows from each group where the number in column B is less than the maximum value of column B over all rows of that group. I have trouble formulating this in English, so here is an example:

Maximum value from rows in column B in group 0: 8

So I want to drop the row with index 0 and keep the rows with indexes 1 and 2.

Maximum value from rows in column B in group 1: 5

So I want to drop the row with index 4 and keep the row with index 3.

I have tried to use the pandas filter function, but the problem is that it operates on all rows of a group at once:

data = <example table>
grouped = data.groupby("A")
filtered = grouped.filter(lambda x: x["B"] == x["B"].max())
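
For reference, a minimal runnable reconstruction of this attempt (assuming pandas is imported as pd and the example table is loaded from an inline CSV string; the try/except is only there to show the failure and is not part of the original question):

import pandas as pd
from io import StringIO

# The example table from above, loaded from an inline CSV string.
data = pd.read_csv(StringIO("""index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3"""), index_col="index")

grouped = data.groupby("A")
try:
    # filter() expects the lambda to return one boolean per group,
    # so this per-row comparison raises a TypeError instead of
    # dropping individual rows.
    grouped.filter(lambda x: x["B"] == x["B"].max())
except TypeError as err:
    print(err)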

So what I ideally need is some filter which iterates through the rows of each group.

Thanks for the help!

P.S. Is there also a way to just delete rows within the groups and not return a DataFrame object?

Accepted answer by Paul H

You just need to use apply on the groupby object. I modified your example data to make this a little more clear:

import pandas
from io import StringIO

csv = StringIO("""index,A,B
0,1,0.0
1,1,3.0
2,1,6.0
3,2,0.0
4,2,5.0
5,2,7.0""")

df = pandas.read_csv(csv, index_col='index')
groups = df.groupby(by=['A'])
print(groups.apply(lambda g: g[g['B'] == g['B'].max()]))

Which prints:

         A    B
A index        
1 2      1  6.0
2 5      2  7.0
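
(A side note beyond the original answer: if you would rather keep only the original index, without the extra A level that apply adds, one option is to pass group_keys=False to groupby, as sketched below.)

# Same filtering, but the group key is not added as an extra index level.
print(df.groupby('A', group_keys=False).apply(lambda g: g[g['B'] == g['B'].max()]))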

Answered by JoeCondron

EDIT: I just learned a much neater way to do this using the .transform groupby method:

def get_max_rows(df):
    B_maxes = df.groupby('A').B.transform(max)
    return df[df.B == B_maxes] 

B_maxes is a Series, indexed identically to the original df, containing the maximum value of B for each A group. You can pass lots of functions to the transform method; I think they just have to output either a scalar or a vector of the same length. You can even pass strings naming common functions, such as 'median'. This is slightly different from Paul H's method in that 'A' won't be an index in the result, but you can easily set that afterwards.

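For instance, 'A' can be moved back into the index afterwards (a small sketch beyond the original answer, reusing the df from the accepted answer):

result = get_max_rows(df)
# Move 'A' into the index so the shape resembles the accepted answer's output.
print(result.set_index('A', append=True).swaplevel())

A quick timing comparison of get_max_rows against the apply-based approach from the accepted answer: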

import numpy as np
import pandas as pd
df_lots_groups = pd.DataFrame(np.random.rand(30000, 3), columns=list('BCD'))
df_lots_groups['A'] = np.random.choice(range(10000), 30000)

%timeit get_max_rows(df_lots_groups)
100 loops, best of 3: 2.86 ms per loop

%timeit df_lots_groups.groupby('A').apply(lambda df: df[ df.B == df.B.max()])
1 loops, best of 3: 5.83 s per loop

EDIT:

Here's an abstraction which allows you to select rows from groups using any valid comparison operator and any valid groupby method:

def get_group_rows(df, group_col, condition_col, func=max, comparison='=='):
    g = df.groupby(group_col)[condition_col]
    condition_limit = g.transform(func)
    return df.query('{} {} @condition_limit'.format(condition_col, comparison))

So, for example, if you want all rows above the median B-value in each A-group, you call:

get_group_rows(df, 'A', 'B', 'median', '>')
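
For instance, on the small df from the accepted answer (a quick usage sketch, not from the original answer):

# Rows where B equals its group maximum; the same rows as get_max_rows(df).
print(get_group_rows(df, 'A', 'B', 'max', '=='))

# Rows strictly below their group's mean of B.
print(get_group_rows(df, 'A', 'B', 'mean', '<'))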

A few examples:

%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'max', '==')
100 loops, best of 3: 2.84 ms per loop
%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'mean', '!=')
100 loops, best of 3: 2.97 ms per loop

Answered by Surya

Here's another example: filtering the rows with the maximum value after a groupby operation, using idxmax() and .loc[]:

In [465]: import pandas as pd

In [466]:   df = pd.DataFrame({
               'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2'],
               'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4'], 
               'value' : [3,2,5,8,10,1]     
                })

In [467]: df
Out[467]: 
   mt   sp  value
0  S1  MM1      3
1  S1  MM1      2
2  S3  MM1      5
3  S3  MM2      8
4  S4  MM2     10
5  S4  MM2      1

### Here, idxmax() finds the indices of the rows with max value within groups,
### and .loc() filters the rows using those indices :
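### Side note: idxmax() returns a single label per group, so when several rows
### tie for the maximum (as in group 0 of the question), only the first of them
### is kept; the mask-based approaches above keep all tied rows.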
In [468]: df.loc[df.groupby(["mt"])["value"].idxmax()]                                                                                                                           
Out[468]: 
   mt   sp  value
0  S1  MM1      3
3  S3  MM2      8
4  S4  MM2     10

Answered by Briford Wylie

All of these answers are good, but I wanted the following:

(DataFrameGroupBy object) --> filter some rows out --> (DataFrameGroupBy object)

Shrug, it appears that this is harder and more interesting than I expected. So this one-liner accomplishes what I wanted, but it's probably not the most efficient way :)

gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names) 

Working Code Example:

import pandas as pd

def print_groups(gdf): 
    for name, g in gdf: 
        print('\n'+name) 
        print(g)

df = pd.DataFrame({'name': ['sue', 'jim', 'ted', 'moe'],
                   'team': ['A', 'A', 'B', 'B'], 
                   'fav_food': ['tacos', 'steak', 'tacos', 'steak']})                               

gdf = df.groupby('fav_food')                                                                                                                                           
print_groups(gdf)                                                                                                                                                      

    steak
        name team fav_food
    1  jim    A    steak
    3  moe    B    steak

    tacos
        name team fav_food
    0  sue    A    tacos
    2  ted    B    tacos

fgdf = gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names)                                                                      
print_groups(fgdf)                                                                                                                                                     

    steak
      name team fav_food
    0  jim    A    steak

    tacos
      name team fav_food
    1  sue    A    tacos