Python pandas - filter rows after groupby

Disclaimer: the content below is taken from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27488080/

Python pandas - filter rows after groupby

python, pandas, filter, lambda, group-by

Asked by jirinovo

For example, I have the following table:

index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3

After grouping by A:

0:
index,A,B
0,0,0
1,0,8
2,0,8

1:
index,A,B
3,1,5
4,1,3

What I need is to drop the rows from each group where the number in column B is less than the maximum value of column B over all rows of that group. I have trouble formulating this in English, so here is an example:

Maximum value from rows in column B in group 0: 8

So I want to drop the row with index 0 and keep the rows with indexes 1 and 2.

Maximum value from rows in column B in group 1: 5

So I want to drop the row with index 4 and keep the row with index 3.

I have tried to use the pandas filter function, but the problem is that it operates on all rows of a group at once:

data = <example table>
grouped = data.groupby("A")
filtered = grouped.filter(lambda x: x["B"] == x["B"].max())
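
For reference, a minimal runnable reconstruction of this attempt (assuming pandas is imported as pd and the example table is loaded from an inline CSV string; the try/except is only there to show the failure and is not part of the original question):

import pandas as pd
from io import StringIO

# The example table from above, loaded from an inline CSV string.
data = pd.read_csv(StringIO("""index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3"""), index_col="index")

grouped = data.groupby("A")
try:
    # filter() expects the lambda to return one boolean per group,
    # so this per-row comparison raises a TypeError instead of
    # dropping individual rows.
    grouped.filter(lambda x: x["B"] == x["B"].max())
except TypeError as err:
    print(err)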

So what I ideally need is some filter which iterates through the rows of each group.

Thanks for the help!

P.S. Is there also a way to just delete rows within the groups and not return a DataFrame object?

Accepted answer by Paul H

You just need to use apply on the groupby object. I modified your example data to make this a little more clear:

import pandas
from io import StringIO

csv = StringIO("""index,A,B
0,1,0.0
1,1,3.0
2,1,6.0
3,2,0.0
4,2,5.0
5,2,7.0""")

df = pandas.read_csv(csv, index_col='index')
groups = df.groupby(by=['A'])
print(groups.apply(lambda g: g[g['B'] == g['B'].max()]))

Which prints:

         A    B
A index        
1 2      1  6.0
2 5      2  7.0
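
(A side note beyond the original answer: if you would rather keep only the original index, without the extra A level that apply adds, one option is to pass group_keys=False to groupby, as sketched below.)

# Same filtering, but the group key is not added as an extra index level.
print(df.groupby('A', group_keys=False).apply(lambda g: g[g['B'] == g['B'].max()]))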

Answered by JoeCondron

EDIT: I just learned a much neater way to do this using the .transform groupby method:

def get_max_rows(df):
    B_maxes = df.groupby('A').B.transform(max)
    return df[df.B == B_maxes] 

B_maxes is a Series, indexed identically to the original df, containing the maximum value of B for each A group. You can pass lots of functions to the transform method; I think they just have to output either a scalar or a vector of the same length. You can even pass strings naming common functions, such as 'median'. This is slightly different from Paul H's method in that 'A' won't be an index in the result, but you can easily set that afterwards.

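For instance, 'A' can be moved back into the index afterwards (a small sketch beyond the original answer, reusing the df from the accepted answer):

result = get_max_rows(df)
# Move 'A' into the index so the shape resembles the accepted answer's output.
print(result.set_index('A', append=True).swaplevel())

A quick timing comparison of get_max_rows against the apply-based approach from the accepted answer: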

import numpy as np
import pandas as pd
df_lots_groups = pd.DataFrame(np.random.rand(30000, 3), columns=list('BCD'))
df_lots_groups['A'] = np.random.choice(range(10000), 30000)

%timeit get_max_rows(df_lots_groups)
100 loops, best of 3: 2.86 ms per loop

%timeit df_lots_groups.groupby('A').apply(lambda df: df[ df.B == df.B.max()])
1 loops, best of 3: 5.83 s per loop

EDIT:

Here's an abstraction which allows you to select rows from groups using any valid comparison operator and any valid groupby method:

def get_group_rows(df, group_col, condition_col, func=max, comparison='=='):
    g = df.groupby(group_col)[condition_col]
    condition_limit = g.transform(func)
    return df.query('{} {} @condition_limit'.format(condition_col, comparison))

So, for example, if you want all rows above the median B-value in each A-group, you call:

get_group_rows(df, 'A', 'B', 'median', '>')
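
For instance, on the small df from the accepted answer (a quick usage sketch, not from the original answer):

# Rows where B equals its group maximum; the same rows as get_max_rows(df).
print(get_group_rows(df, 'A', 'B', 'max', '=='))

# Rows strictly below their group's mean of B.
print(get_group_rows(df, 'A', 'B', 'mean', '<'))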

A few examples:

%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'max', '==')
100 loops, best of 3: 2.84 ms per loop
%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'mean', '!=')
100 loops, best of 3: 2.97 ms per loop

Answered by Surya

Here's another example: filtering the rows with the maximum value after a groupby operation, using idxmax() and .loc[]:

In [465]: import pandas as pd

In [466]:   df = pd.DataFrame({
               'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2'],
               'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4'], 
               'value' : [3,2,5,8,10,1]     
                })

In [467]: df
Out[467]: 
   mt   sp  value
0  S1  MM1      3
1  S1  MM1      2
2  S3  MM1      5
3  S3  MM2      8
4  S4  MM2     10
5  S4  MM2      1

### Here, idxmax() finds the indices of the rows with max value within groups,
### and .loc() filters the rows using those indices :
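### Side note: idxmax() returns a single label per group, so when several rows
### tie for the maximum (as in group 0 of the question), only the first of them
### is kept; the mask-based approaches above keep all tied rows.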
In [468]: df.loc[df.groupby(["mt"])["value"].idxmax()]                                                                                                                           
Out[468]: 
   mt   sp  value
0  S1  MM1      3
3  S3  MM2      8
4  S4  MM2     10

Answered by Briford Wylie

All of these answers are good, but I wanted the following:

(DataFrameGroupBy object) --> filter some rows out --> (DataFrameGroupBy object)

Shrug, it appears that this is harder and more interesting than I expected. So this one-liner accomplishes what I wanted, but it's probably not the most efficient way :)

gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names) 

Working Code Example:

import pandas as pd

def print_groups(gdf): 
    for name, g in gdf: 
        print('\n'+name) 
        print(g)

df = pd.DataFrame({'name': ['sue', 'jim', 'ted', 'moe'],
                   'team': ['A', 'A', 'B', 'B'], 
                   'fav_food': ['tacos', 'steak', 'tacos', 'steak']})                               

gdf = df.groupby('fav_food')                                                                                                                                           
print_groups(gdf)                                                                                                                                                      

    steak
        name team fav_food
    1  jim    A    steak
    3  moe    B    steak

    tacos
        name team fav_food
    0  sue    A    tacos
    2  ted    B    tacos

fgdf = gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names)                                                                      
print_groups(fgdf)                                                                                                                                                     

    steak
      name team fav_food
    0  jim    A    steak

    tacos
      name team fav_food
    1  sue    A    tacos