Python pandas - filter rows after groupby

Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow

Original post: http://stackoverflow.com/questions/27488080/

Python pandas - filter rows after groupby

Asked by jirinovo
For example, I have the following table:
index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3
After grouping by A:
0:
index,A,B
0,0,0
1,0,8
2,0,8
1:
index,A,B
3,1,5
4,1,3
What I need is to drop rows from each group where the number in column B is less than the maximum value of column B over all rows in that group. Well, I have a problem translating and formulating this problem in English, so here is the example:
Maximum value from rows in column B in group 0: 8

So I want to drop the row with index 0 and keep the rows with indexes 1 and 2.
Maximum value from rows in column B in group 1: 5

So I want to drop the row with index 4 and keep the row with index 3.
I have tried to use the pandas filter function, but the problem is that it operates on all the rows in a group at one time:
data = <example table>
grouped = data.groupby("A")
filtered = grouped.filter(lambda x: x["B"] == x["B"].max())
So what I ideally need is some filter that iterates through all the rows in a group.

Thanks for the help!
P.S. Is there also a way to only delete the rows in the groups and not return a DataFrame object?
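Editor's note (not part of the original question): pandas' GroupBy.filter passes each group to the function, expects a single boolean back, and then keeps or drops the whole group. That is why the attempt above cannot drop individual rows. A minimal sketch of that behavior on the example data (the threshold 7 is arbitrary and only for illustration):

import pandas as pd

# The example table from the question.
data = pd.DataFrame({"A": [0, 0, 0, 1, 1], "B": [0, 8, 8, 5, 3]})

# filter() keeps or drops WHOLE groups: group A=0 survives (its max B, 8, is > 7)
# and group A=1 is dropped entirely, so individual rows cannot be selected this way.
print(data.groupby("A").filter(lambda g: g["B"].max() > 7))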
Accepted answer by Paul H
You just need to use apply on the groupby object. I modified your example data to make this a little more clear:
import pandas
from io import StringIO
csv = StringIO("""index,A,B
0,1,0.0
1,1,3.0
2,1,6.0
3,2,0.0
4,2,5.0
5,2,7.0""")
df = pandas.read_csv(csv, index_col='index')
groups = df.groupby(by=['A'])
print(groups.apply(lambda g: g[g['B'] == g['B'].max()]))
Which prints:

         A  B
A index
1 2      1  6
2 4      2  7
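A small follow-up (editor's note, not part of the original answer): the extra A level in the result's index comes from apply keeping the group keys. If you prefer a result indexed like the original frame, groupby accepts group_keys=False; a sketch reusing the df defined above:

# Same selection as above, but without adding the group key 'A' to the index.
flat = df.groupby('A', group_keys=False).apply(lambda g: g[g['B'] == g['B'].max()])
print(flat)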
Answer by JoeCondron
EDIT: I just learned a much neater way to do this using the .transform groupby method:
def get_max_rows(df):
    # transform(max) broadcasts each group's maximum of B back onto every row,
    # so the boolean mask keeps only the rows equal to their group's maximum.
    B_maxes = df.groupby('A').B.transform(max)
    return df[df.B == B_maxes]
B_maxes is a Series, indexed identically to the original df, containing the maximum value of B for each A group. You can pass lots of functions to the transform method, as long as they output either a scalar or a vector of the same length. You can even pass strings naming common functions, like 'median'. This is slightly different from Paul H's method in that 'A' won't be an index in the result, but you can easily set that afterwards.
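For concreteness, a quick usage sketch (editor's addition) of get_max_rows on the asker's example table, assuming the function above has been defined:

import pandas as pd

# The example table from the question; the group maxima of B are 8 (A=0) and 5 (A=1).
df = pd.DataFrame({'A': [0, 0, 0, 1, 1], 'B': [0, 8, 8, 5, 3]})

print(get_max_rows(df))
#    A  B
# 1  0  8
# 2  0  8
# 3  1  5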
import numpy as np
import pandas as pd
df_lots_groups = pd.DataFrame(np.random.rand(30000, 3), columns=list('BCD'))
df_lots_groups['A'] = np.random.choice(range(10000), 30000)
%timeit get_max_rows(df_lots_groups)
100 loops, best of 3: 2.86 ms per loop
%timeit df_lots_groups.groupby('A').apply(lambda df: df[ df.B == df.B.max()])
1 loops, best of 3: 5.83 s per loop
EDIT:

Here's an abstraction which allows you to select rows from groups using any valid comparison operator and any valid groupby method:
def get_group_rows(df, group_col, condition_col, func=max, comparison='=='):
    # Broadcast the group-wise aggregate of condition_col (e.g. its max) over each row,
    # then keep the rows satisfying `condition_col <comparison> <group aggregate>`.
    g = df.groupby(group_col)[condition_col]
    condition_limit = g.transform(func)
    return df.query('{} {} @condition_limit'.format(condition_col, comparison))
So, for example, if you want all rows above the median B-value in each A-group, you call:
get_group_rows(df, 'A', 'B', 'median', '>')
A few examples:
%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'max', '==')
100 loops, best of 3: 2.84 ms per loop
%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'mean', '!=')
100 loops, best of 3: 2.97 ms per loop
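A small aside (editor's addition, not from the original answer): the same selection can be written without building a query string, by mapping the comparison onto Python's operator module; a sketch of that hypothetical variant:

import operator

# Hypothetical alternative to get_group_rows that avoids string formatting / query().
_OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
        '<=': operator.le, '>': operator.gt, '>=': operator.ge}

def get_group_rows_mask(df, group_col, condition_col, func='max', comparison='=='):
    condition_limit = df.groupby(group_col)[condition_col].transform(func)
    return df[_OPS[comparison](df[condition_col], condition_limit)]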
Answer by Surya
Here's another example: filtering the rows with the maximum value after a groupby operation, using idxmax() and .loc[]:
In [465]: import pandas as pd
In [466]: df = pd.DataFrame({
              'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2'],
              'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4'],
              'value' : [3, 2, 5, 8, 10, 1]
          })
In [467]: df
Out[467]:
   mt   sp  value
0  S1  MM1      3
1  S1  MM1      2
2  S3  MM1      5
3  S3  MM2      8
4  S4  MM2     10
5  S4  MM2      1
### Here, idxmax() finds the indices of the rows with max value within groups,
### and .loc[] selects the rows using those indices:
In [468]: df.loc[df.groupby(["mt"])["value"].idxmax()]
Out[468]:
   mt   sp  value
0  S1  MM1      3
3  S3  MM2      8
4  S4  MM2     10
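One caveat worth noting (editor's addition, not from the original answer): idxmax returns a single index per group, so when several rows tie for the maximum only the first one is kept, whereas the transform-based mask keeps every tied row. A small sketch of the difference:

import pandas as pd

# Two rows tie for the maximum inside group 'x'.
tied = pd.DataFrame({'grp': ['x', 'x', 'y'], 'val': [5, 5, 2]})

print(tied.loc[tied.groupby('grp')['val'].idxmax()])          # keeps only index 0 for 'x'
print(tied[tied['val'] == tied.groupby('grp')['val'].transform('max')])  # keeps both tied rows, 0 and 1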
Answer by Briford Wylie
All of these answers are good, but I wanted the following:

(DataFrameGroupBy object) --> filter some rows out --> (DataFrameGroupBy object)

Shrug, it appears that this is harder and more interesting than I expected. So this one-liner accomplishes what I wanted, but it's probably not the most efficient way :)
gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names)
Working Code Example:
import pandas as pd

def print_groups(gdf):
    for name, g in gdf:
        print('\n' + name)
        print(g)

df = pd.DataFrame({'name': ['sue', 'jim', 'ted', 'moe'],
                   'team': ['A', 'A', 'B', 'B'],
                   'fav_food': ['tacos', 'steak', 'tacos', 'steak']})

gdf = df.groupby('fav_food')
print_groups(gdf)
steak
  name team fav_food
1  jim    A    steak
3  moe    B    steak

tacos
  name team fav_food
0  sue    A    tacos
2  ted    B    tacos
fgdf = gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names)
print_groups(fgdf)
steak
  name team fav_food
0  jim    A    steak

tacos
  name team fav_food
1  sue    A    tacos
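A closing aside (editor's addition, not from the original answer): when the row condition does not depend on the groups themselves, the same result can be had more simply by filtering before grouping; a sketch reusing the df and print_groups defined above (the original row index is kept here rather than reset):

# Equivalent here: drop the unwanted rows first, then group what is left.
fgdf2 = df[df['team'] == 'A'].groupby('fav_food')
print_groups(fgdf2)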