Original question: http://stackoverflow.com/questions/41821430/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Filter rows after groupby pandas
Asked by Shubham R
I have a table in pandas:
import pandas as pd
df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1
I want to do a groupby and then keep only the rows whose pidx value occurs more than 2 times.
That is, keep the rows where pidx is 10 or 20.
I tried df.groupby('pidx').count(), but it didn't help. For those rows I also need to compute 0.4*count+0.6*score.
Desired output:
LeafID  count  pidx  pidy  final_score
     1     10    10    20
     1     20    10    20
     1     40    10    20
     6     50    10    43
     1     20    20    12
     3     30    20    54
     1     70    20    23
Accepted answer by jezrael
You can use value_counts with boolean indexing and isin:
df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1
s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
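The same steps can be wrapped in a small reusable function. This is only a minimal sketch: the helper name keep_frequent and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd


def keep_frequent(df, column, threshold):
    """Keep rows whose value in `column` occurs more than `threshold` times.

    Hypothetical helper, not a pandas function.
    """
    counts = df[column].value_counts()           # occurrences of each value
    frequent = counts[counts > threshold].index  # values that are frequent enough
    return df[df[column].isin(frequent)]         # boolean mask on the original rows


df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

result = keep_frequent(df, 'pidx', 2)
print(result.index.tolist())  # [0, 1, 3, 7]
```

The original index is preserved, so the kept rows can still be matched against the input frame.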
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
# One million rows; pidx takes up to 10000 distinct values
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)
print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    # Keep rows whose pidx value occurs more than 120 times
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))
In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop
In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop
In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop
In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop
In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
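As a sanity check, the approaches timed above should select the same rows. Here is a sketch on a smaller random frame; the size 20000, the 2000 distinct pidx values and the threshold 12 are arbitrary choices for illustration and are not taken from the benchmark.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 20000  # assumption: much smaller than the benchmark so it runs quickly
df = pd.DataFrame({'pidx': np.random.randint(2000, size=N),
                   'count': np.random.randint(1000, size=N)})

threshold = 12  # assumption: arbitrary cut-off for this smaller frame

# 1) groupby + filter with a Python-level lambda per group
by_filter = df.groupby('pidx').filter(lambda x: len(x) > threshold)

# 2) value_counts + isin
s = df.pidx.value_counts()
by_isin = df[df.pidx.isin(s[s > threshold].index)]

# 3) groupby + transform('size') used as a boolean mask
by_transform = df[df.groupby('pidx').pidx.transform('size') > threshold]

print(by_filter.equals(by_isin), by_filter.equals(by_transform))
```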
For final_score you can use:
df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
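Putting the filter and the score together could look like the sketch below. It uses assign on the filtered rows, which is one way to avoid writing into a slice of the original frame. Note that the count column has to be accessed as df['count'], because df.count is a DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Keep rows whose pidx value occurs more than twice
counts = df['pidx'].value_counts()
filtered = df[df['pidx'].isin(counts[counts > 2].index)]

# assign returns a new frame with the extra column
filtered = filtered.assign(
    final_score=0.4 * filtered['count'] + 0.6 * filtered['score']
)
print(filtered)
```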
Answer by Ted Petrou
This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.
df.groupby('pidx').filter(lambda x: len(x) > 2)
   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43
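To see why only pidx 10 survives, it can help to look at the group sizes next to the filter result. A sketch using the frame built by the code in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Number of rows per pidx value:
# 10 occurs 4 times, 20 and 30 twice each, the rest once
sizes = df.groupby('pidx').size()
print(sizes)

# filter calls the lambda once per group and keeps the groups
# for which it returns True; rows come back in their original order
kept = df.groupby('pidx').filter(lambda g: len(g) > 2)
print(kept.index.tolist())  # [0, 1, 3, 7]
```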
Answer by piRSquared
pandas
df[df.groupby('pidx').pidx.transform('count') > 2]
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
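One thing to keep in mind with transform: 'count' counts only the non-missing values of the selected column, while 'size' counts all rows in the group. With no missing values they agree, as in the data above. The sketch below uses a small made-up frame with one NaN to show where they differ.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: pidx 10 has three rows, one with a missing score
df = pd.DataFrame({
    'pidx': [10, 10, 10, 20, 20],
    'score': [1.0, np.nan, 3.0, 4.0, 5.0],
})

size = df.groupby('pidx')['score'].transform('size')    # rows per group
count = df.groupby('pidx')['score'].transform('count')  # non-missing scores per group

print(size.tolist())   # [3, 3, 3, 2, 2]
print(count.tolist())  # [2, 2, 2, 2, 2]

print(df[size > 2].index.tolist())   # [0, 1, 2]
print(df[count > 2].index.tolist())  # []
```

When the goal is "how many rows have this key", 'size' is usually the safer choice.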
Answer by V Shreyas
First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:
# .copy() so that adding a column later does not write into a slice of df
df2 = df[df['pidx']<=20].copy()
df2 = df2.sort_values(by = 'pidx')
This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.
Row operations are also simple using indexing:
df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
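Note that df['pidx'] <= 20 selects by the value of pidx, while the other answers select by how often a value occurs. The two only coincide for particular data. A sketch comparing them on the frame built by the question's code, where pidx 20 occurs only twice:

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Selection by value: every row with pidx <= 20, sorted by pidx
by_value = df[df['pidx'] <= 20].copy().sort_values(by='pidx', kind='mergesort')

# Selection by frequency: rows whose pidx value occurs more than twice
by_frequency = df[df.groupby('pidx')['pidx'].transform('size') > 2]

print(sorted(by_value.index))        # [0, 1, 3, 6, 7, 10]
print(by_frequency.index.tolist())   # [0, 1, 3, 7]
```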