Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/41821430/

Date: 2020-09-14 02:51:10  Source: igfitidea

Filter rows after groupby pandas

python, pandas

Asked by Shubham R

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1

I want to do a groupby and then keep only the rows whose pidx value occurs more than 2 times.

That is, keep the rows where pidx is 10 or 20.

I tried using df.groupby('pidx').count() but it didn't help. Also, for those rows I have to compute 0.4*count + 0.6*score.

Desired output is:

LeafID  count  pidx  pidy  final_score
     1     10    10    20
     1     20    10    20
     1     40    10    20
     6     50    10    43
     1     20    20    12
     3     30    20    54
     1     70    20    23

Accepted answer by jezrael

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
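
The three steps above (count, select the frequent values, mask the rows) can be spelled out with the intermediate results visible. This is a minimal sketch on the question's data; the names counts, frequent, mask and result are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Step 1: how often does each pidx value occur?
counts = df['pidx'].value_counts()

# Step 2: keep only the pidx values that occur more than twice.
frequent = counts[counts > 2].index

# Step 3: boolean mask over the original rows, then select.
mask = df['pidx'].isin(frequent)
result = df[mask]

print(counts.to_dict())
print(list(frequent))
print(result)
```

With the data as defined in the code, only pidx 10 passes the threshold, since 20 and 30 each occur twice.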

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
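
The timed expressions differ in speed but should select the same rows. A small check on the question's data, assuming a threshold of 2 instead of 120, is sketched below; the variable names are illustrative only. The slower variants are the ones that pass a Python callable (a lambda or len), which is presumably evaluated once per group, while 'size' and 'count' name built-in aggregations.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Variant 1: groupby + filter with a Python lambda per group.
via_filter = df.groupby('pidx').filter(lambda g: len(g) > 2)

# Variant 2: value_counts + isin.
s = df['pidx'].value_counts()
via_isin = df[df['pidx'].isin(s[s > 2].index)]

# Variant 3: broadcast the group size back onto every row.
sizes = df.groupby('pidx')['pidx'].transform('size')
via_transform = df[sizes > 2]

# Compare the selected row labels of the three variants.
rows_filter = list(via_filter.index)
rows_isin = list(via_isin.index)
rows_transform = list(via_transform.index)
print(rows_filter, rows_isin, rows_transform)
```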

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
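
Putting the filter and the score together, one possible end-to-end version is sketched here. It assumes the score should be computed only for the rows that survive the filter, as the question asks; the .copy() call and the name result are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Keep rows whose pidx value occurs more than twice.
s = df['pidx'].value_counts()
# .copy() gives an independent frame, so adding a column below
# does not touch df or trigger a chained-assignment warning.
result = df[df['pidx'].isin(s[s > 2].index)].copy()

# Weighted score: 0.4 * count + 0.6 * score.
# 'count' needs bracket access because df.count is a DataFrame method.
result['final_score'] = result['count'].mul(0.4).add(result['score'].mul(0.6))

print(result.round({'final_score': 2}))
```

Bracket access matters for the count column: the attribute form df.count refers to the DataFrame method rather than the column, which is why the answer writes df['count'] but can write df.score.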

Answer by Ted Petrou

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43

Answer by piRSquared

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Answer by V Shreyas

First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:

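# .copy() so that adding a column to df2 later does not warn
df2 = df[df['pidx'] <= 20].copy()
# sort_index(by=...) was removed from pandas; sort_values is the replacement
df2 = df2.sort_values(by='pidx')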

This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.

Row operations are also simple using indexing:

df2['final_score'] = 0.4*df2['count'] + 0.6*df2['score']