pandas: filtering rows after a groupby

Question

Asked by Shubham R

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1

I want to do a groupby and then keep only the rows whose pidx value occurs more than 2 times.

That is, keep the rows where pidx is 10 or 20.

I tried df.groupby('pidx').count(), but it didn't help. For the rows that remain, I also need to compute 0.4*count + 0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23

Answer 1

Accepted answer by jezrael

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

# Pool of letters sampled for the pidy column
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
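
Before comparing speeds, it may be worth confirming that the approaches select the same rows. The following is a minimal sketch on a smaller random frame; the names small, threshold, by_filter, by_isin and by_transform, as well as the sizes used, are assumptions chosen for illustration and are not part of the timings above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 20_000  # hypothetical size, far smaller than the 1,000,000 rows timed above
small = pd.DataFrame({
    'pidx': rng.integers(200, size=N),
    'count': rng.integers(1000, size=N),
})
threshold = 100  # hypothetical cut-off, roughly the average group size here

# 1) groupby + filter with a Python-level lambda (called once per group)
by_filter = small.groupby('pidx').filter(lambda g: len(g) > threshold)

# 2) value_counts + isin
counts = small['pidx'].value_counts()
by_isin = small[small['pidx'].isin(counts[counts > threshold].index)]

# 3) groupby + transform('size'), used as a boolean mask
by_transform = small[small.groupby('pidx')['pidx'].transform('size') > threshold]

# All three should contain the same rows in the original order.
print(by_filter.equals(by_isin), by_isin.equals(by_transform))
```

The lambda in the first variant runs once per group in Python, which is a plausible reason for it being the slowest entry in the timings, though the exact numbers will depend on the pandas version and the number of groups.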

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
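
Putting the two steps together, a sketch of the whole task on the question's DataFrame could look like this. The variable names counts, frequent and result are illustrative, and dropping the score column at the end is an assumption based on the desired output shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx':   [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy':   [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count':  [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score':  [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Count how often each pidx value occurs, then keep values seen more than twice.
counts = df['pidx'].value_counts()
frequent = counts[counts > 2].index

# .copy() so that adding a column below does not write into a slice of df.
result = df[df['pidx'].isin(frequent)].copy()

# Bracket access is required for 'count': df.count is the DataFrame.count method.
result['final_score'] = 0.4 * result['count'] + 0.6 * result['score']
result = result.drop(columns='score')
print(result)
```

With this data only pidx 10 occurs more than twice, so the result holds the rows at index 0, 1, 3 and 7.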

Answer 2

Answer by Ted Petrou

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43
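
To see why pidx 20 drops out, one can look at the group sizes directly. This is a small sketch using only the pidx column from the question's DataFrame constructor; group_sizes and kept are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
})

# Number of rows per distinct pidx value, indexed by pidx.
group_sizes = df.groupby('pidx').size()
print(group_sizes)

# Only values with more than 2 rows survive the filter.
kept = group_sizes[group_sizes > 2].index.tolist()
print(kept)
```

Here 20 and 30 each occur twice, so neither passes the len(x) > 2 test; changing the condition to len(x) >= 2 would keep them.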

Answer 3

Answer by piRSquared

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
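
The one-liner above works because transform returns a Series aligned with the original rows, so it can be compared with a number and used as a boolean mask. A short sketch, with sizes, mask and filtered as illustrative names and only two of the question's columns included:

```python
import pandas as pd

df = pd.DataFrame({
    'pidx':  [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
})

# One value per original row: the size of the group that row belongs to.
sizes = df.groupby('pidx')['pidx'].transform('size')
print(sizes.tolist())

# Comparing gives a boolean Series with the same index as df.
mask = sizes > 2
filtered = df[mask]
print(filtered)
```

'size' counts rows while 'count' counts non-null values; this data has no missing values, so both give the same mask.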

Answer 4

Answer by V Shreyas

First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:

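# .copy() avoids a SettingWithCopyWarning when a column is added to df2 later
df2 = df[df['pidx'] <= 20].copy()
# sort_index(by=...) has been removed from pandas; sort_values is the replacement
df2 = df2.sort_values(by='pidx')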

This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.

Row operations are also simple using indexing:

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
