Pandas GroupBy: How to get top n values based on a column

Notice: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me). Original Stack Overflow address: http://stackoverflow.com/questions/34138634/

Time: 2020-09-14 00:20:31  Source: igfitidea

Tags: python, pandas, count, group-by, dataframe

Asked by AbtPst

Forgive me if this is a basic question, but I am new to pandas. I have a dataframe with a column A, and I would like to get the top n rows based on the count in column A. For instance, the raw data looks like this:

A  B  C
x 12  ere
x 34  bfhg
z 6   bgn
z 8   rty
y 567 hmmu,,u
x 545 fghfgj
x 44  zxcbv

Note that this is just a small sample of the data that I am actually working with.

So if we look at column A in the sample, value x appears 4 times, z appears 2 times and y appears 1 time. How can I get the top n values for column A based on this count?

print(df.groupby(['A']).sum())

This gives me

A      B

x      6792117

But when I do

print(len(df.groupby(['A']).get_group('x')))

I get

21

Furthermore,

len(df.index)

gives me

23657

So how can the count for 'A' == 'x' be 6792117, as seen in the result of the groupby? What am I missing?
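The mismatch above is what one would expect if sum() adds up the values in each group rather than counting its rows. The following is a minimal sketch of that difference, assuming a small hypothetical frame built from the sample rows shown earlier (the real data is not available here):

```python
import pandas as pd

# Hypothetical miniature of the sample data: A holds tags, B holds numbers.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
})

# sum() adds up the values of B inside each group ...
totals = df.groupby("A")["B"].sum()

# ... whereas size() counts how many rows each group contains.
counts = df.groupby("A").size()

print(totals["x"])  # 12 + 34 + 545 + 44 = 635
print(counts["x"])  # 4 rows have A == 'x'
```

On this reading, a large number such as 6792117 would be a total of the numeric column for group x, not the number of rows in that group.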

Update

Consider

print(df.groupby(['A']).describe())

which gives me

     Tags           DocID

x    count      21.000000
     mean   323434.142857
     std     35677.410292
     min    266631.000000
     25%    292054.000000
     50%    325575.000000
     75%    347450.000000
     max    380286.000000

This makes sense. I just want to get the row that has the max count as per column A.

Update2

I did

print(df.groupby(['A'], as_index=False).count())

and I get

         A       B      C
0        x       21     21
1        y       11     11
2        z        8      8

So basically, for column A, tag x has 21 entries in column B and 21 in column C. Columns B and C are unique in my case, which is good. Now how do I get the top n rows with respect to column C?

Update3

So I tried

import heapq
print(heapq.nlargest(3, df.groupby(['A'], as_index=False).count()['C']))

and I get

[151, 85, 72]

So I know that, for column A, the counts above are the top 3 counts. But I still don't know which values of column A these counts refer to. For example, which value in column A has a count of 151? Is there any way to link this information?

Answered by jezrael

IIUC, you can use the function nlargest.

I tried your sample data and got the top 2 rows by column C:

print(df)
   A    B        C
0  x   12      ere
1  x   34     bfhg
2  z    6      bgn
3  z    8      rty
4  y  567  hmmu,,u
5  x  545   fghfgj
6  x   44    zxcbv

dcf = df.groupby(['A'], as_index=False).count()
print(dcf)
   A  B  C
0  x  4  4
1  y  1  1
2  z  2  2

# get the 2 largest rows by column C
print(dcf.nlargest(2, 'C'))
   A  B  C
0  x  4  4
2  z  2  2
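As a self-contained variation on the same idea, here is a sketch that rebuilds the sample frame and counts column A directly with Series.value_counts, which returns the counts sorted in descending order and indexed by the value of A. The frame construction and the name top2 are assumptions made for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for this sketch.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
    "C": ["ere", "bfhg", "bgn", "rty", "hmmu,,u", "fghfgj", "zxcbv"],
})

# value_counts() counts each distinct value of A and sorts descending,
# so head(2) keeps the two most frequent values with their counts.
top2 = df["A"].value_counts().head(2)
print(top2)
```

Because the index of top2 holds the values of A, each count stays linked to the value it belongs to.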

Answered by AbtPst

One approach that I tried:

import heapq

dcf = df.groupby(['A'], as_index=False).count()
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
print(dcf.loc[dcf['C'].isin(heapq.nlargest(5, dcf['C']))].sort_values(['C'], ascending=False))

gives me

      A       B      C
1664  g       151    151
1887  k       85     85
1533  q       72     72
53    y       68     68
1793  t       62     62

Verified by

print(len(df.loc[df["A"] == "g"]))

which gives me

151

So I get the desired results, as I can see the top 5 values based on the count from column A. But surely there must be a better way of doing this?
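One possibly simpler route, offered as a sketch and not as the accepted method, is to skip heapq and let pandas do the ranking with groupby(...).size().nlargest(n). The sample frame, the value n = 2, and the names top_counts and top_rows are assumptions for illustration. Note also that the isin-based filter above can return more than 5 groups when several groups tie on a count, whereas nlargest returns exactly n by default.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for this sketch.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
    "C": ["ere", "bfhg", "bgn", "rty", "hmmu,,u", "fghfgj", "zxcbv"],
})

n = 2  # hypothetical choice of how many groups to keep

# size() gives one row count per value of A; nlargest(n) keeps the n biggest,
# with the values of A preserved as the index.
top_counts = df.groupby("A").size().nlargest(n)
print(top_counts)

# Optionally pull the original rows that belong to those top groups.
top_rows = df[df["A"].isin(top_counts.index)]
print(top_rows)
```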