Original question: http://stackoverflow.com/questions/34138634/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Pandas GroupBy : How to get top n values based on a column
Asked by AbtPst
Forgive me if this is a basic question, but I am new to pandas. I have a dataframe with a column A and I would like to get the top n rows based on the count in column A. For instance, the raw data looks like
A B C
x 12 ere
x 34 bfhg
z 6 bgn
z 8 rty
y 567 hmmu,,u
x 545 fghfgj
x 44 zxcbv
Note that this is just a small sample of the data that I am actually working with.
So if we look at column A, value x appears 4 times, z appears 2 times and y appears 1 time. How can I get the top n values for column A based on this count?
print(df.groupby(['A']).sum())
this gives me
A B
x 6792117
but when I do
print(len(df.groupby(['A']).get_group('x')))
I get
21
Furthermore
len(df.index)
gives me
23657
So how can the count of 'A' == 'x' be 6792117, as seen in the result of the group by? What am I missing?
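A likely explanation, sketched here on the sample data: sum() adds up the values in each group, while size() counts the rows. The frame below is rebuilt from the sample table, and selecting column B before summing is a choice made only for this illustration.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
    "C": ["ere", "bfhg", "bgn", "rty", "hmmu,,u", "fghfgj", "zxcbv"],
})

# sum() adds the values of B within each group of A.
sums = df.groupby("A")["B"].sum()

# size() counts the rows in each group of A.
sizes = df.groupby("A").size()

print(sums)
print(sizes)
```

On this sample the two results differ for every group, which is the same kind of gap as 6792117 versus 21.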
Update
Consider
print(df.groupby(['A']).describe())
gives me
Tags        DocID
x    count  21.000000
     mean   323434.142857
     std    35677.410292
     min    266631.000000
     25%    292054.000000
     50%    325575.000000
     75%    347450.000000
     max    380286.000000
which makes sense. I just want to get the row which has the max count as per column A.
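For the single most frequent value, one option is to count the rows per group and ask for the label of the largest count. This is a minimal sketch on the sample data; the variable names are illustrative, and it assumes ties do not matter.

```python
import pandas as pd

# Columns A and B of the sample data from the question.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
})

# Number of rows for each distinct value of A.
counts = df.groupby("A").size()

# Label of the group with the most rows, and that row count.
top_label = counts.idxmax()
top_count = counts.max()

print(top_label, top_count)
```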
Update2
I did
print(df.groupby(['A'], as_index=False).count())
I get
A B C
0 x 21 21
1 y 11 11
2 z 8 8
So basically, for column A, tag x has 21 entries in column B and 21 in column C. Columns B and C are unique in my case, which is good. Now how do I get the top n rows with respect to column C?
Update3
So I tried
import heapq
print(heapq.nlargest(3, df.groupby(['A'], as_index=False).count()['C']))
and I get
[151, 85, 72]
So I know that for column A, the above are the top 3 counts. But I still don't know which values of column A these counts refer to. For example, which value in column A has a count of 151? Is there any way to link this information?
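heapq.nlargest returns a plain list, so the labels are lost. A pandas Series keeps its index, so calling nlargest on the per-group counts leaves each count attached to its value of A. A small sketch on the sample data, with illustrative variable names:

```python
import pandas as pd

# Columns A and B of the sample data from the question.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
})

# Row count per value of A; the index holds the values of A.
counts = df.groupby("A").size()

# The 2 largest counts, still labelled by their value of A.
top2 = counts.nlargest(2)

for label, count in top2.items():
    print(label, count)
```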
Answered by jezrael
IIUC you can use the function nlargest.
I tried your sample data and got the top 2 rows by column C:
print(df)
A B C
0 x 12 ere
1 x 34 bfhg
2 z 6 bgn
3 z 8 rty
4 y 567 hmmu,,u
5 x 545 fghfgj
6 x 44 zxcbv
dcf = df.groupby(['A'], as_index=False).count()
print(dcf)
A B C
0 x 4 4
1 y 1 1
2 z 2 2
# get the 2 rows with the largest values in column C
print(dcf.nlargest(2, 'C'))
A B C
0 x 4 4
2 z 2 2
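If the goal is the original rows that belong to the most frequent values of A, rather than the table of counts, one possible follow-up is to filter df by the labels that nlargest returns. This is a sketch under that assumption; top_labels and top_rows are names introduced here.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "A": ["x", "x", "z", "z", "y", "x", "x"],
    "B": [12, 34, 6, 8, 567, 545, 44],
    "C": ["ere", "bfhg", "bgn", "rty", "hmmu,,u", "fghfgj", "zxcbv"],
})

# Count rows per value of A, keeping A as a regular column.
dcf = df.groupby(["A"], as_index=False).count()

# Values of A for the 2 groups with the largest counts.
top_labels = dcf.nlargest(2, "C")["A"]

# Original rows whose A is one of those values.
top_rows = df[df["A"].isin(top_labels)]

print(top_rows)
```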
Answered by AbtPst
One approach that I tried
import heapq
dcf = df.groupby(['A'], as_index=False).count()
# keep the rows whose count is among the 5 largest, ordered by count
print(dcf.loc[dcf['C'].isin(heapq.nlargest(5, dcf['C']))].sort_values(['C'], ascending=False))
gives me
A B C
1664 g 151 151
1887 k 85 85
1533 q 72 72
53 y 68 68
1793 t 62 62
verified by
print(len(df.loc[df["A"] == "g"]))
gives me
151
So I get the desired results, as I can see the top 5 values based on the count from column A. But surely there must be a better way of doing this?
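A shorter route, offered as a sketch rather than as the accepted method: Series.value_counts counts each distinct value and sorts the result in descending order by default, so head(n) gives the top n values of A together with their counts. The example uses n = 2 on the sample data.

```python
import pandas as pd

# Column A of the sample data from the question.
df = pd.DataFrame({"A": ["x", "x", "z", "z", "y", "x", "x"]})

# Counts per distinct value of A, largest first; keep the first 2.
top = df["A"].value_counts().head(2)

print(top)
```

Because the result is indexed by the values of A, no separate lookup is needed to see which value each count belongs to.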