Count and Sort with Pandas
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/40454030/
Asked by Rubans
I have a dataframe of values from a file, which I have grouped by two columns, and the aggregation returns a count. Now I want to sort by the max count value, but I get the following error:
KeyError: 'count'
It looks like the count column produced by the group-by aggregation is some sort of index, so I'm not sure how to do this. I'm a beginner with Python and pandas. Here's the actual code; please let me know if you need more detail:
def answer_five():
    df = census_df#.set_index(['STNAME'])
    df = df[df['SUMLEV'] == 50]
    df = df[['STNAME','CTYNAME']].groupby(['STNAME']).agg(['count']).sort(['count'])
    #df.set_index(['count'])
    print(df.index)
    # get sorted count max item
    return df.head(5)
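The KeyError is consistent with how agg(['count']) labels its result: the columns become a two-level MultiIndex, so 'count' on its own is not a column key. The following is a minimal sketch, assuming a small made-up frame in place of census_df (which is not shown in the question) and using sort_values rather than the deprecated sort:

```python
import pandas as pd

# Hypothetical stand-in for census_df; only the two columns used in the question.
df = pd.DataFrame({
    "STNAME": list("abscscbcdbcsscae"),
    "CTYNAME": [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

grouped = df[["STNAME", "CTYNAME"]].groupby(["STNAME"]).agg(["count"])

# agg(['count']) yields two-level column labels, so 'count' alone is not a key.
print(grouped.columns.tolist())  # [('CTYNAME', 'count')]

# Sorting by the full two-level label works.
top = grouped.sort_values(by=[("CTYNAME", "count")], ascending=False).head(5)
print(top)
```

Flattening the result with reset_index(name='count'), as the first answer does, avoids the two-level label entirely.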
Answer by jezrael
I think you need to add reset_index, then pass ascending=False to sort_values, because sort returns:
FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  .sort_values(['count'], ascending=False)
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
       .count() \
       .reset_index(name='count') \
       .sort_values(['count'], ascending=False) \
       .head(5)
Sample:
import pandas as pd

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
       .count() \
       .reset_index(name='count') \
       .sort_values(['count'], ascending=False) \
       .head(5)
print (df)
  STNAME  count
2      c      5
5      s      4
1      b      3
0      a      2
3      d      1
But it seems you need Series.nlargest:
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5)
or:
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5)
The difference between size and count is: size counts NaN values, count does not.
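To make the distinction between size and count concrete, here is a small sketch on made-up data (not from the question) in which one group contains a missing value. On a grouped column, size reports the number of rows per group, while count reports only the non-null values:

```python
import numpy as np
import pandas as pd

# Made-up data: group 'a' has one missing CTYNAME value.
df = pd.DataFrame({
    "STNAME": ["a", "a", "a", "b", "b"],
    "CTYNAME": [1.0, np.nan, 3.0, 4.0, 5.0],
})

grouped = df.groupby("STNAME")["CTYNAME"]

sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-null values per group

print(sizes.to_dict())    # {'a': 3, 'b': 2}
print(counts.to_dict())   # {'a': 2, 'b': 2}
```

When the counted column has no missing values, as in the samples in this answer, the two give the same numbers.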
Sample:
df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
       .size() \
       .nlargest(5) \
       .reset_index(name='top5')
print (df)
  STNAME  top5
0      c     5
1      s     4
2      b     3
3      a     2
4      d     1
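As a rough check on the nlargest shortcut, the sketch below compares it with an explicit descending sort followed by head, using the same sample data. It takes the top 3 rather than the top 5 so that no tied counts are involved; with ties, the two approaches may order equal values differently.

```python
import pandas as pd

df = pd.DataFrame({
    "STNAME": list("abscscbcdbcsscae"),
    "CTYNAME": [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

# Number of rows per STNAME value.
counts = df.groupby("STNAME")["CTYNAME"].size()

top_nlargest = counts.nlargest(3)
top_sorted = counts.sort_values(ascending=False).head(3)

print(top_nlargest.to_dict())  # {'c': 5, 's': 4, 'b': 3}
print(top_sorted.to_dict())    # {'c': 5, 's': 4, 'b': 3}
```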
Answer by Christoph Schranz
I don't know exactly what your df looks like. But if you have to sort the frequency of several categories by count, it is easier to slice a Series from the df and sort the series:
series = df.count().sort_values(ascending=False)
series.head()
Note that this series will use the name of the category as index!
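One way to read this suggestion, sketched here as an assumption rather than the answerer's exact code: select the category column as a Series and call value_counts, which counts each distinct value and sorts in descending order by default. Note that df.count() on its own counts non-null values per column rather than per category. The sample frame is the one from the first answer.

```python
import pandas as pd

df = pd.DataFrame({
    "STNAME": list("abscscbcdbcsscae"),
    "CTYNAME": [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

# Slice one column as a Series; value_counts sorts by frequency, descending.
series = df["STNAME"].value_counts()

# The index holds the category values, as the note above points out.
print(series.head(3).to_dict())  # {'c': 5, 's': 4, 'b': 3}
```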
Answer by Angelin Nadar
I agree with @Christoph Schranz about slicing a Series from the dataframe:
df[['STNAME','CTYNAME']].groupby('STNAME')['CTYNAME'].count().nlargest(3)