Python: counting and sorting with Pandas

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/40454030/

Date: 2020-08-19 23:34:08  Source: igfitidea

Count and Sort with Pandas

Tags: python, sorting, pandas, count, group-by

Asked by Rubans

I have a dataframe of values read from a file, which I have grouped by two columns to get a count per group. Now I want to sort by the count, largest first, but I get the following error:

KeyError: 'count'

It looks like the count column produced by the group by / agg is some sort of index, so I'm not sure how to do this. I'm a beginner with Python and Pandas. Here's the actual code; please let me know if you need more detail:

def answer_five():
    df = census_df#.set_index(['STNAME'])
    df = df[df['SUMLEV'] == 50]
    df = df[['STNAME','CTYNAME']].groupby(['STNAME']).agg(['count']).sort(['count'])
    #df.set_index(['count'])
    print(df.index)
    # get sorted count max item
    return df.head(5)

Answered by jezrael

I think you need to add reset_index, then pass the parameter ascending=False to sort_values, because sort returns:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  .sort_values(['count'], ascending=False)

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(5)

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(5)

print (df)
  STNAME  count
2      c      5
5      s      4
1      b      3
0      a      2
3      d      1
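
For context on the KeyError in the question, here is a minimal sketch, assuming a reasonably recent pandas version and the same sample data. Passing a list to agg labels the result with two-level columns, so there is no plain column named count to sort by, whereas reset_index(name='count') creates one. The variable names agg, flat and top are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# agg(['count']) produces two-level column labels, so the only
# column is ('CTYNAME', 'count') rather than a plain 'count'.
agg = df[['STNAME', 'CTYNAME']].groupby(['STNAME']).agg(['count'])
print(agg.columns.tolist())

# count() on a single column returns a Series; reset_index(name=...)
# turns it into a frame with an ordinary 'count' column.
flat = df.groupby('STNAME')['CTYNAME'].count().reset_index(name='count')
print(flat.columns.tolist())

# Sorting by the plain column now works.
top = flat.sort_values('count', ascending=False).head(3)
print(top)
```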


But it seems you need Series.nlargest:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5)

or:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5)

The difference between size and count is:

size counts NaN values, count does not.
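
A small sketch of that difference, using a made-up frame (the values below are hypothetical and chosen only so that some CTYNAME entries are NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' has one NaN, group 'b' has two.
df = pd.DataFrame({'STNAME': ['a', 'a', 'b', 'b', 'b'],
                   'CTYNAME': [1.0, np.nan, 2.0, np.nan, np.nan]})

grouped = df.groupby('STNAME')['CTYNAME']
sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group

print(pd.DataFrame({'size': sizes, 'count': counts}))
```

If the counted column can contain missing values, the two methods can rank groups differently, so it is worth choosing deliberately.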

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .size() \
                             .nlargest(5) \
                             .reset_index(name='top5')
print (df)
  STNAME  top5
0      c     5
1      s     4
2      b     3
3      a     2
4      d     1
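
In this sample, d and e both have a count of 1, so the fifth place is a tie. As a hedged sketch (assuming a pandas version where Series.nlargest accepts keep='all'), the keep parameter controls whether tied groups are dropped or retained; the names sizes, top5 and top5_ties are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

sizes = df.groupby('STNAME')['CTYNAME'].size()

# Default keep='first': exactly five rows, the first tied group wins.
top5 = sizes.nlargest(5)

# keep='all': every group tied for the last place is kept,
# so the result can hold more than five rows.
top5_ties = sizes.nlargest(5, keep='all')

print(top5)
print(top5_ties)
```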

Answered by Christoph Schranz

I don't know exactly what your df looks like. But if you have to sort the frequencies of several categories by their counts, it is easier to slice a Series from the df and sort the Series:

series = df.count().sort_values(ascending=False)
series.head()

Note that this series will use the name of the category as index!
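
A sketch of what this looks like on the sample data from the earlier answer. Note that DataFrame.count() counts non-null cells per column, so its index holds column names. If "category" is read as the distinct values of a single column (an assumption, the answer does not say so), value_counts() on that column is one way to get per-category frequencies, and it sorts them in descending order by default.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# Non-null cells per column; the index is the column names.
per_column = df.count().sort_values(ascending=False)
print(per_column)

# Assumed alternative: frequencies of each value in one column;
# the index is the category value, sorted by count descending.
per_category = df['STNAME'].value_counts()
print(per_category.head())
```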

Answered by Angelin Nadar

I agree with @Christoph Schranz: slice a Series from the dataframe.

df[['STNAME','CTYNAME']].groupby('STNAME')['CTYNAME'].count().nlargest(3)