Python Pandas groupby nlargest sum

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/40390634/

Date: 2020-08-19 23:29:34  Source: igfitidea

Tags: python, pandas, group-by, sum

Asked by user7102752

I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I am having trouble making it work.

State    County    Population
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

I want to use groupby to group by state, then get the top 2 counties by population in each state, and then sum only those top 2 county populations to get a total for that state.

In the end, I'll have a list containing each state and the combined population of its top 2 counties.

I can get groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.

The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

Answered by root

You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a Series that is no longer grouped (it is indexed by State and the original row label), so you can no longer do group-level operations on it. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.

The resulting output:

State
Alabama    150
Wyoming    330
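
To see why the chained call does not work directly, it can help to inspect what nlargest returns on the grouped column. The sketch below is illustrative only: it rebuilds a small DataFrame from just the rows shown in the question (the elided rows are left out), so the variable names and data are assumptions rather than the asker's real dataset.

```python
import pandas as pd

# Illustrative sample built only from the rows visible in the question.
df = pd.DataFrame({
    'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest on the grouped column returns an ordinary Series whose index has
# two levels: the group key (State) and the original row label.
top2 = df.groupby('State')['Population'].nlargest(2)
print(type(top2).__name__, top2.index.nlevels)

# apply runs nlargest and sum inside each group, giving one value per state.
top2_sum = df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
print(top2_sum)
```

Because top2 is already a flat Series, calling sum() on it directly would add up every value across all states, which is why the per-group sum has to happen inside apply (or through a second grouping step).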

EDIT

A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:

df.groupby('State')['Population'].nlargest(2).sum(level=0)

This is slightly slower than using apply on larger DataFrames, though.
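
A note for readers on recent pandas: the level keyword of sum was deprecated in pandas 1.3 and removed in pandas 2.0, so the sum(level=0) call raises an error there. A minimal sketch of the equivalent, assuming a small illustrative DataFrame (the data below is made up from the question's visible rows), is to group on the first index level explicitly:

```python
import pandas as pd

# Illustrative data only; not the asker's full dataset.
df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'Population': [100, 50, 40, 180, 150, 56],
})

top2 = df.groupby('State')['Population'].nlargest(2)

# Group again on the first index level (State) and sum within each state.
result = top2.groupby(level=0).sum()
print(result)
```

This spells out the second grouping step that sum(level=0) performed implicitly.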

Using the following setup:

import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})

I get the following timings:

In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
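
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. It is not the benchmark that produced the numbers above: the helper function names are hypothetical, n is reduced so it runs quickly, and groupby(level=0).sum() stands in for sum(level=0), which recent pandas versions no longer accept. Timings will vary by machine and pandas version.

```python
import timeit
from string import ascii_letters

import numpy as np
import pandas as pd

n = 10**5  # smaller than the 10**6 used above so the sketch runs quickly
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.choice(list(ascii_letters), size=n),
                   'B': rng.integers(10**7, size=n)})

def with_apply():
    # nlargest and sum both run inside each group
    return df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())

def with_level_groupby():
    # explicit second groupby on the first index level
    return df.groupby('A')['B'].nlargest(2).groupby(level=0).sum()

for func in (with_apply, with_level_groupby):
    seconds = timeit.timeit(func, number=5) / 5
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```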

Answered by aquaraga

Using agg, the grouping logic looks like:

df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

This results in another DataFrame, which you could query to find the most populous states, etc.

           Population
State
Alabama    150
Wyoming    330
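
As a hedged illustration of querying that result, the sketch below sorts the aggregated DataFrame and picks out the top state. The sample data and the variable names (top2, ranked) are assumptions made for the example, not part of the original answer.

```python
import pandas as pd

# Illustrative data only; not the asker's full dataset.
df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'Population': [100, 50, 40, 180, 150, 56],
})

top2 = df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

# Rank states by the combined population of their two largest counties.
ranked = top2.sort_values('Population', ascending=False)
print(ranked)

# Index label of the state with the largest combined top-2 population.
print(ranked['Population'].idxmax())
```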