Python Pandas groupby nlargest sum

Question

Asked by user7102752

I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I am having trouble making it work.

State    County    Population
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

I want to use groupby to group by state, then get the top 2 counties by population in each state. Then I want to sum only those top 2 county populations to get a total for that state.

In the end, I'll have a list containing each state and the combined population of its top 2 counties.

I can get groupby and nlargest to work, but getting the sum of the nlargest(2) result is a challenge.

The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

Answer 1

Answered by root

You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a plain Series (indexed by State and the original row label) rather than a grouped object, so you can no longer do group-level operations on it directly. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.

The resulting output:

State
Alabama    150
Wyoming    330
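
To reproduce this, here is a minimal sketch. It assumes the question's table is loaded into a DataFrame named df and contains only the Alabama and Wyoming rows shown there (the elided states are left out):

```python
import pandas as pd

# Only the rows visible in the question; the states elided by "..." are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# For each state, keep the two largest populations and add them up.
top2_sum = df.groupby("State")["Population"].apply(
    lambda grp: grp.nlargest(2).sum()
)
print(top2_sum)
```

The result is a Series indexed by State, so top2_sum["Alabama"] looks up a single state's total.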

EDIT

A slightly cleaner approach, as suggested by @c???s????:

df.groupby('State')['Population'].nlargest(2).sum(level=0)

This is slightly slower than using apply on larger DataFrames, though.

Using the following setup:

import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})

I get the following timings:

In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
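
Note that the level argument of sum was deprecated in pandas 1.3 and removed in pandas 2.0. On newer versions, the second groupby can be written out explicitly. The following is a sketch of that equivalent, again assuming a df holding only the rows shown in the question:

```python
import pandas as pd

# Only the rows visible in the question; the states elided by "..." are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest on a grouped column returns a Series whose index has two levels:
# the group key (State) and the original row label.
top2 = df.groupby("State")["Population"].nlargest(2)
print(top2.index.nlevels)  # 2

# Group again on the first index level (State) and sum within each state.
top2_sum = top2.groupby(level=0).sum()
print(top2_sum)
```

This makes the two grouping passes visible, which is consistent with the timing difference measured above.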

Answer 2

Answered by aquaraga

Using agg, the grouping logic looks like:

df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

This results in another DataFrame, which you could query to find the most populous states, etc.

           Population
State
Alabama    150
Wyoming    330
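
As one way to do that querying, the sketch below uses named aggregation (available since pandas 0.25) and then sorts the result. The column name top2_population is made up for this example, and df again holds only the rows shown in the question:

```python
import pandas as pd

# Only the rows visible in the question; the states elided by "..." are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# Named aggregation: output column name = (input column, aggregation function).
# "top2_population" is a hypothetical name chosen for illustration.
result = df.groupby("State").agg(
    top2_population=("Population", lambda s: s.nlargest(2).sum())
)

# Rank states by the combined population of their two largest counties.
ranked = result.sort_values("top2_population", ascending=False)
print(ranked)
```

Giving the output column its own name avoids confusing the summed values with the raw Population column.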
