Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow source: http://stackoverflow.com/questions/40183800/

Time: 2020-09-14 02:16:03  Source: igfitidea

Pandas: Difference between largest and smallest value within group

Tags: python, pandas, numpy

Asked by David

Given a data frame that looks like this

GROUP VALUE
  1     5
  2     2
  1     10
  2     20
  1     7

I would like to compute the difference between the largest and smallest value within each group. That is, the result should be

GROUP   DIFF
  1      5
  2      18

What is an easy way to do this in Pandas?

What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?

Answered by piRSquared

Per the timings, unutbu's solution is best over large data sets.

Using @unutbu's df:

import pandas as pd
import numpy as np

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

df.groupby('GROUP')['VALUE'].agg(np.ptp)

GROUP
1     5
2    18
Name: VALUE, dtype: int64


np.ptp (see the NumPy docs) returns the range of an array, that is, its maximum minus its minimum.
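
As a quick check of what np.ptp computes, the sketch below applies it to a plain NumPy array holding the VALUE entries of group 1 from the question and compares the result with an explicit max minus min. The variable names are illustrative only.

```python
import numpy as np

# VALUE entries that belong to GROUP 1 in the question's data frame
values = np.array([5, 10, 7])

# "ptp" stands for peak-to-peak: the maximum minus the minimum
value_range = np.ptp(values)
explicit_range = values.max() - values.min()

print(value_range)     # 5
print(explicit_range)  # 5
```

This is the same quantity that agg(np.ptp) evaluates once per group.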



Timing

small df

[image: timing results]

large df
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

[image: timing results]

large df, many groups
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))

[image: timing results]
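
The timings above were originally shown as images. A rough way to rerun the comparison yourself is sketched below with the standard timeit module. This is only a sketch under assumptions: the helper names (ptp_agg, max_minus_min, apply_lambda) are made up for illustration, and the frame is deliberately scaled down to 100,000 rows and 1,000 groups so it finishes quickly. Raise n_rows and n_groups to match the cases above; absolute numbers will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a scaled-down version of the "large df, many groups" case
n_rows, n_groups = 100_000, 1_000
rng = np.random.default_rng(0)
df = pd.DataFrame({'GROUP': np.arange(n_rows) % n_groups,
                   'VALUE': rng.random(n_rows)})


def ptp_agg(frame):
    # Calls np.ptp once per group
    return frame.groupby('GROUP')['VALUE'].agg(np.ptp)


def max_minus_min(frame):
    # Uses the built-in groupby max and min, then subtracts
    grouped = frame.groupby('GROUP')['VALUE']
    return grouped.max() - grouped.min()


def apply_lambda(frame):
    # Calls a Python lambda once per group
    return frame.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())


candidates = {
    'agg(np.ptp)': ptp_agg,
    'max - min': max_minus_min,
    'apply(lambda)': apply_lambda,
}

results = {}
for label, func in candidates.items():
    results[label] = func(df)
    # Best of three single runs, in seconds
    seconds = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print(f'{label:>14}: {seconds:.4f} s')
```

All three candidates should return the same per-group differences, so the comparison is only about speed.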

Answered by unutbu

groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:

import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])

yields

       diff
GROUP      
1         5
2        18
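
If the two-column GROUP / DIFF layout from the question is wanted, one possible variation on the same max-then-subtract idea is sketched below. The column name DIFF comes from the question; calling the built-in max() and min() directly on the grouped column instead of going through agg is an assumption of this sketch, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

grouped = df.groupby('GROUP')['VALUE']

# Both Series are indexed by GROUP, so the subtraction aligns per group.
# rename() labels the values, reset_index() turns GROUP back into a column.
diff = (grouped.max() - grouped.min()).rename('DIFF').reset_index()

print(diff)
```

Here diff is a regular DataFrame with one row per group, which can be merged back onto the original frame if needed.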

Answered by ASGM

You can use groupby(), min(), and max():

df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())