pandas DataFrame conditional .mean() depending on values in a certain column

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/29778370/

Date: 2020-09-13 23:14:28  Source: igfitidea

Tags: python, pandas, conditional, mean

Asked by tpapz

I'm trying to create a new column which returns the mean of values from an existing column in the same df. However, the mean should be computed per group, where the groups are defined by three other columns.

Out[184]: 
   YEAR daytype hourtype  scenario  option_value    
0  2015     SAT     of_h         0      0.134499       
1  2015     SUN     of_h         1     63.019250      
2  2015     WD      of_h         2     52.113516       
3  2015     WD      pk_h         3     43.126513       
4  2015     SAT     of_h         4     56.431392 

Basically, I would like a new column 'mean' which holds the mean of "option_value" over all rows that share the same "YEAR", "daytype", and "hourtype".

I tried the following approach but without success ...

In [185]: o2['premium']=o2.groupby(['YEAR', 'daytype', 'hourtype'])['option_cf'].mean()

TypeError: incompatible index of inserted column with frame index

Accepted answer by Zero

Here's one way to do it:

In [19]: def cust_mean(grp):
   ....:     grp['mean'] = grp['option_value'].mean()
   ....:     return grp
   ....:

In [20]: o2.groupby(['YEAR', 'daytype', 'hourtype']).apply(cust_mean)
Out[20]:
   YEAR daytype hourtype  scenario  option_value       mean
0  2015     SAT     of_h         0      0.134499  28.282946
1  2015     SUN     of_h         1     63.019250  63.019250
2  2015      WD     of_h         2     52.113516  52.113516
3  2015      WD     pk_h         3     43.126513  43.126513
4  2015     SAT     of_h         4     56.431392  28.282946

So, what was going wrong with your attempt?

It returns an aggregate with a different shape from the original DataFrame: one row per group rather than one row per original row.

In [21]: o2.groupby(['YEAR', 'daytype', 'hourtype'])['option_value'].mean()
Out[21]:
YEAR  daytype  hourtype
2015  SAT      of_h        28.282946
      SUN      of_h        63.019250
      WD       of_h        52.113516
               pk_h        43.126513
Name: option_value, dtype: float64
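
To make the shape mismatch concrete, here is a small sketch that rebuilds the sample frame from the question and compares lengths. The variable names `keys` and `agg` are only illustrative and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the question's Out[184]
o2 = pd.DataFrame({
    "YEAR": [2015] * 5,
    "daytype": ["SAT", "SUN", "WD", "WD", "SAT"],
    "hourtype": ["of_h", "of_h", "of_h", "pk_h", "of_h"],
    "scenario": [0, 1, 2, 3, 4],
    "option_value": [0.134499, 63.019250, 52.113516, 43.126513, 56.431392],
})

keys = ["YEAR", "daytype", "hourtype"]  # illustrative name for the grouping columns
agg = o2.groupby(keys)["option_value"].mean()

print(len(o2))            # 5 rows in the original frame
print(len(agg))           # 4 rows, one per (YEAR, daytype, hourtype) group
print(agg.index.nlevels)  # 3: the aggregate is indexed by the grouping keys
```

The aggregate carries a three-level index built from the grouping keys, while the frame still has its default integer index, so the two cannot be lined up row by row.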

Or use transform:

In [1461]: o2['premium'] = (o2.groupby(['YEAR', 'daytype', 'hourtype'])['option_value']
                              .transform('mean'))

In [1462]: o2
Out[1462]:
   YEAR daytype hourtype  scenario  option_value    premium
0  2015     SAT     of_h         0      0.134499  28.282946
1  2015     SUN     of_h         1     63.019250  63.019250
2  2015      WD     of_h         2     52.113516  52.113516
3  2015      WD     pk_h         3     43.126513  43.126513
4  2015     SAT     of_h         4     56.431392  28.282946
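
As a rough check of why the transform result can be assigned directly, the sketch below rebuilds the sample frame and confirms that the result keeps the frame's own index. The names `keys` and `premium` as standalone variables are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
o2 = pd.DataFrame({
    "YEAR": [2015] * 5,
    "daytype": ["SAT", "SUN", "WD", "WD", "SAT"],
    "hourtype": ["of_h", "of_h", "of_h", "pk_h", "of_h"],
    "scenario": [0, 1, 2, 3, 4],
    "option_value": [0.134499, 63.019250, 52.113516, 43.126513, 56.431392],
})

keys = ["YEAR", "daytype", "hourtype"]

# transform broadcasts each group's mean back to every row of that group
premium = o2.groupby(keys)["option_value"].transform("mean")

print(premium.index.equals(o2.index))  # True: same index as the original frame
o2["premium"] = premium
```

Because the result has one value per original row and shares the frame's index, no realignment is needed when it is stored as a new column.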

Answer by KieranPC

You can do it the way you intended by tweaking your code in the following way:

o2 = o2.set_index(['YEAR', 'daytype', 'hourtype'])

o2['premium'] = o2.groupby(level=['YEAR', 'daytype', 'hourtype'])['option_value'].mean()

Why the original error? As explained by John Galt, the data coming out of groupby().mean() is not the same shape (length) as the original DataFrame.

Pandas can handle this cleverly if you start with the 'grouping columns' in the index. Then it knows how to propagate the mean data correctly, because the assignment aligns the group means with the rows by index label.

John's solution follows the same logic, because groupby naturally puts the grouping columns in the index during execution.
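
A self-contained sketch of this index-based approach, using the sample data from the question, might look like the following. The intermediate names `indexed` and `result`, and the final `reset_index()` step that moves the keys back into ordinary columns, are additions for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
o2 = pd.DataFrame({
    "YEAR": [2015] * 5,
    "daytype": ["SAT", "SUN", "WD", "WD", "SAT"],
    "hourtype": ["of_h", "of_h", "of_h", "pk_h", "of_h"],
    "scenario": [0, 1, 2, 3, 4],
    "option_value": [0.134499, 63.019250, 52.113516, 43.126513, 56.431392],
})

keys = ["YEAR", "daytype", "hourtype"]

# Put the grouping columns in the index so frame and aggregate share labels
indexed = o2.set_index(keys)

# The aggregate is indexed by the same three keys, so assignment aligns on them
indexed["premium"] = indexed.groupby(level=keys)["option_value"].mean()

# Optional: move the keys back into ordinary columns
result = indexed.reset_index()
print(result)
```

If keeping the original integer index matters, the `transform('mean')` form avoids the round trip through `set_index` and `reset_index`.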
