Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33657809/

Date: 2020-09-14 00:13:13  Source: igfitidea

Calculate weighted average with pandas dataframe

Tags: python, pandas, weighted-average

Asked by Cecilia

I have the following pandas dataframe:

import pandas as pd

data_df = pd.DataFrame({'ind':['la','p','la','la','p','g','g','la'],
                        'dist':[10.,5.,7.,8.,7.,2.,5.,3.],
                        'diff':[0.54,3.2,8.6,7.2,2.1,1.,3.5,4.5],
                        'cas':[1.,2.,3.,4.,5.,6.,7.,8.]})

That is:

    cas  diff  dist ind
0    1  0.54    10  la
1    2  3.20     5   p
2    3  8.60     7  la
3    4  7.20     8  la
4    5  2.10     7   p
5    6  1.00     2   g
6    7  3.50     5   g
7    8  4.50     3  la

I need to compute, for each group defined by 'ind', the weighted average of every other column, using the 'dist' column as the weights.

For example, for 'ind' = 'la' and the 'diff' column:

((0.54*10)+(8.60*7)+(7.20*8)+(4.50*3))/(10+7+8+3) = 4.882143
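
The arithmetic above can be checked with a few lines of NumPy. This is only a sketch: the array names `diff_la` and `dist_la` are made up for illustration, and the numbers are copied by hand from the rows where 'ind' is 'la'.

```python
import numpy as np

# Values and weights for the rows where ind == 'la' (copied from the table above)
diff_la = np.array([0.54, 8.60, 7.20, 4.50])
dist_la = np.array([10.0, 7.0, 8.0, 3.0])

# Weighted average by hand: sum(value * weight) / sum(weight)
manual = float((diff_la * dist_la).sum() / dist_la.sum())

# np.average computes the same quantity when it is given weights
via_numpy = float(np.average(diff_la, weights=dist_la))

print(round(manual, 6), round(via_numpy, 6))
```

Both numbers should agree with the 4.882143 computed by hand.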

The result I want to obtain is the following:

          cas      diff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333

This is obtained by multiplying each value of each column by the corresponding value in the 'dist' column, summing the products that share the same 'ind', and then dividing each sum by the total of the 'dist' values for that same 'ind'.
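
Spelled out literally, that procedure might look like the sketch below. It is one possible reading of the description, not a definitive solution; the names `value_cols`, `weighted`, `numerator`, `denominator` and `result` are introduced here purely for illustration.

```python
import pandas as pd

data_df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                        'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                        'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                        'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})

value_cols = ['cas', 'diff']  # columns to average; 'dist' holds the weights

# Step 1: multiply each value by the 'dist' weight in the same row
weighted = data_df[value_cols].mul(data_df['dist'], axis=0)

# Step 2: sum the weighted values within each 'ind' group
numerator = weighted.groupby(data_df['ind']).sum()

# Step 3: divide by the sum of 'dist' within the same group
denominator = data_df.groupby('ind')['dist'].sum()
result = numerator.div(denominator, axis=0)

print(result)
```

Keeping the numerator and denominator as separate objects makes each step easy to inspect, at the cost of being more verbose than a single expression.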

I thought this would be an easy task for the dataframe 'groupby' method, but it is turning out to be kind of tricky.

Can someone please help me?

Accepted answer by behzad.nouri

With `df` referring to the question's `data_df`, you can obtain weights normalized within each group by using transform:

>>> df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
>>> df['weight']
0    0.357143
1    0.416667
2    0.250000
3    0.285714
4    0.583333
5    0.285714
6    0.714286
7    0.107143
Name: weight, dtype: float64

Then you just need to multiply these weights by the values and take the sum within each group:

>>> df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff'))
>>> df.groupby('ind')[['wcas', 'wdiff']].sum()
         wcas     wdiff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333
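
As a cross-check on the table above, the same numbers can also be reached without adding helper columns, by applying `np.average` to each group. This variant is not part of the original answer; it is a sketch, and `value_cols` and `weighted_mean` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                   'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                   'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                   'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})

value_cols = ['cas', 'diff']  # hypothetical name: the columns to average


def weighted_mean(group):
    # np.average with axis=0 applies the 1-D 'dist' weights down each column
    averages = np.average(group[value_cols].to_numpy(),
                          weights=group['dist'].to_numpy(), axis=0)
    return pd.Series(averages, index=value_cols)


# Selecting the needed columns first keeps the grouping key out of each group
result = df.groupby('ind')[value_cols + ['dist']].apply(weighted_mean)
print(result)
```

The transform-based approach is likely to be faster on large frames because it stays vectorized, while this version trades speed for leaving `df` untouched.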


Edit: with in-place mutation:

>>> backup = df.copy()     # keep a copy, since the next step overwrites cas and diff
>>> cols = ['cas', 'diff'] # name the columns explicitly; positional slicing depends on column order
>>> df[cols] = df['weight'].values[:, None] * df[cols]
>>> df.groupby('ind')[cols].sum()
          cas      diff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333