使用 Pandas 中的数据透视表加权平均值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 
原文地址: http://stackoverflow.com/questions/22032153/
Warning: these are provided under cc-by-sa 4.0 license.  You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Weighted average using pivot tables in pandas
提问by user308827
I have written some code to compute a weighted average using pivot tables in pandas. However, I am not sure how to add the actual column which performs the weighted averaging (Add a new column where each row contains value of 'cumulative'/'COUNT').
我已经编写了一些代码来使用 Pandas 中的数据透视表计算加权平均值。但是,我不确定如何添加执行加权平均的实际列(添加一个新列,其中每行包含 'cumulative'/'COUNT' 的值)。
The data looks like so:
数据看起来像这样:
VALUE   COUNT   GRID    agb
1       43      1476    1051
2       212     1476    2983
5       7       1477    890
4       1361    1477    2310
Here is my code:
这是我的代码:
# Read input data
lup_df  = pandas.DataFrame.from_csv(o_dir+LUP+'.csv',index_col=False)
# Insert a new column with area * variable
lup_df['cumulative'] = lup_df['COUNT']*lup_df['agb']
# Create and output pivot table
lup_pvt = pandas.pivot_table(lup_df, 'agb', rows=['GRID'])         
# TODO: Add a new column where each row contains value of 'cumulative'/'COUNT'
lup_pvt.to_csv(o_dir+PIVOT+'.csv',index=True,header=True,sep=',')
How can I do this?
我怎样才能做到这一点?
回答by 8one6
So you want, for each value of grid, the weighted average of the agbcolumn where the weights are the values in the countcolumn.  If that interpretation is correct, I think this does the trick with groupby:
因此,对于 的每个值,您需要列grid的加权平均值,agb其中权重是列中的值count。如果这种解释是正确的,我认为这可以解决以下问题groupby:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
                   'grid': np.random.choice(np.arange(10)+50, n),
                   'value': np.random.randn(n) + 12})
df['prod'] = df['count'] * df['value']
grouped = df.groupby('grid').sum()
grouped['wtdavg'] = grouped['prod'] / grouped['count']
print grouped
      count       value        prod     wtdavg
grid                                          
50       22   57.177042  243.814417  11.082474
51       27   58.801386  318.644085  11.801633
52       11   34.202619  135.127942  12.284358
53       24   59.340084  272.836636  11.368193
54       39  137.268317  482.954857  12.383458
55       47   79.468986  531.122652  11.300482
56       17   38.624369  214.188938  12.599349
57       22   38.572429  279.948202  12.724918
58       27   36.492929  327.315518  12.122797
59       34   60.851671  408.306429  12.009013
Or, if you want to be a bit slick and write a weighted average function you can use over and over:
或者,如果您想变得灵活一点并编写一个加权平均函数,您可以一遍又一遍地使用:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
                   'grid': np.random.choice(np.arange(10)+50, n),
                   'value': np.random.randn(n) + 12})
def wavg(val_col_name, wt_col_name):
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wtd_avg'
    return inner
slick = df.groupby('grid').apply(wavg('value', 'count'))
print slick
grid
50      11.082474
51      11.801633
52      12.284358
53      11.368193
54      12.383458
55      11.300482
56      12.599349
57      12.724918
58      12.122797
59      12.009013
dtype: float64

