Python: groupby weighted average and sum in a pandas dataframe

Question

Asked by samsri

I have a dataframe:

    Out[78]: 
   contract month year  buys  adjusted_lots    price
0         W     Z    5  Sell             -5   554.85
1         C     Z    5  Sell             -3   424.50
2         C     Z    5  Sell             -2   424.00
3         C     Z    5  Sell             -2   423.75
4         C     Z    5  Sell             -3   423.50
5         C     Z    5  Sell             -2   425.50
6         C     Z    5  Sell             -3   425.25
7         C     Z    5  Sell             -2   426.00
8         C     Z    5  Sell             -2   426.75
9        CC     U    5   Buy              5  3328.00
10       SB     V    5   Buy              5    11.65
11       SB     V    5   Buy              5    11.64
12       SB     V    5   Buy              2    11.60

I need the sum of adjusted_lots and the average of price weighted by adjusted_lots, grouped by all the other columns, i.e. grouped by (contract, month, year and buys).

A similar solution in R was achieved with the following code, using dplyr, but I am unable to do the same in pandas.

> newdf = df %>%
  select ( contract , month , year , buys , adjusted_lots , price ) %>%
  group_by( contract , month , year ,  buys) %>%
  summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )

> newdf
Source: local data frame [4 x 6]

  contract month year comdty qty     avgpx
1        C     Z    5 Comdty -19  424.8289
2       CC     U    5 Comdty   5 3328.0000
3       SB     V    5 Comdty  12   11.6375
4        W     Z    5 Comdty  -5  554.8500

Is the same possible with groupby or any other solution?

Answer 1

Accepted answer by jrjc

EDIT: updated the aggregation so it works with recent versions of pandas.

To apply multiple functions to a groupby object, pass one keyword argument per output column, each holding a tuple of the column the function applies to and the aggregation function:

import numpy as np

# Define a lambda function to compute the weighted mean of a group;
# x.index selects the matching weights from the original dataframe:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)

# Groupby and aggregate with named aggregation (pandas 0.25 or later):
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
                                                      price_weighted_mean=("price", wm))

                          adjusted_lots  price_weighted_mean
contract month year buys                                    
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5           554.850000
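
For anyone who wants to run the above end to end, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the question's table, and wm is written as a named function instead of a lambda; both are my choices for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "contract": ["W"] + ["C"] * 8 + ["CC"] + ["SB"] * 3,
    "month": ["Z"] * 9 + ["U"] + ["V"] * 3,
    "year": [5] * 13,
    "buys": ["Sell"] * 9 + ["Buy"] * 4,
    "adjusted_lots": [-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
    "price": [554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25,
              426.00, 426.75, 3328.00, 11.65, 11.64, 11.60],
})

def wm(x):
    # x is the price Series of one group; its index picks the matching
    # weights out of the full dataframe.
    return np.average(x, weights=df.loc[x.index, "adjusted_lots"])

result = df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=("adjusted_lots", "sum"),
    price_weighted_mean=("price", wm),
)
print(result)
```

Note that wm reads the weights from the outer df, so it only works when the grouped frame and df share the same index.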

You can see more here:

http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

and in a similar question here:

Apply multiple functions to multiple groupby columns

Hope this helps

Answer 2

Answer by ErnestScribbler

Computing a weighted average with groupby(...).apply(...) can be very slow (about 100x slower than the following). See my answer (and others) on this thread.

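import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    # Temporary helper columns: weighted values, and weights zeroed where the data is missing
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    # Vectorized group sums avoid a slow Python call per group
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    # Drop the helper columns so the caller's dataframe is left as it was
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result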
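
As a rough sketch of how this approach might be used on the question's data, here is a variant that builds the helper columns on a copy via assign instead of adding and deleting them on the caller's frame. The name weighted_average_no_mutation and the helper column names _wx and _w are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

def weighted_average_no_mutation(df, data_col, weight_col, by_col):
    # Hypothetical variant: helper columns live on a temporary copy.
    tmp = df.assign(
        _wx=df[data_col] * df[weight_col],
        # Weights count as zero wherever the data value is missing.
        _w=df[weight_col] * df[data_col].notna(),
    )
    g = tmp.groupby(by_col)
    return g["_wx"].sum() / g["_w"].sum()

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "contract": ["W"] + ["C"] * 8 + ["CC"] + ["SB"] * 3,
    "month": ["Z"] * 9 + ["U"] + ["V"] * 3,
    "year": [5] * 13,
    "buys": ["Sell"] * 9 + ["Buy"] * 4,
    "adjusted_lots": [-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
    "price": [554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25,
              426.00, 426.75, 3328.00, 11.65, 11.64, 11.60],
})

keys = ["contract", "month", "year", "buys"]
avg = weighted_average_no_mutation(df, "price", "adjusted_lots", keys)
print(avg)

# A missing data value drops out of both the numerator and the denominator.
small = pd.DataFrame({"g": ["a", "a", "a"],
                      "x": [1.0, np.nan, 3.0],
                      "w": [1, 5, 1]})
nan_avg = weighted_average_no_mutation(small, "x", "w", "g")
print(nan_avg)
```

The result is a Series indexed by the grouping columns; the sum of adjusted_lots can be obtained separately with df.groupby(keys)["adjusted_lots"].sum().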

Answer 3

Answer by Mark Greenwood

The solution that uses a dict of aggregation functions with renaming is deprecated and will be removed in a future version of pandas (the warning below is from version 0.22):

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

Use groupby with apply and return a Series to rename columns, as discussed in: Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")

def my_agg(x):
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum()/x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])

produces the same result:

>df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)

                          weighted_ave_price
contract month year buys                    
C        Z     5    Sell          424.828947
CC       U     5    Buy          3328.000000
SB       V     5    Buy            11.637500
W        Z     5    Sell          554.850000
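
The question also asks for the sum of adjusted_lots. One possible extension of my_agg, sketched here under my own assumptions, returns both values in one Series. The output names qty and avgpx are borrowed from the R code in the question, and selecting the two value columns before apply is my choice to keep the grouping columns out of the applied function.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "contract": ["W"] + ["C"] * 8 + ["CC"] + ["SB"] * 3,
    "month": ["Z"] * 9 + ["U"] + ["V"] * 3,
    "year": [5] * 13,
    "buys": ["Sell"] * 9 + ["Buy"] * 4,
    "adjusted_lots": [-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
    "price": [554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25,
              426.00, 426.75, 3328.00, 11.65, 11.64, 11.60],
})

def my_agg(x):
    # x is the sub-dataframe of one group.
    lots = x["adjusted_lots"]
    return pd.Series({
        "qty": lots.sum(),
        "avgpx": (lots * x["price"]).sum() / lots.sum(),
    })

keys = ["contract", "month", "year", "buys"]
newdf = df.groupby(keys)[["adjusted_lots", "price"]].apply(my_agg)
print(newdf)
```

Because each returned Series mixes an integer and a float, qty comes back as a float column; cast it with astype(int) if integer lots are needed.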
