Apply different functions to different items in group object: Python pandas
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/15262134/
Asked by kunitomo
Suppose I have a dataframe as follows:
In [1]: test_dup_df
Out[1]:
exe_price exe_vol flag
2008-03-13 14:41:07 84.5 200 yes
2008-03-13 14:41:37 85.0 10000 yes
2008-03-13 14:41:38 84.5 69700 yes
2008-03-13 14:41:39 84.5 1200 yes
2008-03-13 14:42:00 84.5 1000 yes
2008-03-13 14:42:08 84.5 300 yes
2008-03-13 14:42:10 84.5 88100 yes
2008-03-13 14:42:10 84.5 11900 yes
2008-03-13 14:42:15 84.5 5000 yes
2008-03-13 14:42:16 84.5 3200 yes
I want to group the duplicate data at time 14:42:10 and apply different functions to exe_price and exe_vol (e.g., sum exe_vol and compute the volume-weighted average of exe_price). I know that I can do
In [2]: grouped = test_dup_df.groupby(level=0)
to group the duplicate indices and then use the first() or last() functions to get either the first or the last row, but this is not really what I want.
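For example, something along these lines (illustrative only) just keeps one of the two 14:42:10 rows instead of combining them:
In [3]: grouped.first()   # or grouped.last(); keeps exe_vol 88100 (or 11900) rather than the combined 100000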
Is there a way to group and then apply different (written by me) functions to the values in different columns?
Answered by waitingkuo
Apply your own function:
In [11]: from pandas import Series

In [12]: def func(x):
   ....:     # volume-weighted average price, total volume, and a flag for each group
   ....:     exe_price = (x['exe_price'] * x['exe_vol']).sum() / x['exe_vol'].sum()
   ....:     exe_vol = x['exe_vol'].sum()
   ....:     flag = True
   ....:     return Series([exe_price, exe_vol, flag],
   ....:                   index=['exe_price', 'exe_vol', 'flag'])
In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
Out[13]:
exe_price exe_vol flag
date_time
2008-03-13 14:41:07 84.5 200 True
2008-03-13 14:41:37 85 10000 True
2008-03-13 14:41:38 84.5 69700 True
2008-03-13 14:41:39 84.5 1200 True
2008-03-13 14:42:00 84.5 1000 True
2008-03-13 14:42:08 84.5 300 True
2008-03-13 14:42:10 20.71 100000 True
2008-03-13 14:42:15 84.5 5000 True
2008-03-13 14:42:16 84.5 3200 True
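As a follow-up: when each function only needs its own column, a plain column-to-function dict passed to agg also works; the volume-weighted price needs both columns at once, which is why apply is used above. A minimal sketch, assuming test_dup_df is the frame from the question:
grouped = test_dup_df.groupby(level=0)
result = grouped.agg({'exe_vol': 'sum',      # one function for this column
                      'exe_price': 'mean',   # a different one ('mean' is just a placeholder, not the weighted price)
                      'flag': 'last'})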
Answered by unutbu
I like @waitingkuo's answer because it is very clear and readable.
I'm keeping this around anyway because it does appear to be faster -- at least with Pandas version 0.10.0. (The apply version calls a Python-level function once per group, while the helper-column version relies on the built-in, vectorized sum.) The situation may (hopefully) change in the future, so be sure to rerun the benchmark, especially if you are using a different version of Pandas.
import pandas as pd
import io
import timeit

data = '''\
date time exe_price exe_vol flag
2008-03-13 14:41:07 84.5 200 yes
2008-03-13 14:41:37 85.0 10000 yes
2008-03-13 14:41:38 84.5 69700 yes
2008-03-13 14:41:39 84.5 1200 yes
2008-03-13 14:42:00 84.5 1000 yes
2008-03-13 14:42:08 84.5 300 yes
2008-03-13 14:42:10 10 88100 yes
2008-03-13 14:42:10 100 11900 yes
2008-03-13 14:42:15 84.5 5000 yes
2008-03-13 14:42:16 84.5 3200 yes'''

# Combine the date and time columns into a single DatetimeIndex.
# (Under Python 3, use io.StringIO(data) instead of io.BytesIO.)
df = pd.read_table(io.BytesIO(data), sep=r'\s+', parse_dates=[[0, 1]],
                   index_col=0)

def func(subf):
    # Total volume and volume-weighted average price for one group.
    exe_vol = subf['exe_vol'].sum()
    exe_price = ((subf['exe_price'] * subf['exe_vol']).sum()
                 / exe_vol)
    flag = True
    return pd.Series([exe_price, exe_vol, flag],
                     index=['exe_price', 'exe_vol', 'flag'])
    # return exe_price

def using_apply():
    return df.groupby(df.index).apply(func)

def using_helper_column():
    # Precompute price*volume so the weighted average reduces to two built-in sums.
    df['weight'] = df['exe_price'] * df['exe_vol']
    grouped = df.groupby(level=0, group_keys=True)
    result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
    result['exe_price'] = result['weight'] / result['exe_vol']
    result['flag'] = True
    result = result.drop(['weight'], axis=1)
    return result

result = using_apply()
print(result)
result = using_helper_column()
print(result)

time_apply = timeit.timeit('m.using_apply()',
                           'import __main__ as m',
                           number=1000)
time_helper = timeit.timeit('m.using_helper_column()',
                            'import __main__ as m',
                            number=1000)

print('using_apply: {t}'.format(t=time_apply))
print('using_helper_column: {t}'.format(t=time_helper))
yields
exe_vol exe_price flag
date_time
2008-03-13 14:41:07 200 84.50 True
2008-03-13 14:41:37 10000 85.00 True
2008-03-13 14:41:38 69700 84.50 True
2008-03-13 14:41:39 1200 84.50 True
2008-03-13 14:42:00 1000 84.50 True
2008-03-13 14:42:08 300 84.50 True
2008-03-13 14:42:10 100000 20.71 True
2008-03-13 14:42:15 5000 84.50 True
2008-03-13 14:42:16 3200 84.50 True
with timeit benchmarks of:
using_apply: 3.0081038475
using_helper_column: 1.35300707817
Answered by askewchan
Not terribly familiar with pandas, but in pure numpy you could do:
import numpy as np

# 'grouped' here is the per-timestamp group from the question's groupby.
tot_vol = np.sum(grouped['exe_vol'])
avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])
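The snippet above still pulls the columns out of the pandas group object; a rough pure-numpy sketch of the same grouping (illustrative only, assuming the raw arrays are taken from test_dup_df) could use np.unique and np.bincount:
import numpy as np

times = test_dup_df.index.values                 # timestamps, with duplicates
prices = test_dup_df['exe_price'].values
vols = test_dup_df['exe_vol'].values.astype(float)

# Map each row to its group: uniq[inv[i]] == times[i]
uniq, inv = np.unique(times, return_inverse=True)

tot_vol = np.bincount(inv, weights=vols)                       # sum of exe_vol per timestamp
avg_price = np.bincount(inv, weights=prices * vols) / tot_vol  # volume-weighted exe_price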

