
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15262134/


Apply different functions to different items in group object: Python pandas

Tags: python, function, group-by, pandas, apply

Asked by kunitomo

Suppose I have a dataframe as follows:


In [1]: test_dup_df

Out[1]:
                  exe_price exe_vol flag 
2008-03-13 14:41:07  84.5    200     yes
2008-03-13 14:41:37  85.0    10000   yes
2008-03-13 14:41:38  84.5    69700   yes
2008-03-13 14:41:39  84.5    1200    yes
2008-03-13 14:42:00  84.5    1000    yes
2008-03-13 14:42:08  84.5    300     yes
2008-03-13 14:42:10  84.5    88100   yes
2008-03-13 14:42:10  84.5    11900   yes
2008-03-13 14:42:15  84.5    5000    yes
2008-03-13 14:42:16  84.5    3200    yes 

I want to group the duplicate data at time 14:42:10 and apply different functions to exe_price and exe_vol (e.g., sum exe_vol and compute the volume-weighted average of exe_price). I know that I can do


In [2]: grouped = test_dup_df.groupby(level=0)

to group the duplicate indices and then use the first() or last() functions to get either the first or the last row, but this is not really what I want.


Is there a way to group and then apply different (written by me) functions to values in different column?

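(For reference, `groupby.agg` also accepts a dict mapping each column to its own function; a minimal sketch, using the column names from the question. Note that each function only sees its own column, so a cross-column computation like a volume-weighted price still needs `apply` or a helper column, as the answers below show.)

```python
import pandas as pd

# Small frame mimicking the question's data: duplicate index at 14:42:10
idx = pd.to_datetime([
    "2008-03-13 14:42:10",
    "2008-03-13 14:42:10",
    "2008-03-13 14:42:15",
])
df = pd.DataFrame({"exe_price": [84.5, 84.5, 84.5],
                   "exe_vol": [88100, 11900, 5000]}, index=idx)

# A dict in agg() applies a different function to each column
result = df.groupby(level=0).agg({"exe_price": "mean", "exe_vol": "sum"})
```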

Answered by waitingkuo

Apply your own function:


In [11]: from pandas import Series

In [12]: def func(x):
             # volume-weighted average price, total volume, and a flag
             exe_price = (x['exe_price'] * x['exe_vol']).sum() / x['exe_vol'].sum()
             exe_vol = x['exe_vol'].sum()
             flag = True
             return Series([exe_price, exe_vol, flag], index=['exe_price', 'exe_vol', 'flag'])


In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
Out[13]:
                    exe_price exe_vol  flag
date_time                                  
2008-03-13 14:41:07      84.5     200  True 
2008-03-13 14:41:37        85   10000  True
2008-03-13 14:41:38      84.5   69700  True
2008-03-13 14:41:39      84.5    1200  True
2008-03-13 14:42:00      84.5    1000  True
2008-03-13 14:42:08      84.5     300  True
2008-03-13 14:42:10      84.5  100000  True
2008-03-13 14:42:15      84.5    5000  True
2008-03-13 14:42:16      84.5    3200  True

Answered by unutbu

I like @waitingkuo's answer because it is very clear and readable.


I'm keeping this around anyway because it does appear to be faster -- at least with Pandas version 0.10.0. The situation may (hopefully) change in the future, so be sure to rerun the benchmark especially if you are using a different version of Pandas.


import pandas as pd
import io
import timeit

data = '''\
date time       exe_price    exe_vol flag
2008-03-13 14:41:07  84.5    200     yes
2008-03-13 14:41:37  85.0    10000   yes
2008-03-13 14:41:38  84.5    69700   yes
2008-03-13 14:41:39  84.5    1200    yes
2008-03-13 14:42:00  84.5    1000    yes
2008-03-13 14:42:08  84.5    300     yes
2008-03-13 14:42:10  10    88100   yes
2008-03-13 14:42:10  100    11900   yes
2008-03-13 14:42:15  84.5    5000    yes
2008-03-13 14:42:16  84.5    3200    yes'''

df = pd.read_table(io.StringIO(data), sep='\s+', parse_dates=[[0, 1]],
                   index_col=0)


def func(subf):
    exe_vol = subf['exe_vol'].sum()
    exe_price = ((subf['exe_price']*subf['exe_vol']).sum()
                 / exe_vol)
    flag = True
    return pd.Series([exe_price, exe_vol, flag],
                     index=['exe_price', 'exe_vol', 'flag'])
    # return exe_price

def using_apply():
    return df.groupby(df.index).apply(func)

def using_helper_column():
    df['weight'] = df['exe_price'] * df['exe_vol']
    grouped = df.groupby(level=0, group_keys=True)
    result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
    result['exe_price'] = result['weight'] / result['exe_vol']
    result['flag'] = True
    result = result.drop(['weight'], axis=1)
    return result

result = using_apply()
print(result)
result = using_helper_column()
print(result)

time_apply = timeit.timeit('m.using_apply()',
                      'import __main__ as m ',
                      number=1000)
time_helper = timeit.timeit('m.using_helper_column()',
                      'import __main__ as m ',
                      number=1000)
print('using_apply: {t}'.format(t = time_apply))
print('using_helper_column: {t}'.format(t = time_helper))

yields


                     exe_vol  exe_price  flag
date_time                                    
2008-03-13 14:41:07      200      84.50  True
2008-03-13 14:41:37    10000      85.00  True
2008-03-13 14:41:38    69700      84.50  True
2008-03-13 14:41:39     1200      84.50  True
2008-03-13 14:42:00     1000      84.50  True
2008-03-13 14:42:08      300      84.50  True
2008-03-13 14:42:10   100000      20.71  True
2008-03-13 14:42:15     5000      84.50  True
2008-03-13 14:42:16     3200      84.50  True

with timeit benchmarks of:


using_apply: 3.0081038475
using_helper_column: 1.35300707817
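The gap is plausible because `apply` calls a Python function once per group, while the helper-column version stays inside pandas' built-in, vectorized aggregations. A minimal sketch of the same weighted average done entirely with built-in aggregations, on a small made-up frame reusing the 14:42:10 prices from the benchmark data:

```python
import pandas as pd

idx = pd.to_datetime([
    "2008-03-13 14:42:10",
    "2008-03-13 14:42:10",
    "2008-03-13 14:42:16",
])
df = pd.DataFrame({"exe_price": [10.0, 100.0, 84.5],
                   "exe_vol": [88100, 11900, 3200]}, index=idx)

# Weighted mean = sum(price * vol) / sum(vol), computed per timestamp
# with built-in sums only (no per-group Python callback).
wsum = (df["exe_price"] * df["exe_vol"]).groupby(df.index).sum()
vsum = df["exe_vol"].groupby(df.index).sum()
vwap = wsum / vsum
```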

Answered by askewchan

Not terribly familiar with pandas, but in pure numpy you could do:


import numpy as np

tot_vol = np.sum(grouped['exe_vol'])
avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])
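A self-contained version of that idea on plain NumPy arrays (values hand-copied from the 14:42:10 group in the benchmark data, since the `grouped` object above is a pandas grouping rather than raw arrays):

```python
import numpy as np

exe_price = np.array([10.0, 100.0])
exe_vol = np.array([88100.0, 11900.0])

tot_vol = exe_vol.sum()
# np.average with weights computes sum(price * vol) / sum(vol)
avg_price = np.average(exe_price, weights=exe_vol)
```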