Pandas Efficient VWAP Calculation
Original question: http://stackoverflow.com/questions/29298789/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, link to the original question, and attribute it to the original authors on Stack Overflow.
Asked by Zhubarb
I have the code below, which calculates the volume-weighted average price (VWAP) in three lines of pandas code.
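import datetime as dt
import numpy as np
import pandas as pd
# pandas.io.data was removed from pandas; DataReader now lives in the separate pandas-datareader package
from pandas_datareader.data import DataReader
# a single ticker string returns a DataFrame with flat columns (High, Low, Close, Volume, ...)
df = DataReader('AAPL', 'yahoo', dt.datetime(2013, 12, 30), dt.datetime(2014, 12, 30))
# running volume, running volume * typical price, and their ratio
df['Cum_Vol'] = df['Volume'].cumsum()
df['Cum_Vol_Price'] = (df['Volume'] * (df['High'] + df['Low'] + df['Close']) / 3).cumsum()
df['VWAP'] = df['Cum_Vol_Price'] / df['Cum_Vol']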
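The snippet above needs a live data source. As a minimal sketch of the same three-line calculation, the example below uses a small made-up DataFrame instead; the prices and volumes are invented for illustration and are not real market data.

```python
import pandas as pd

# Made-up prices and volumes, for illustration only (not real market data).
df = pd.DataFrame({
    "High":   [10.0, 11.0, 12.0],
    "Low":    [8.0, 9.0, 10.0],
    "Close":  [9.0, 10.0, 11.0],
    "Volume": [100, 200, 100],
})

# Typical price per row: the mean of high, low and close.
typical = (df["High"] + df["Low"] + df["Close"]) / 3

# Running volume, running volume * typical price, and their ratio.
df["Cum_Vol"] = df["Volume"].cumsum()
df["Cum_Vol_Price"] = (df["Volume"] * typical).cumsum()
df["VWAP"] = df["Cum_Vol_Price"] / df["Cum_Vol"]

print(df[["Cum_Vol", "Cum_Vol_Price", "VWAP"]])
```

Each VWAP value is the volume-weighted mean of all typical prices up to and including that row, so the first row's VWAP equals its own typical price.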
As an exercise, I am trying to find a way to code this without using cumsum(), ideally with a solution that produces the VWAP column in one pass. I have tried the line below, using .apply(). The logic is there, but the issue is that I am not able to store values in row n in order to use them in row (n+1). How do you approach this in pandas: do you just use an external tuple or dictionary for temporary storage of the cumulative values?
df['Cum_Vol'] = np.nan
df['Cum_Vol_Price'] = np.nan
# calculate running cumulatives by apply - assume df row index is 0 to N
df['Cum_Vol'] = df.apply(lambda x: df.iloc[x.name-1]['Cum_Vol'] + x['Volume'] if int(x.name)>0 else x['Volume'], axis=1)
Is there a one-pass solution to the above problem?
EDIT:
My main motivation is to understand what is happening under the hood, so this is more of an exercise than something with a practical reason. I believe each cumsum on a Series of size N has time complexity O(N) (?). So I was wondering whether, instead of running two separate cumsums, we can calculate both in one pass, along the lines of this. I am very happy to accept an answer to this question rather than working code.
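To make the "one pass" idea concrete, here is a minimal sketch in plain Python and NumPy. It keeps both running totals in local variables inside a single loop, which is one way to carry values from row n to row (n+1). The function name `vwap_one_pass` and the sample data are assumptions made for illustration; they are not from the original post.

```python
import numpy as np

def vwap_one_pass(volume, high, low, close):
    """Running VWAP from a single loop (hypothetical helper, for illustration)."""
    out = np.empty(len(volume), dtype=float)
    cum_vol = 0.0        # running sum of volume
    cum_vol_price = 0.0  # running sum of volume * typical price
    for i in range(len(volume)):
        typical = (high[i] + low[i] + close[i]) / 3
        cum_vol += volume[i]
        cum_vol_price += volume[i] * typical
        out[i] = cum_vol_price / cum_vol
    return out

# Made-up data, for illustration only.
volume = np.array([100.0, 200.0, 100.0])
high = np.array([10.0, 11.0, 12.0])
low = np.array([8.0, 9.0, 10.0])
close = np.array([9.0, 10.0, 11.0])

result = vwap_one_pass(volume, high, low, close)

# Reference: the two-cumsum formulation from the question.
reference = np.cumsum(volume * (high + low + close) / 3) / np.cumsum(volume)
print(np.allclose(result, reference))
```

Both formulations do O(N) work. An interpreted Python loop like this one is usually much slower than the vectorized cumsum calls on large inputs, so the sketch is meant to show the structure of a one-pass calculation rather than to be fast.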
Answer by JohnE
Distinguishing one pass from one line starts to get a little semantic. How about this for a distinction: you can do it with 1 line of pandas, 1 line of numpy, or several lines of numba.
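import numpy as np
import pandas as pd
from numba import jit
df = pd.DataFrame(np.random.randn(10000, 3), columns=['v', 'h', 'l'])
# 1 line of pandas
df['vwap_pandas'] = (df.v * (df.h + df.l) / 2).cumsum() / df.v.cumsum()
# several lines of numba: both running sums are updated in a single loop
@jit
def vwap():
    # v, h and l are the module-level arrays defined below, before the first call
    tmp1 = np.zeros_like(v)
    tmp2 = np.zeros_like(v)
    for i in range(0, len(v)):
        # at i == 0, index -1 reads the last element, which is still zero
        tmp1[i] = tmp1[i-1] + v[i] * (h[i] + l[i]) / 2.
        tmp2[i] = tmp2[i-1] + v[i]
    return tmp1 / tmp2
v = df.v.values
h = df.h.values
l = df.l.values
# 1 line of numpy
df['vwap_numpy'] = np.cumsum(v * (h + l) / 2) / np.cumsum(v)
df['vwap_numba'] = vwap()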
Timings:
%timeit (df.v*(df.h+df.l)/2).cumsum() / df.v.cumsum() # pandas
1000 loops, best of 3: 829 μs per loop
%timeit np.cumsum(v*(h+l)/2) / np.cumsum(v) # numpy
10000 loops, best of 3: 165 μs per loop
%timeit vwap() # numba
10000 loops, best of 3: 87.4 μs per loop
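The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module, as below. The input arrays are made up, and the numbers it prints will differ by machine and library version, so they should not be expected to match the timings quoted here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Made-up inputs; volumes are kept positive so the running sum never reaches zero.
v = rng.uniform(1.0, 100.0, 10_000)
h = rng.uniform(10.0, 20.0, 10_000)
l = h - rng.uniform(0.0, 1.0, 10_000)

def vwap_numpy():
    # Same one-line NumPy formulation as in the answer.
    return np.cumsum(v * (h + l) / 2) / np.cumsum(v)

# Time 3 runs of 200 calls each and keep the fastest run, similar in spirit to "best of 3".
number = 200
best = min(timeit.repeat(vwap_numpy, number=number, repeat=3)) / number
print(f"{best * 1e6:.1f} microseconds per call")
```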
Answer by Ran Aroussi
Quick Edit: Just wanted to thank John for the original post :)
You can get even faster results by @jit-ing numpy's version:
@jit
def np_vwap():
    return np.cumsum(v*(h+l)/2) / np.cumsum(v)
This got me 50.9 μs per loop, as opposed to 74.5 μs per loop using the vwap version above.

