pandas: calculate max drawdown with a vectorized solution in Python

Note: this content is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/36750571/


Calculate max draw down with a vectorized solution in python

python, numpy, pandas, quantitative-finance

Asked by piRSquared

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.

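For intuition, a minimal numeric sketch (hypothetical numbers): an equity curve that goes 100 → 120 → 90 → 110 has a worst peak-to-trough decline of 90 / 120 - 1 = -25%, so its max drawdown is -25%.

import pandas as pd

# hypothetical equity curve: 100 -> 120 -> 90 -> 110
equity = pd.Series([100.0, 120.0, 90.0, 110.0])
drawdown = equity.div(equity.cummax()).sub(1)  # decline from the running peak
print(drawdown.min())  # -0.25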

Recently, I became impatient with how long it takes to calculate max drawdown using my looped approach.


def max_dd_loop(returns):
    """returns is assumed to be a pandas Series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
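For reference, a hypothetical usage sketch (the values in the comments follow from the numbers chosen):

import pandas as pd

returns = pd.Series([0.10, -0.05, 0.20, -0.30, 0.15])
mdd, start, end = max_dd_loop(returns)
# mdd == -0.30, start == 2 (the peak), end == 3 (the trough):
# the single -30% move right after the running peak is the worst stretch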

I'm familiar with the common perception that a vectorized solution would be better.


The questions are:


  • can I vectorize this problem?
  • What does this solution look like?
  • How beneficial is it?


Edit


I modified Alexander's answer into the following function:


def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
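A quick sanity check on hypothetical numbers that this agrees with the looped version:

import pandas as pd

returns = pd.Series([0.10, -0.05, 0.20, -0.30, 0.15])
print(max_dd(returns))       # (-0.30..., 2, 3)
print(max_dd_loop(returns))  # same result, far more slowly on long series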

Answer by Alexander

df_returns is assumed to be a DataFrame of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).


# cumulative wealth index: growth of 1 unit invested at inception
cum_returns = (1 + df_returns).cumprod()
# drawdown from the running peak, expressed here as a positive decline
drawdown = 1 - cum_returns.div(cum_returns.cummax())
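To reduce the running drawdown to a single max-drawdown figure per column, one more reduction suffices; a minimal sketch (under this sign convention drawdowns are positive, so the worst one is the maximum):

max_drawdowns = drawdown.max()  # worst peak-to-trough decline per column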

Answer by Stefan

I had first suggested using an .expanding() window, but that's obviously not necessary with the .cumprod() and .cummax() built-ins to calculate max drawdown up to any given point:


import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
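Note that the max_drawdown column holds the drawdown as of each date; the worst drawdown experienced up to each date would be its running minimum, e.g. (a hypothetical extra column):

df['max_drawdown_to_date'] = df.max_drawdown.cummin()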


            returns  cumulative_return  max_drawdown
2016-01-01 -0.014522          -0.014522      0.000000
2016-01-02 -0.022769          -0.036960     -0.022769
2016-01-03  0.026735          -0.011214      0.000000
2016-01-04  0.054129           0.042308      0.000000
2016-01-05 -0.017562           0.024004     -0.017562
2016-01-06  0.055254           0.080584      0.000000
2016-01-07  0.023135           0.105583      0.000000
2016-01-08 -0.072624           0.025291     -0.072624
2016-01-09 -0.055799          -0.031919     -0.124371
2016-01-10  0.129059           0.093020     -0.011363
2016-01-11  0.056123           0.154364      0.000000
2016-01-12  0.028213           0.186932      0.000000
2016-01-13  0.026914           0.218878      0.000000
2016-01-14 -0.009160           0.207713     -0.009160
2016-01-15 -0.017245           0.186886     -0.026247
2016-01-16  0.003357           0.190869     -0.022979
2016-01-17 -0.009284           0.179813     -0.032050
2016-01-18 -0.027361           0.147533     -0.058533
2016-01-19 -0.058118           0.080841     -0.113250
2016-01-20 -0.049893           0.026914     -0.157492
2016-01-21 -0.013382           0.013173     -0.168766
2016-01-22 -0.020350          -0.007445     -0.185681
2016-01-23 -0.085842          -0.092648     -0.255584
2016-01-24  0.022406          -0.072318     -0.238905
2016-01-25  0.044079          -0.031426     -0.205356
2016-01-26  0.045782           0.012917     -0.168976
2016-01-27 -0.018443          -0.005764     -0.184302
2016-01-28  0.021461           0.015573     -0.166797
2016-01-29 -0.062436          -0.047836     -0.218819
2016-01-30 -0.013274          -0.060475     -0.229189
...              ...                ...           ...
2018-08-28  0.002124           0.559122     -0.478738
2018-08-29 -0.080303           0.433921     -0.520597
2018-08-30 -0.009798           0.419871     -0.525294
2018-08-31 -0.050365           0.348359     -0.549203
2018-09-01  0.080299           0.456631     -0.513004
2018-09-02  0.013601           0.476443     -0.506381
2018-09-03 -0.009678           0.462153     -0.511158
2018-09-04 -0.026805           0.422960     -0.524262
2018-09-05  0.040832           0.481062     -0.504836
2018-09-06 -0.035492           0.428496     -0.522411
2018-09-07 -0.011206           0.412489     -0.527762
2018-09-08  0.069765           0.511031     -0.494817
2018-09-09  0.049546           0.585896     -0.469787
2018-09-10 -0.060201           0.490423     -0.501707
2018-09-11 -0.018913           0.462235     -0.511131
2018-09-12 -0.094803           0.323611     -0.557477
2018-09-13  0.025736           0.357675     -0.546088
2018-09-14 -0.049468           0.290514     -0.568542
2018-09-15  0.018146           0.313932     -0.560713
2018-09-16 -0.034118           0.269104     -0.575700
2018-09-17  0.012191           0.284576     -0.570527
2018-09-18 -0.014888           0.265451     -0.576921
2018-09-19  0.041180           0.317562     -0.559499
2018-09-20  0.001988           0.320182     -0.558623
2018-09-21 -0.092268           0.198372     -0.599348
2018-09-22 -0.015386           0.179933     -0.605513
2018-09-23 -0.021231           0.154883     -0.613888
2018-09-24 -0.023536           0.127701     -0.622976
2018-09-25  0.030160           0.161712     -0.611605
2018-09-26  0.025528           0.191368     -0.601690

Answer by piRSquared

Given a time series of returns, we need to evaluate the aggregate return for every combination of starting and ending points.


The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.

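A minimal sketch of this first trick (hypothetical numbers):

import pandas as pd

returns = pd.Series([0.10, -0.05, 0.20])
r = returns.add(1).cumprod()      # return index: 1.10, 1.045, 1.254
print(r.iloc[2] / r.iloc[0] - 1)  # 0.14, the return from after period 0 through period 2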

The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.


The third trick is to take the matrix product of r * (1 / r).Transpose.


r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.


The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.

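Taken together, the four tricks can also be sketched in pure NumPy; max_dd_numpy below is a hypothetical helper of my own (returning positions rather than index labels), with a triangular mask enforcing the start < end constraint:

import numpy as np

def max_dd_numpy(returns):
    r = np.cumprod(1 + np.asarray(returns))   # series of return indices
    ratios = r[None, :] / r[:, None] - 1      # ratios[start, end] = r_end / r_start - 1
    valid = np.triu(np.ones_like(ratios, dtype=bool), k=1)  # keep start < end only
    ratios = np.where(valid, ratios, np.inf)
    start, end = np.unravel_index(np.argmin(ratios), ratios.shape)
    return ratios[start, end], start, end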

Below is my vectorized function.


import numpy as np
import pandas as pd

def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()

    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'

    # let's make sure we only calculate a return when start
    # is less than end.
    y = x.stack().reset_index()
    y = y[y.start < y.end]

    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    start, end = z.idxmin()
    return z.min(), start, end

How does this perform?


For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The time it took is below:


10:   0.032 seconds
50:   0.044 seconds
100:  0.055 seconds
150:  0.082 seconds
200:  0.047 seconds

The same test for the looped solution is below:


10:   0.153 seconds
50:   3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
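For reproducibility, a sketch of a comparable timing harness (my own assumption of how the test was run, not the original benchmark code):

import timeit
import numpy as np
import pandas as pd

for n in [10, 50, 100, 150, 200]:
    s = pd.Series(np.random.normal(0.001, 0.05, n))
    print(n, timeit.timeit(lambda: max_dd_loop(s), number=10))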


Edit


Alexander's answer provides superior results. The same test using the modified code:


10:   0.000 seconds
50:   0.000 seconds
100:  0.004 seconds
150:  0.007 seconds
200:  0.008 seconds

I modified his code into the following function:


def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end

Answer by xicocaio

I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.


import pandas as pd

def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()

    # positional index of the running peak observed so far
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = max_peaks_idx

    # net worth at the prevailing peak, aligned to each date
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)

    # drawdown relative to the prevailing peak
    df['dd'] = (df.nw - nw_peaks) / nw_peaks

    # running minimum (worst) drawdown within each peak-to-peak interval
    df['mdd'] = df.groupby('max_peaks_idx').dd.cummin().fillna(0)

    return df
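A hypothetical usage sketch, feeding in a net-worth series directly (no cumprod needed here, as noted above):

import numpy as np

networth = 10000 * np.cumprod(1 + np.random.normal(0.001, 0.03, 250))
print(calc_MDD(networth).head())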

Here is a sample after running this code:


        nw      max_peaks_idx       dd          mdd
0   10000.000       0           0.000000    0.000000
1   9696.948        0           -0.030305   -0.030305
2   9538.576        0           -0.046142   -0.046142
3   9303.953        0           -0.069605   -0.069605
4   9247.259        0           -0.075274   -0.075274
5   9421.519        0           -0.057848   -0.075274
6   9315.938        0           -0.068406   -0.075274
7   9235.775        0           -0.076423   -0.076423
8   9091.121        0           -0.090888   -0.090888
9   9033.532        0           -0.096647   -0.096647
10  8947.504        0           -0.105250   -0.105250
11  8841.551        0           -0.115845   -0.115845

And here is the result applied to the complete dataset.

[plot: dd and mdd over the full time series]

Although vectorized, this code is probably slower than the others, because each time series may contain many peaks, and each peak interval requires its own calculation, so the cost is O(n_peaks * n_intervals).


PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time series.
