Note: this page reproduces a popular Stack Overflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/23630162/

Efficient way to process pandas DataFrame timeseries with Numba

Tags: python, python-2.7, pandas, numba

Asked by JasonEdinburgh

I have a DataFrame with 1,500,000 rows. It's one-minute-level stock market data (Open, High, Low, Close, Volume) that I bought from QuantQuote.com. I'm trying to run some home-made backtests of stock market trading strategies. Plain Python code to process the transactions is too slow, and I wanted to try Numba to speed things up. The trouble is that Numba doesn't seem to work with pandas functions.

Google searches uncover a surprising lack of information about using Numba with pandas, which makes me wonder if I'm making a mistake by considering it.

My setup is Numba 0.13.0-1, Pandas 0.13.1-1, Windows 7, MS VS2013 with PTVS, Python 2.7, and Enthought Canopy.

My existing Python+Pandas inner loop has the following general structure:

  • Compute "indicator" columns (with pd.ewma, pd.rolling_max, pd.rolling_min, etc.)
  • Compute "event" columns for predetermined events such as moving average crosses, new highs, etc.

I then use DataFrame.iterrows to process the DataFrame.

I've tried various optimizations, but it's still not as fast as I would like, and the optimizations are causing bugs.

I want to use numba to process the rows. Are there preferred methods of approaching this?

Because my DataFrame is really just a rectangle of floats, I was considering using something like DataFrame.values to get access to the data and then writing a series of functions that use Numba to access the rows. But that removes all the timestamps, and I don't think it is a reversible operation. I'm also not sure if the values matrix that I get from DataFrame.values is guaranteed to not be a copy of the data.

Any help is greatly appreciated.

Answered by Peque

Numba is a NumPy-aware just-in-time compiler. You can pass NumPy arrays as parameters to your Numba-compiled functions, but not Pandas series.

Your only option, still as of 2017-06-27, is to use the Pandas series values, which are actually NumPy arrays.
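
As a rough sketch of that idea (not code from the original answer): the example below pulls the NumPy array out of a Series, runs a Numba-compiled loop over it, and then reattaches the timestamps by reusing the original index. The `running_max` function and the sample prices are made up for illustration, and the `try`/`except` fallback is only there so the sketch still runs where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python if Numba is missing
    def njit(func):
        return func


@njit
def running_max(values):
    """Running maximum of a 1-D float array, written as an explicit loop."""
    n = values.shape[0]
    out = np.empty(n, dtype=np.float64)
    current = values[0]
    for i in range(n):
        if values[i] > current:
            current = values[i]
        out[i] = current
    return out


# Hypothetical one-minute closes; real data would come from the DataFrame.
index = pd.date_range("2014-01-02 09:30", periods=5, freq="min")
close = pd.Series([10.0, 10.5, 10.2, 10.8, 10.6], index=index, name="close")

# Pass the NumPy array to the compiled function, then reattach the timestamps.
result = pd.Series(running_max(close.values), index=close.index, name="running_max")
print(result)
```

Because the result is wrapped with `close.index`, the timestamps are not lost: the array only needs to keep the same row order as the Series it came from.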

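Also, you ask if the values are "guaranteed to not be a copy of the data". In the simple single-dtype case below they are not a copy, although pandas does not guarantee this in general (a mixed-dtype DataFrame, for example, returns a copy, and newer pandas versions with Copy-on-Write enabled return a read-only array). You can verify that: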

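import pandas

# Single-dtype DataFrame, so .values exposes the underlying NumPy array
df = pandas.DataFrame([0, 1, 2, 3])
df.values[2] = 8  # In-place write; assumes .values is a writable view
print(df)  # Should show you the value `8`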

In my opinion, Numba is a great (if not the best) approach to processing market data if you want to stick to Python only. If you want to see great performance gains, make sure to use @numba.jit(nopython=True). Note that this restricts the Python types you can use inside the JIT-compiled functions (plain dictionaries, for example, were not supported when this was written), but it will make the code run much faster.
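
A minimal sketch of what such a function might look like, assuming a toy long-only strategy that is not part of the original question: the loop only touches NumPy arrays and scalars, which is what nopython mode handles well. `njit` is Numba's shorthand for `jit(nopython=True)`; the `backtest` function, its arguments, and the sample data are hypothetical, and the fallback decorator exists only so the example runs without Numba.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:  # assumption: fall back to plain Python if Numba is missing
    def njit(func):
        return func


@njit
def backtest(close, signal):
    """Toy long-only backtest.

    Holds one unit over bar i-1 -> i whenever signal[i-1] == 1 and
    returns the cumulative profit after each bar.
    """
    n = close.shape[0]
    equity = np.zeros(n)
    for i in range(1, n):
        held = 1.0 if signal[i - 1] == 1 else 0.0
        equity[i] = equity[i - 1] + held * (close[i] - close[i - 1])
    return equity


# Hypothetical prices and a precomputed 0/1 signal column.
close = np.array([10.0, 11.0, 12.0, 11.0, 13.0])
signal = np.array([1, 1, 0, 0, 1])

equity = backtest(close, signal)
print(equity)
```

The first call includes compilation time, so any timing comparison against the pure-Python loop should be made on a second call.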

Note that some of the indicators you are working with may already have an efficient implementation in Pandas, so consider pre-computing them with Pandas and then passing the values (the NumPy arrays) to your Numba backtesting function.
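
One way this pre-computation step could look, as a sketch under stated assumptions: in current pandas the `pd.ewma` and `pd.rolling_max` functions mentioned in the question are written as the `.ewm(...).mean()` and `.rolling(...).max()` methods. The column names, window sizes, and sample prices below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute bars; column names are assumptions for illustration.
index = pd.date_range("2014-01-02 09:30", periods=6, freq="min")
df = pd.DataFrame({"close": [10.0, 10.2, 10.1, 10.6, 10.4, 10.9]}, index=index)

# Indicator columns computed by pandas (vectorised, no Python-level loop).
df["ema_fast"] = df["close"].ewm(span=2, adjust=False).mean()
df["roll_max"] = df["close"].rolling(window=3).max()

# Event column: 1 where the close equals the 3-bar rolling high, else 0.
# The first two rows have no full window (NaN), so they compare as False.
df["new_high"] = (df["close"] >= df["roll_max"]).astype(np.int64)

# Plain NumPy arrays, ready to hand to a Numba-compiled function.
close = df["close"].to_numpy()
new_high = df["new_high"].to_numpy()
print(close.dtype, new_high.dtype, len(close))
```

Keeping the indicator and event logic in pandas means only the row-by-row transaction loop, the part that `iterrows` made slow, needs to be rewritten for Numba.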
