Efficient ways to process a Pandas DataFrame time series with Numba

Question

Asked by JasonEdinburgh

I have a DataFrame with 1,500,000 rows. It's one-minute level stock market data (Open, High, Low, Close, Volume) that I bought from QuantQuote.com. I'm trying to run some home-made backtests of stock market trading strategies. Plain Python code to process the transactions is too slow, and I wanted to try numba to speed things up. The trouble is that numba doesn't seem to work with pandas functions.

Google searches uncover a surprising lack of information about using numba with pandas, which makes me wonder if I'm making a mistake by considering it.

My setup is Numba 0.13.0-1, Pandas 0.13.1-1, Windows 7, MS VS2013 with PTVS, Python 2.7, Enthought Canopy.

My existing Python+Pandas inner loop has the following general structure:

Compute "indicator" columns (with pd.ewma, pd.rolling_max, pd.rolling_min, etc.).
Compute "event" columns for predetermined events such as moving average crosses, new highs, etc.

I then use DataFrame.iterrows to process the DataFrame.

I've tried various optimizations but it's still not as fast as I would like. And the optimizations are causing bugs.

I want to use numba to process the rows. Are there preferred methods of approaching this?

Because my DataFrame is really just a rectangle of floats, I was considering using something like DataFrame.values to get access to the data and then write a series of functions that use numba to access the rows. But that removes all the timestamps and I don't think it is a reversible operation. I'm not sure if the values matrix that I get from DataFrame.values is guaranteed to not be a copy of the data.

Any help is greatly appreciated.

Answer 1

Answered by Peque

Numba is a NumPy-aware just-in-time compiler. You can pass NumPy arrays as parameters to your Numba-compiled functions, but not Pandas series.

Your only option, still as of 2017-06-27, is to use the Pandas series values, which are actually NumPy arrays.

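Also, you ask if the values are "guaranteed to not be a copy of the data". For a DataFrame that holds a single dtype, such as your rectangle of floats, they are typically not a copy (a DataFrame with mixed dtypes has to build a new array). You can verify the single-dtype case: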

import pandas


df = pandas.DataFrame([0, 1, 2, 3])
df.values[2] = 8
print(df)  # Should show you the value `8`
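
A way to check this without modifying the data is `numpy.shares_memory`. This is only a sketch; the frames and column names below are hypothetical. It compares two calls to `.values` on a single-dtype frame and on a mixed-dtype frame.

```python
import numpy as np
import pandas as pd

# Single dtype: all columns live in one float64 block.
floats = pd.DataFrame(np.zeros((4, 2)), columns=["open", "close"])
single_is_view = np.shares_memory(floats.values, floats.values)

# Mixed dtypes (float and int): pandas has to build a fresh array on each call.
mixed = pd.DataFrame({"close": [1.0, 2.0], "volume": [100, 200]})
mixed_is_view = np.shares_memory(mixed.values, mixed.values)

print(single_is_view, mixed_is_view)
```

If the second check matters for your data, keeping the columns you pass to Numba in one float dtype avoids the extra copy.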

In my opinion, Numba is a great (if not the best) approach to processing market data if you want to stick to Python only. If you want to see great performance gains, make sure to use @numba.jit(nopython=True) (note that this limits your use of dictionaries and other Python types inside the JIT-compiled functions, but makes the code run much faster).

Note that some of those indicators you are working with may already have an efficient implementation in Pandas, so consider pre-computing them with Pandas and then passing the values (the NumPy arrays) to your Numba backtesting function.
