Improving the performance of pandas groupby
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link back to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/47392758/
Asked by haroba
I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but when this led to abysmal performance I eventually rewrote it in vanilla Python, with for loops instead of vectorized operations, and lists and dicts instead of DataFrames and Series. To my surprise, the performance of the code written in vanilla Python ended up being far higher than that of the code written using Pandas.
As my hand-coded data processing code is substantially bigger and messier than the original Pandas code, I haven't quite given up on using Pandas, and I'm currently trying to optimize the Pandas code without much success.
The core of the data processing step consists of the following: I first divide the rows into several groups, as the data consists of several thousand time series (one for each "individual"), and I then do the same data processing on each group: a lot of summarization, combining different columns into new ones, etc.
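For concreteness, the hand-rolled version is roughly of this shape; this is only a sketch, the helper names are made up, and the per-group difference stands in for the various summaries I actually compute:

from collections import defaultdict

def group_rows(rows, key='pk'):
    # rows is a list of dicts; collect one list of rows per individual
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def diff_column(group, col):
    # difference with the previous row within a group, 0 for the first row
    values = [row[col] for row in group]
    return [0] + [b - a for a, b in zip(values, values[1:])]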
I profiled my code using Jupyter Notebook's lprun, and the bulk of the time is spent on the following and other similar lines:
grouped_data = data.groupby('pk')
data[[v + 'Diff' for v in val_cols]] = grouped_data[val_cols].transform(lambda x: x - x.shift(1)).fillna(0)
data[[v + 'Mean' for v in val_cols]] = grouped_data[val_cols].rolling(4).mean().shift(1).reset_index()[val_cols]
(...)
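(For reference, the line-by-line profile was gathered roughly like this; a sketch using the line_profiler extension, where process_data is a stand-in name for the function containing the lines above:)

%load_ext line_profiler
%lprun -f process_data process_data(data, val_cols)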
These lines are a mix of vectorized and non-vectorized processing. I understand that the non-vectorized operations won't be faster than my handwritten for loops, since that's basically what they are under the hood, but how can they be so much slower? We're talking about a performance degradation of 10-20x between my handwritten code and the Pandas code.
Am I doing something very, very wrong?
Answered by cs95
No, I don't think you should give up on pandas. There are definitely better ways to do what you're trying to do. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. They're basically implemented as for loops under the hood, so you might as well write the for loop yourself in plain Python; it does the same work without the pandas overhead, which is why your handwritten code came out faster.
The real speed gain comes from getting rid of the loops and using pandas functions that implicitly vectorise their operations. For example, your first line of code can be simplified greatly, as I'll show shortly.
In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.
Setup
import numpy as np
import pandas as pd

# 1000 rows, 10 groups ('pk'), and 100 value columns Val0..Val99
data = {'pk' : np.random.choice(10, 1000)}
data.update({'Val{}'.format(i) : np.random.randn(1000) for i in range(100)})
df = pd.DataFrame(data)
g = df.groupby('pk')
c = ['Val{}'.format(i) for i in range(100)]
transform + sub + shift → diff
Your first line of code can be replaced with a simple diff call:
v1 = df.groupby('pk')[c].diff().fillna(0)
Sanity Check
v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
np.allclose(v1, v2)
True
Performance
%timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
10 loops, best of 3: 44.3 ms per loop
%timeit df.groupby('pk')[c].diff().fillna(0)
100 loops, best of 3: 9.63 ms per loop
Removing redundant indexing operations
As far as your second line of code is concerned, I don't see too much room for improvement, although you can get rid of the reset_index() + [val_cols] call if your groupby statement does not treat pk as the index:
g = df.groupby('pk', as_index=False)
Your second line of code then reduces to:
v3 = g[c].rolling(4).mean().shift(1)
Sanity Check
g2 = df.groupby('pk')
v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]
np.allclose(v3.fillna(0), v4.fillna(0))
True
Performance
%timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
10 loops, best of 3: 46.5 ms per loop
%timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)
10 loops, best of 3: 41.7 ms per loop
Note that timings vary across machines, so make sure you test your code thoroughly to confirm there is indeed an improvement on your data.
While the difference this time isn't as large, it shows there are still improvements to be made, and they could have a much larger impact on bigger data.
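To check that at a more realistic scale, you can regenerate the setup data with more rows and more groups; a sketch with arbitrarily chosen sizes:

n = 100000                                        # more rows
big = {'pk' : np.random.choice(1000, n)}          # ~1000 individuals
big.update({'Val{}'.format(i) : np.random.randn(n) for i in range(100)})
big_df = pd.DataFrame(big)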
Afterword
In conclusion, most slow pandas operations can be sped up. The key is to get rid of any approach that does not use vectorization.
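As a quick illustration on the setup DataFrame above (a sketch, not a rigorous benchmark), compare a vectorized column combination against the equivalent row-wise apply:

# combining columns into a new one: vectorized arithmetic vs. row-wise apply
%timeit df['Val0'] + 2 * df['Val1']
%timeit df.apply(lambda row: row['Val0'] + 2 * row['Val1'], axis=1)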
To this end, it is sometimes beneficial to step out of pandas space and into numpy space. Operations on raw numpy arrays, or numpy functions used directly, tend to be much faster than their pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, np.where is faster than pd.DataFrame.where, and so on).
Sometimes, loops cannot be avoided. In that case, you can write a basic looping function and then compile it with numba or cython. Examples of that are in the Enhancing Performance section of the pandas docs, straight from the horse's mouth.
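A minimal sketch of that idea, assuming numba is installed; the function name is made up, and the per-group diff simply mirrors the first line of the question:

import numpy as np
import numba

@numba.njit
def group_diff(keys, values):
    # handwritten per-group diff, compiled to machine code by numba;
    # assumes the rows are sorted so that each group is contiguous
    out = np.zeros_like(values)
    for i in range(1, len(values)):
        if keys[i] == keys[i - 1]:
            out[i] = values[i] - values[i - 1]
    return out

s = df.sort_values('pk', kind='mergesort')     # stable sort keeps within-group order
v5 = group_diff(s['pk'].values, s['Val0'].values)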
In still other cases, your data is just too big to reasonably fit into numpy arrays. In that case, it may be time to give up and switch to dask or spark, both of which offer high-performance distributed computation frameworks for working with big data.