pandas: how to speed up pandas with cython (or numpy)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30270117/


How to speed up pandas with cython (or numpy)

python, numpy, pandas, cython

Asked by Alexander

I am trying to use Cython to speed up a Pandas DataFrame computation which is relatively simple: for each row in the DataFrame, add that row to itself and to all remaining rows in the DataFrame, sum each of the resulting rows across its columns, and yield the list of these sums. The series get shorter as the rows in the DataFrame are exhausted, and they are stored in a dictionary keyed on the index row number.


def foo(df):
    # For each row i: add row i to rows i..n-1, sum each result across
    # the columns, and store the list of sums keyed by i.
    vals = {i: (df.iloc[i, :] + df.iloc[i:, :]).sum(axis=1).values.tolist()
            for i in range(df.shape[0])}
    return vals

Aside from adding %%cython at the top of this function, does anyone have a recommendation on how I'd go about using cdefs to convert the DataFrame values to doubles and then cythonize this code?


Below is some dummy data:


>>> df

          A         B         C         D         E
0 -0.326403  1.173797  1.667856 -1.087655  0.427145
1 -0.797344  0.004362  1.499460  0.427453 -0.184672
2 -1.764609  1.949906 -0.968558  0.407954  0.533869
3  0.944205  0.158495 -1.049090 -0.897253  1.236081
4 -2.086274  0.112697  0.934638 -1.337545  0.248608
5 -0.356551 -1.275442  0.701503  1.073797 -0.008074
6 -1.300254  1.474991  0.206862 -0.859361  0.115754
7 -1.078605  0.157739  0.810672  0.468333 -0.851664
8  0.900971  0.021618  0.173563 -0.562580 -2.087487
9  2.155471 -0.605067  0.091478  0.242371  0.290887

and expected output:


>>> foo(df)

{0: [3.7094795101205236,
  2.8039983729106,
  2.013301815968468,
  2.24717712931852,
  -0.27313665495940964,
  1.9899718844711711,
  1.4927321304935717,
  1.3612155622947018,
  0.3008239883773878,
  4.029880107986906],

. . .

 6: [-0.72401524913338,
  -0.8555318173322499,
  -1.9159233912495635,
  1.813132728359954],
 7: [-0.9870483855311194, -2.047439959448434, 1.6816161601610844],
 8: [-3.107831533365748, 0.6212245862437702],
 9: [4.350280705853288]}

Answered by JohnE

If you're just trying to do it faster and not specifically using cython, I'd just do it in plain numpy (about 50x faster).


def numpy_foo(arr):
    # Same computation on the raw ndarray, skipping pandas' per-row
    # indexing and alignment overhead.
    vals = {i: (arr[i, :] + arr[i:, :]).sum(axis=1).tolist()
            for i in range(arr.shape[0])}
    return vals

%timeit foo(df)
100 loops, best of 3: 7.2 ms per loop

%timeit numpy_foo(df.values)
10000 loops, best of 3: 144 μs per loop

foo(df) == numpy_foo(df.values)
Out[586]: True

Generally speaking, pandas gives you a lot of conveniences relative to numpy, but they come with overhead costs. So in situations where pandas isn't really adding anything, you can generally speed things up by doing it in numpy. For another example, see this question I asked, which showed a roughly comparable speed difference (about 23x).

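If Cython itself is the goal, as the question's mention of cdefs suggests, a minimal typed-memoryview sketch might look like the following. This is an illustrative assumption, not part of the original answer: the name cython_foo and the explicit loops are mine, and it assumes an all-float DataFrame so that df.values yields a float64 array.

%%cython
def cython_foo(double[:, :] arr):
    # Typed sizes, indices, and accumulator let Cython compile the
    # inner arithmetic down to C; the dict/list plumbing stays Python.
    cdef Py_ssize_t n = arr.shape[0]
    cdef Py_ssize_t m = arr.shape[1]
    cdef Py_ssize_t i, j, k
    cdef double row_sum
    vals = {}
    for i in range(n):
        sums = []
        for j in range(i, n):
            row_sum = 0.0
            for k in range(m):
                row_sum += arr[i, k] + arr[j, k]
            sums.append(row_sum)
        vals[i] = sums
    return vals

Called as cython_foo(df.values) (or cython_foo(df.to_numpy(dtype=np.float64)) if any column might not already be float64), this should return the same dictionary as foo(df).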