pandas: Efficiently processing DataFrame rows with a Python function?

Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18282988/


Efficiently processing DataFrame rows with a Python function?

python numpy pandas

Asked by Dun Peal

In many places in our Pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(), taking each row, doing some processing, and returning a value, which we ultimately collect into a new Series.

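A minimal sketch of that pattern, with a hypothetical process function standing in for the real per-row logic:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(3), 'b': np.random.rand(3)})

def process(row):
    # hypothetical per-row logic; the real functions contain arbitrary Python
    return row['a'] + row['b']

# the pattern in question: iterate with iterrows() and collect into a new Series
result = pd.Series([process(row) for _, row in df.iterrows()], index=df.index)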

I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.


  1. What would be the best way to make this usage pattern as efficient as possible?
  2. Can we possibly do it without rewriting most of our code?

Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?


Answered by Viktor Kerkez

You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:


df.apply(your_function, axis=1)

Example:


>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
   a         b
0  0  0.880075
1  1  0.143038
2  2  0.795188
>>> def func(row):
...     return row['a'] + row['b']
...
>>> df.apply(func, axis=1)
0    0.880075
1    1.143038
2    2.795188
dtype: float64

As for the second part of the question: row-wise operations, even optimised ones using pandas apply, are not the fastest solution there is. They are certainly a lot faster than a Python for loop, but not the fastest. You can test that by timing the operations and you'll see the difference.

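For a rough sense of the gap, here is a small timing sketch (numbers will vary by machine; the process function and frame size are just for illustration):

import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame({'a': np.arange(10000), 'b': np.random.rand(10000)})

def process(row):
    return row['a'] + row['b']

# pure-Python loop over iterrows()
t_loop = timeit(lambda: pd.Series([process(r) for _, r in df.iterrows()],
                                  index=df.index), number=3)
# row-wise apply
t_apply = timeit(lambda: df.apply(process, axis=1), number=3)
# fully column-oriented / vectorized
t_vec = timeit(lambda: df['a'] + df['b'], number=3)

# the vectorized version is typically orders of magnitude faster
print(t_loop, t_apply, t_vec)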

Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases, or other logic that should be performed per row. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.

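Before reaching for Cython, it is sometimes worth checking whether the branching itself can be expressed column-wise, for example with np.where. A sketch with a made-up condition:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5), 'b': np.random.rand(5)})

def process(row):
    # hypothetical per-row logic with a branch
    if row['a'] % 2 == 0:
        return row['b'] * 2
    return row['b'] - 1

row_wise = df.apply(process, axis=1)

# the same logic, evaluated on whole columns at once
column_wise = pd.Series(np.where(df['a'] % 2 == 0, df['b'] * 2, df['b'] - 1),
                        index=df.index)

assert np.allclose(row_wise, column_wise)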

Or you can try numba. :)

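A minimal numba sketch, assuming numba is installed; the per-row logic moves into a compiled loop over the underlying NumPy arrays (process_arrays is a hypothetical name):

import numpy as np
import pandas as pd
from numba import njit

@njit
def process_arrays(a, b):
    # compiled element-wise loop; arbitrary per-element logic goes here
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out

df = pd.DataFrame({'a': np.arange(3), 'b': np.random.rand(3)})
result = pd.Series(process_arrays(df['a'].to_numpy(), df['b'].to_numpy()),
                   index=df.index)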