Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license, link to the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18282988/
Efficiently processing DataFrame rows with a Python function?
Asked by Dun Peal
In many places in our Pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(), taking each row, doing some processing, and returning a value, which we ultimately collect into a new Series.
I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.
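To make the pattern concrete, here is a minimal sketch of the loop being described; `process` here is a hypothetical stand-in for whatever per-row logic the real code runs:

```python
import numpy as np
import pandas as pd

# Hypothetical per-row function standing in for the real process(row).
def process(row):
    return row['a'] * 2 + row['b']

df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3.0)})

# Collect one value per row into a new Series -- this is the slow,
# pure-Python iteration pattern the question is about.
result = pd.Series({idx: process(row) for idx, row in df.iterrows()})
```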
- What would be the best way to make this usage pattern as efficient as possible?
- Can we possibly do it without rewriting most of our code?
Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?
Answered by Viktor Kerkez
You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:
df.apply(your_function, axis=1)
Example:
>>> df = pd.DataFrame({'a': np.arange(3),
                       'b': np.random.rand(3)})
>>> df
   a         b
0  0  0.880075
1  1  0.143038
2  2  0.795188
>>> def func(row):
        return row['a'] + row['b']
>>> df.apply(func, axis=1)
0    0.880075
1    1.143038
2    2.795188
dtype: float64
As for the second part of the question: row-wise operations, even optimized ones using pandas apply, are not the fastest solution there is. They are certainly a lot faster than a Python for loop, but not the fastest. You can verify that by timing the operations and you'll see the difference.
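As a rough illustration (absolute timings vary by machine and pandas version), the same per-row sum can be timed three ways; this is just a benchmark sketch, not part of the original answer:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10_000), 'b': np.random.rand(10_000)})

# Three ways to compute the same per-row sum, from slowest to fastest.
t_loop = timeit.timeit(
    lambda: pd.Series({i: row['a'] + row['b'] for i, row in df.iterrows()}),
    number=3)
t_apply = timeit.timeit(
    lambda: df.apply(lambda row: row['a'] + row['b'], axis=1),
    number=3)
t_vector = timeit.timeit(lambda: df['a'] + df['b'], number=3)

print(f"iterrows: {t_loop:.4f}s  apply: {t_apply:.4f}s  vectorized: {t_vector:.4f}s")
```

On a typical setup the vectorized version is orders of magnitude faster than the iterrows loop.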
Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases, or other logic that must be performed per row. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
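For the example above, the column-oriented rewrite might look like this; both expressions produce the same Series of values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3.0)})

# Row-wise apply, as shown earlier in the answer...
via_apply = df.apply(lambda row: row['a'] + row['b'], axis=1)

# ...and the equivalent column-oriented (vectorized) expression.
vectorized = df['a'] + df['b']
```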
Or you can try numba. :)

