Does Python pandas iterrows have performance issues?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24870953/

Does pandas iterrows have performance issues?

Tags: python, performance, pandas, iteration

Asked by KieranPC

I have noticed very poor performance when using iterrows from pandas.

Is this something that is experienced by others? Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

This discussion on GitHub led me to believe it is caused by mixing dtypes in the dataframe, but the simple example below shows the slowdown is present even when using a single dtype (float64). This takes 36 seconds on my machine:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i = 0
for rowindex, row in dfa.iterrows():
    i += 1
end = time.time()
print(end - start)

Why are vectorized operations like apply so much quicker? I imagine there must be some row-by-row iteration going on there too.

I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes. Thank you!

---Edit: simplified version of what I want to run has been added below---

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}

t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index)

#%% Define the optimization (must be defined before the loop that calls it)
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

#%% Iterate through, filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.loc[row_index] = optimize(t2info, row['number1'])

Accepted answer by Jeff

Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for the performance of various operations:

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

Using a custom cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS ALWAYS the first and best choice. However, there is a small set of cases which cannot be vectorized in obvious ways (mostly those involving a recurrence). Furthermore, on a smallish frame, other methods may be faster.

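To make the gap concrete, here is a minimal sketch (not part of the original answer; timings are illustrative and machine-dependent) computing the same sum of products with a vectorized expression and with iterrows on a frame the size of the question's example:

import pandas as pd
import numpy as np
import time

dfa = pd.DataFrame({'s1': np.random.randn(2000000),
                    's2': np.random.randn(2000000)})

start = time.time()
total_vec = (dfa['s1'] * dfa['s2']).sum()      # vectorized: a single pass in C
vec_time = time.time() - start

start = time.time()
total_loop = 0.0
for _, row in dfa.iterrows():                  # a new Series is built for every row
    total_loop += row['s1'] * row['s2']
loop_time = time.time() - start

print(vec_time, loop_time)                     # expect a gap of several orders of magnitude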

3) apply can usually be handled by an iterator in Cython space (this is done internally in pandas); that is case (a) above.

This depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly (of course df.sum(1) is even better). However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and is consequently slower.

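A rough sketch of that distinction (the frame and column names here are illustrative, not from the answer):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(100000),
                   'b': np.random.randn(100000)})

# Reduction: np.sum receives each whole column, so the heavy lifting happens in C
# (only one Python-level call per column).
col_sums = df.apply(np.sum)

# Python-space function applied row by row: the lambda runs once per row,
# which is dramatically slower.
b_plus_one_slow = df.apply(lambda row: row['b'] + 1, axis=1)

# The vectorized equivalent of the line above.
b_plus_one_fast = df['b'] + 1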

4) itertuples does not box the data into a Series; it just returns the row as a tuple.

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.

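A quick sketch of the cost of that boxing on a frame the size of the question's example (not from the original answer; timings are only indicative):

import pandas as pd
import numpy as np
import time

dfa = pd.DataFrame({'s1': np.random.randn(2000000),
                    's2': np.random.randn(2000000)})

start = time.time()
for row in dfa.itertuples():       # plain (named) tuples, no per-row Series
    pass
print('itertuples:', time.time() - start)

start = time.time()
for _, row in dfa.iterrows():      # each row is boxed into a Series
    pass
print('iterrows:  ', time.time() - start)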

6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. It is much better to create new structures and concat.

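A small illustrative sketch of the difference (the column names and row count are arbitrary):

import pandas as pd

rows = [[i, i * 2.0] for i in range(10000)]

# Slow: enlarging a DataFrame one row at a time; every assignment re-checks
# the index and may copy the underlying data.
slow = pd.DataFrame(columns=['a', 'b'])
for i, r in enumerate(rows):
    slow.loc[i] = r

# Fast: accumulate plain Python objects first, then build the frame (or concat) once.
fast = pd.DataFrame(rows, columns=['a', 'b'])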

Answered by chrisaycock

Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:

  • Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) perform calculations in C (often via Cython). The type of the array is determined only at the start of the iteration; this savings alone is one of the biggest wins.

  • Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) are going to have few cache misses.

  • More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits to parallelism won't be as dramatic as the static typing and better caching, but they're still a solid win.

Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)

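A minimal illustration of that point (not from the original answer; timings depend on your machine):

import numpy as np
import time

a = np.random.randn(2000000)
b = np.random.randn(2000000)

start = time.time()
c_vec = a + b                          # one typed C loop over contiguous memory
print('vectorized:', time.time() - start)

start = time.time()
c_py = [x + y for x, y in zip(a, b)]   # per-element boxing and type dispatch in Python
print('python loop:', time.time() - start)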

Answered by Jeff

Here's how to solve your problem. This is all vectorized.

In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1

Answered by Polor Beer

Another option is to use to_records(), which is faster than both itertuples and iterrows.

But for your case, there is much room for other types of improvements.

Here's my final optimized version:

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))


Benchmark test:

-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuple() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0


Full code:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuple() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Even Faster --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)

The final version is almost 10x faster than the original code. The strategy is:

  1. Use groupby to avoid repeatedly comparing values.
  2. Use to_records to access raw numpy.records objects (see the short sketch after this list).
  3. Don't operate on the DataFrame until you have compiled all the data.
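
For reference, a quick look at what to_records() returns also explains the [1:] slice in the code above (a small sketch built from the answer's table2):

import pandas as pd

table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'c', 'd', 'c'],
                       'number2': [0.2, 0.5, 0.1, 0.4, 5, 4, 1]})

recs = table2.to_records()
print(recs.dtype.names)    # ('index', 'letter', 'number2') -- the index field comes first
print(recs[0].tolist())    # (0, 'a', 0.2) -- hence [1:] to drop the index column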

Answered by Vandana Sharma

Yes, pandas itertuples() is faster than iterrows(). You can refer to the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

"To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows."

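A tiny sketch of that dtype difference (the exact scalar types printed may vary slightly between pandas versions):

import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# iterrows boxes the row into a single-dtype Series, so the int column is upcast.
_, row = next(df.iterrows())
print(type(row['n']))      # <class 'numpy.float64'>

# itertuples returns namedtuples and keeps each column's own dtype.
row = next(df.itertuples())
print(type(row.n))         # <class 'numpy.int64'>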