Pandas: peculiar performance drop for inplace rename after dropna

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22532302/

Time: 2020-09-13 21:49:59  Source: igfitidea

Tags: python, performance, pandas, in-place

Asked by eldad-a

I have reported this as an issue on the pandas issue tracker. In the meantime, I am posting it here in the hope of saving others time in case they run into something similar.

While profiling a process that needed to be optimized, I found that renaming columns without inplace=True improves performance (execution time) by a factor of 120. Profiling indicates that this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example demonstrates a factor of 12:

import pandas as pd
import numpy as np

inplace=True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

First output line of %%prun:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.018    0.018    0.018    0.018  {gc.collect}
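
Outside IPython, one way to check whether a garbage collection runs during the rename is to count collections with the standard gc.callbacks hook. This is only a sketch: count_collections and make_frame are hypothetical helper names, the frame is a simplified stand-in for the example above, and the count depends on the pandas version (it may well be 0 on recent releases) and also includes any automatic collections that happen to run.

```python
import gc

import numpy as np
import pandas as pd


def count_collections(func):
    """Run func() and return how many garbage collections started meanwhile."""
    started = []

    def on_gc(phase, info):
        # The interpreter calls this hook with phase "start" or "stop".
        if phase == "start":
            started.append(info["generation"])

    gc.callbacks.append(on_gc)
    try:
        func()
    finally:
        gc.callbacks.remove(on_gc)
    return len(started)


def make_frame():
    # Simplified stand-in (assumption): one NaN so that dropna() removes a row.
    rng = np.random.default_rng(0)
    frame = pd.DataFrame(rng.random((7, 3)), columns=range(3))
    frame.iloc[0, 0] = np.nan
    return frame.dropna()


df = make_frame()
new_names = {col: 'd{}'.format(col) for col in df.columns}
n = count_collections(lambda: df.rename(columns=new_names, inplace=True))
print('collections during inplace rename:', n)
```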

inplace=False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
#no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 μs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 μs per loop
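
For anyone who wants to rerun the four combinations as a plain script rather than with %%timeit, here is a rough sketch using the standard timeit module. build_diff and rename_cols are hypothetical helper names, and t.copy() is my addition so that df1 certainly keeps its original index labels when t is modified. Timings depend on the pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd


def build_diff(use_dropna=True):
    # Same construction as in the question, wrapped in a function.
    np.random.seed(0)
    r, c = 7, 3
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t.copy())
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    df = df1 - df2
    return df.dropna() if use_dropna else df


def rename_cols(use_dropna, inplace):
    df = build_diff(use_dropna)
    names = {col: 'd{}'.format(col) for col in df.columns}
    if inplace:
        df.rename(columns=names, inplace=True)
    else:
        df = df.rename(columns=names)
    return df


for use_dropna in (True, False):
    for inplace in (True, False):
        # Best of 3 repeats, 50 calls each, reported per call in ms.
        best = min(timeit.repeat(lambda: rename_cols(use_dropna, inplace),
                                 number=50, repeat=3))
        print('dropna={!s:5} inplace={!s:5} {:.3f} ms per loop'.format(
            use_dropna, inplace, best / 50 * 1e3))
```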

Answered by Jeff

This is a copy of the explanation on GitHub.

There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference then reassigned.
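
As a small illustration that the two spellings produce the same result (the toy frame and variable names are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
names = {col: 'd{}'.format(col) for col in df.columns}

# Without inplace: a new DataFrame is returned, df itself is untouched.
renamed = df.rename(columns=names)

# With inplace=True: the object is modified and the call returns None.
in_place = df.copy()
result = in_place.rename(columns=names, inplace=True)

print(result)                    # None
print(renamed.equals(in_place))  # True
```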

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).

This check must perform a garbage collection to wipe out some cache references to see if it's a copy. Unfortunately, Python syntax makes this unavoidable.

You can prevent this from happening by simply making a copy first:

df = (df1-df2).dropna().copy()

An inplace operation that follows this copy will be as performant as before.
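
Put together with the example from the question, the workaround would look roughly like this. It is a sketch only: t.copy() is added here so that df1 certainly keeps its original index labels, and whether the extra copy changes the timing depends on the pandas version.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
r, c = 7, 3
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t.copy())
indx = np.random.choice(range(r), r // 3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)

# Explicit copy: df is now an independent frame, not a possible slice.
df = (df1 - df2).dropna().copy()

# The inplace rename then operates on that independent frame.
df.rename(columns={col: 'd{}'.format(col) for col in df.columns}, inplace=True)
print(df.columns.tolist())  # ['d0', 'd1', 'd2']
```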

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
