pandas: Why does concatenation of DataFrames get exponentially slower?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36489576/
Why does concatenation of DataFrames get exponentially slower?
Asked by jfive
I have a function which processes a DataFrame, largely to process data into buckets and create a binary matrix of features from a particular column using pd.get_dummies(df[col]).
To avoid processing all of my data with this function at once (which runs out of memory and causes iPython to crash), I have broken the large DataFrame into chunks using:
import numpy as np  # assumed import for array_split

chunks = (len(df) / 10000) + 1  # Python 2 integer division; under Python 3 use // instead
df_list = np.array_split(df, chunks)
pd.get_dummies(df) will automatically create new columns based on the contents of df[col], and these are likely to differ for each df in df_list.
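For illustration, here is a minimal sketch (with made-up values, not from the question) of why the dummy columns can differ from chunk to chunk: each chunk only sees its own values of col.

import pandas as pd

chunk_a = pd.DataFrame({"col": ["red", "blue"]})
chunk_b = pd.DataFrame({"col": ["blue", "green"]})

# Each chunk yields its own set of dummy columns:
print(pd.get_dummies(chunk_a["col"]).columns.tolist())  # ['blue', 'red']
print(pd.get_dummies(chunk_b["col"]).columns.tolist())  # ['blue', 'green']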
After processing, I am concatenating the DataFrames back together using:
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print datetime.datetime.utcnow()
The processing time of the first chunk is perfectly acceptable; however, it grows with every chunk! This is not due to preprocess_data(df_chunk), as there is no reason for its time to increase. Is this increase in time occurring as a result of the call to pd.concat()?
Please see log below:
chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365
Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!
Open to any other suggestions in Python!
Answered by unutbu
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):
super_x = pd.concat([super_x, x], axis=0)
| iteration | size of old super_x | size of x | copying required |
|-----------|---------------------|-----------|------------------|
| 0         | 0                   | 1         | 1                |
| 1         | 1                   | 1         | 2                |
| 2         | 2                   | 1         | 3                |
| ...       | ...                 | ...       | ...              |
| N-1       | N-1                 | 1         | N                |
Since 1 + 2 + 3 + ... + N = N(N+1)/2, O(N**2) copies are required to complete the loop.
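To make that concrete for the 2900 chunks mentioned in the question, here is a quick back-of-the-envelope count (measuring in units of one chunk's worth of data):

N = 2900
print(N * (N + 1) // 2)  # 4206450 unit copies with concat inside the loop
print(N)                 # 2900 unit copies with a single concat at the end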
Now consider
super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)
Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
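The difference is easy to reproduce. Below is a small self-contained benchmark sketch (written for Python 3; the sizes are arbitrary and the timings will vary by machine and pandas version) contrasting the two approaches:

import time
import pandas as pd

def concat_in_loop(n, rows=1000):
    # O(N**2): every iteration copies the whole accumulated frame.
    acc = pd.DataFrame()
    for _ in range(n):
        acc = pd.concat([acc, pd.DataFrame({"a": range(rows)})], axis=0)
    return acc

def concat_once(n, rows=1000):
    # O(N): collect the chunks in a list and copy everything exactly once.
    parts = [pd.DataFrame({"a": range(rows)}) for _ in range(n)]
    return pd.concat(parts, axis=0)

for fn in (concat_in_loop, concat_once):
    start = time.time()
    fn(500)
    print(fn.__name__, "took", round(time.time() - start, 2), "seconds")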
Answered by Alexander
Every time you concatenate, you are returning a copy of the data.
You want to keep a list of your chunks, and then concatenate everything as the final step.
df_x = []
df_y = []
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    df_x.append(x)
    df_y.append(y)

super_x = pd.concat(df_x, axis=0)
del df_x  # Free-up memory.
super_y = pd.concat(df_y, axis=0)
del df_y  # Free-up memory.
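One caveat worth adding (my note, not part of the original answer): because pd.get_dummies is applied per chunk, different chunks can end up with different dummy columns. pd.concat takes the union of those columns and fills the gaps with NaN, so for a binary feature matrix you would likely want zeros there instead, e.g. by building super_x as:

# Sketch: fill the NaN gaps left by mismatched dummy columns with 0
# so the feature matrix stays binary (assumes purely 0/1 columns).
super_x = pd.concat(df_x, axis=0).fillna(0).astype(int)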