pandas: Why does concatenation of DataFrames get exponentially slower?

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/36489576/

Posted: 2020-09-14 01:00:50 · Source: igfitidea

Why does concatenation of DataFrames get exponentially slower?

Tags: python, performance, pandas, concatenation, processing-efficiency

Asked by jfive

I have a function which processes a DataFrame, largely to process data into buckets and create a binary matrix of features in a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data with this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:

chunks = (len(df) / 10000) + 1
df_list = np.array_split(df, chunks)

pd.get_dummies(df) will automatically create new columns based on the contents of df[col], and these are likely to differ for each df in df_list.

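As a minimal, self-contained illustration of what that implies (hypothetical toy data, not the asker's): when two chunks contain different categories, their dummy matrices have different columns, and pd.concat aligns them by column name, filling the gaps with NaN, so a .fillna(0) afterwards restores a 0/1 matrix:

import pandas as pd

# Hypothetical toy chunks: the categories present differ from chunk to chunk.
chunk1 = pd.DataFrame({"col": ["a", "b"]})
chunk2 = pd.DataFrame({"col": ["b", "c"]})

d1 = pd.get_dummies(chunk1["col"])  # columns: a, b
d2 = pd.get_dummies(chunk2["col"])  # columns: b, c

# pd.concat takes the union of the column names and fills the gaps with NaN;
# fillna(0) turns the result back into a 0/1 matrix.
combined = pd.concat([d1, d2], axis=0).fillna(0).astype(int)
print(combined)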

After processing, I am concatenating the DataFrames back together using:

for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print datetime.datetime.utcnow()

The processing time of the first chunk is perfectly acceptable; however, it grows with each chunk! This has nothing to do with preprocess_data(df_chunk), as there is no reason for that to take longer. Is this increase in time occurring as a result of the call to pd.concat()?

Please see log below:

chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693 
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!

Open to any other suggestions in Python!

Answered by unutbu

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                     |           |                  |
|       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2, so O(N**2) copies are required to complete the loop.

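To put that in perspective with the question's numbers: for roughly N = 2900 chunks this is 2900 × 2901 / 2 ≈ 4.2 million chunk-sized copies, versus about 2900 copies for the list-based approach shown below.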

Now consider

super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.

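A rough way to see the difference in practice is to time both patterns on synthetic data (a sketch with arbitrary sizes, not from the original answer; absolute timings will vary by machine and pandas version):

import time
import numpy as np
import pandas as pd

# Arbitrary synthetic pieces standing in for the preprocessed chunks.
pieces = [pd.DataFrame(np.random.randn(1000, 10)) for _ in range(500)]

# Quadratic pattern: the growing DataFrame is re-copied on every iteration.
start = time.time()
result = pd.DataFrame()
for piece in pieces:
    result = pd.concat([result, piece], axis=0)
print("concat inside the loop: %.2fs" % (time.time() - start))

# Linear pattern: collect the pieces in a list and concatenate once at the end.
start = time.time()
result = pd.concat(pieces, axis=0)
print("single concat at the end: %.2fs" % (time.time() - start))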

Answered by Alexander

Every time you concatenate, you are returning a copy of the data.

You want to keep a list of your chunks, and then concatenate everything as the final step.

df_x = []
df_y = []
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    df_x.append(x)
    df_y.append(y)

super_x = pd.concat(df_x, axis=0)
del df_x  # Free-up memory.
super_y = pd.concat(df_y, axis=0)
del df_y  # Free-up memory.