Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/29806936/
Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?
Asked by sfortney
I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds to fix this (concatenating in smaller batches with for loops, etc.; see the sketch after the questions below), but I still cannot get these to concatenate. Two questions spring to mind:
1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().
2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?
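For reference, the batching workaround mentioned above looked roughly like the following (a sketch only; the function name and batch size are illustrative, and this still exhausted memory):

import pandas as pd

def concat_in_batches(frames, batch_size=100):
    # concatenate the list in smaller batches, then concatenate the batch
    # results; batch_size here is arbitrary, not from the original post
    batches = [pd.concat(frames[i:i + batch_size], join='outer', ignore_index=True)
               for i in range(0, len(frames), batch_size)]
    return pd.concat(batches, join='outer', ignore_index=True)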
I should also note that I do not think the problem is an explosion of columns, as concatenating 100 of the dataframes together gives about 3000 columns, whereas the base dataframe has about 1000.
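A quick way to check the size of that column union directly (a sketch; datalist4 is the list of frames from the example below):

# count the distinct column labels an outer-join concat would produce
all_cols = set()
for df in datalist4:
    all_cols.update(df.columns)
print len(all_cols)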
Edit:
The data I am working with is financial data about 1000 columns wide and about 50,000 rows deep for each of my 900 dataframes. The types of data going across left to right are:
- date in string format,
- string,
- np.float,
- int,
... and so on repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 will not be discarded but shunted off to the side.
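To illustrate what "shunted off to the side" means here, a tiny toy example (df1 and df2 below are made up, not my real data):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# join='outer' keeps the union of columns: 'c' is not discarded, it is
# NaN-filled for the rows that came from df1 (and likewise 'b' for df2)
print pd.concat([df1, df2], join='outer', axis=0, ignore_index=True)
#    a    b    c
# 0  1  3.0  NaN
# 1  2  4.0  NaN
# 2  5  NaN  7.0
# 3  6  NaN  8.0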
Example:
# example code
data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)

# two example dataframes (about 90% of the column names should be in common
# between the two dataframes; the unnamed columns etc. are not a significant
# number of the columns)
print datalist4[0].head()
                 800_1     800_2   800_3  800_4                900_1     900_2
0  2014-08-06 09:00:00  BEST_BID  1117.1    103  2014-08-06 09:00:00  BEST_BID
1  2014-08-06 09:00:00  BEST_ASK  1120.0    103  2014-08-06 09:00:00  BEST_ASK
2  2014-08-06 09:00:00  BEST_BID  1106.9     11  2014-08-06 09:00:00  BEST_BID
3  2014-08-06 09:00:00  BEST_ASK  1125.8     62  2014-08-06 09:00:00  BEST_ASK
4  2014-08-06 09:00:00  BEST_BID  1117.1    103  2014-08-06 09:00:00  BEST_BID

    900_3  900_4               1000_1    1000_2 ...  2400_4
0  1017.2    103  2014-08-06 09:00:00  BEST_BID ...     NaN
1  1020.1    103  2014-08-06 09:00:00  BEST_ASK ...     NaN
2  1004.3     11  2014-08-06 09:00:00  BEST_BID ...     NaN
3  1022.9     11  2014-08-06 09:00:00  BEST_ASK ...     NaN
4  1006.7     10  2014-08-06 09:00:00  BEST_BID ...     NaN

                      _1   _2   _3   _4                   _1.1 _2.1 _3.1 _4.1
0  #N/A Invalid Security  NaN  NaN  NaN  #N/A Invalid Security  NaN  NaN  NaN
1                    NaN  NaN  NaN  NaN                    NaN  NaN  NaN  NaN
2                    NaN  NaN  NaN  NaN                    NaN  NaN  NaN  NaN
3                    NaN  NaN  NaN  NaN                    NaN  NaN  NaN  NaN
4                    NaN  NaN  NaN  NaN                    NaN  NaN  NaN  NaN

      dater
0  2014.8.6
1  2014.8.6
2  2014.8.6
3  2014.8.6
4  2014.8.6

[5 rows x 777 columns]
print datalist4[1].head()
                 150_1     150_2   150_3  150_4                200_1     200_2
0  2013-12-04 09:00:00  BEST_BID  1639.6     30  2013-12-04 09:00:00  BEST_ASK
1  2013-12-04 09:00:00  BEST_ASK  1641.8    133  2013-12-04 09:00:08  BEST_BID
2  2013-12-04 09:00:01  BEST_BID  1639.5     30  2013-12-04 09:00:08  BEST_ASK
3  2013-12-04 09:00:05  BEST_BID  1639.4     30  2013-12-04 09:00:08  BEST_ASK
4  2013-12-04 09:00:08  BEST_BID  1639.3    133  2013-12-04 09:00:08  BEST_BID

    200_3  200_4                250_1     250_2 ...               2500_1
0  1591.9    133  2013-12-04 09:00:00  BEST_BID ...  2013-12-04 10:29:41
1  1589.4     30  2013-12-04 09:00:00  BEST_ASK ...  2013-12-04 11:59:22
2  1591.6    103  2013-12-04 09:00:01  BEST_BID ...  2013-12-04 11:59:23
3  1591.6    133  2013-12-04 09:00:04  BEST_BID ...  2013-12-04 11:59:26
4  1589.4    133  2013-12-04 09:00:07  BEST_BID ...  2013-12-04 11:59:29

     2500_2  2500_3  2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK    0.35      50  #N/A Invalid Security             NaN
1  BEST_ASK    0.35      11                    NaN             NaN
2  BEST_ASK    0.40      11                    NaN             NaN
3  BEST_ASK    0.45      11                    NaN             NaN
4  BEST_ASK    0.50      21                    NaN             NaN

   Unnamed: 844_3  Unnamed: 844_4         Unnamed: 848_1      dater
0             NaN             NaN  #N/A Invalid Security  2013.12.4
1             NaN             NaN                    NaN  2013.12.4
2             NaN             NaN                    NaN  2013.12.4
3             NaN             NaN                    NaN  2013.12.4
4             NaN             NaN                    NaN  2013.12.4

[5 rows x 850 columns]
Answered by Alexander
I've had performance issues concatenating a large number of DataFrames onto a 'growing' DataFrame. My workaround was to append all sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames was complete.
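A minimal sketch of that pattern (get_sub_dataframes and process are hypothetical placeholders for however the sub-DataFrames are produced and handled):

import pandas as pd

frames = []
for sub in get_sub_dataframes():   # hypothetical source of sub-DataFrames
    frames.append(process(sub))    # hypothetical per-frame processing

# a single concat at the end, instead of growing a DataFrame inside the loop
result = pd.concat(frames, ignore_index=True)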
Answered by Jeff
Looks like you are trying to concat row-wise, even though your text indicates that you want column-wise. Specify axis=1.
Other points to consider:
The copy=False flag will not help at all; this only matters if you are not concatting blocks of the same dtype (which you indicated you are).
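For reference, a sketch of where the flag goes (per the point above, it would not help in this case):

df = pd.concat(frames, axis=1, copy=False)  # frames: a list of DataFrames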
pd.concat does use np.concatenate under the hood. If you think you can do better, then go for it.
import numpy as np
import pandas as pd

def make_frames(n=100, rows=100, cols=100):
    # build n DataFrames of random floats; each frame gets 100 column labels
    # drawn without replacement from range(110), so the frames mostly overlap
    # (note the 100 labels assume the default cols=100)
    return [pd.DataFrame(np.random.randn(rows, cols),
                         columns=np.random.choice(110, 100, replace=False))
            for i in xrange(n)]
In [28]: l = make_frames(rows=10000)
In [29]: l[0].head()
Out[29]:
60 75 101 103 87 29 10 106 71 26 30 83 2 28 99 85 88 62 58 18 42 1 105 25 34 ... 102 27 22 \
0 -0.854117 -0.007549 -0.510359 -0.993757 0.877635 -0.303199 -1.488548 1.179360 0.578095 0.807792 0.169930 -1.781403 0.204696 -0.515057 -0.954246 1.106073 0.666516 -1.146988 1.335709 0.362838 -0.675379 1.483469 0.670385 -0.483312 -0.703795 ... 1.322645 -1.942183 1.053502
1 2.057542 0.860946 -0.037665 -0.347265 0.152562 -0.859537 1.431045 1.306419 0.623013 1.192325 0.909597 1.710507 1.319330 -0.402874 1.749581 1.223489 0.036354 0.140255 0.844330 -0.091447 -0.347245 0.259055 1.187882 -0.216858 -1.421336 ... 1.122068 0.887538 0.205854
2 -0.077974 0.947503 0.688666 0.288104 -1.275329 -0.840847 -2.014090 -1.318507 -0.889416 -0.098005 0.055492 0.847597 -1.289428 -0.910093 0.201312 -1.699879 0.103062 -1.041608 0.379171 -1.089937 0.894626 -1.500215 -0.501182 0.042078 -0.840789 ... 0.539192 0.193256 0.196138
3 0.291993 1.138577 1.061509 0.856553 1.118931 0.725806 -0.689776 1.337957 -1.009835 -0.976506 -0.392317 0.295876 0.092240 0.418201 0.473585 0.013809 -1.169947 0.424797 0.019051 -0.526189 0.066991 -0.268750 1.277004 -0.736560 -0.314987 ... 0.272045 -0.333272 0.573267
4 -2.073985 -0.016950 -1.712770 0.286212 -0.159693 -0.495864 1.286450 -1.168880 1.031456 -3.080568 1.443880 -0.604405 0.406383 -0.162986 1.077255 1.160726 0.943949 -1.517681 -1.049972 1.208850 -0.859617 -0.145358 -0.638898 0.248012 -2.985845 ... -0.699697 0.051352 0.575304
69 76 91 45 14 37 0 81 38 72 107 11 5 73 70 8 90 94 53 3 55 12
0 -0.972965 -0.298674 1.283482 2.344092 -0.597735 -0.407978 0.971726 -0.935620 0.236889 -0.957096 -2.366399 -0.943760 0.293325 -0.240385 -0.392554 -0.887556 0.261402 -2.050122 -1.776865 -1.513899 -0.953916 0.630495
1 -1.471033 0.269830 -0.744507 -0.982779 0.624527 -1.782704 1.197262 -0.297730 1.122939 -1.039226 0.171351 -0.828985 0.698245 0.563430 0.718177 0.682369 1.415918 0.049931 0.648000 1.785455 -0.190021 -1.329753
2 -1.942792 0.560981 -0.353782 -1.637407 -1.495131 -0.593041 -1.617116 -0.910257 -0.506877 0.178378 -0.623986 0.302544 0.279309 -0.266409 0.780306 0.986510 -1.549847 0.063632 -0.480434 1.393221 -1.237682 1.577320
3 0.468151 -1.002872 -0.147329 -0.420609 0.183696 0.527632 0.018911 -2.059989 1.642613 -0.428345 1.350693 -1.323321 -0.247263 0.331525 -2.036862 -2.593575 0.362101 -0.184095 0.419231 -0.633878 0.097499 -0.026044
4 -0.581330 -0.848421 -0.682027 -1.260004 -0.357354 -0.304743 0.409537 -1.189925 -0.609352 -0.610345 -0.798009 0.219822 -0.681764 1.872736 1.738017 0.439148 1.012881 -0.934613 -1.007427 -0.390359 0.329949 0.486906
[5 rows x 100 columns]
Concat; note the use of axis=1 as this is a column-wise concat.
In [31]: df = pd.concat(l,axis=1,ignore_index=True)
In [32]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 10000 entries, 0 to 9999
dtypes: float64(10000)
memory usage: 763.0 MB
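As a sanity check on that figure (added arithmetic, not part of the original session): 10,000 rows of 10,000 float64 values at 8 bytes each comes out to almost exactly the reported usage.

# 10000 rows * 10000 float64 columns * 8 bytes, expressed in MiB
10000 * 10000 * 8 / 2.0 ** 20   # 762.939..., i.e. the ~763 MB shown above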
Timings
In [33]: %timeit pd.concat(l,axis=1,ignore_index=True)
1 loops, best of 3: 1.15 s per loop
In [34]: %memit pd.concat(l,axis=1,ignore_index=True)
peak memory: 2390.25 MiB, increment: 651.28 MiB
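An aside for reproducibility: %memit is not built into IPython; it comes from the memory_profiler package, loaded as an extension.

# pip install memory_profiler, then inside IPython:
%load_ext memory_profiler
%memit pd.concat(l, axis=1, ignore_index=True)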

