Outer merge on large pandas DataFrames causes MemoryError---how to do "big data" merges with pandas?
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/39824952/
Outer merge on large pandas DataFrames causes MemoryError---how to do "big data" merges with pandas?
Asked by ShanZhengYang
I have two pandas DataFrames df1 and df2 with a fairly standard format:
one two three feature
A 1 2 3 feature1
B 4 5 6 feature2
C 7 8 9 feature3
D 10 11 12 feature4
E 13 14 15 feature5
F 16 17 18 feature6
...
And the same format for df2. The sizes of these DataFrames are around 175 MB and 140 MB. When I try to merge these two DataFrames:
merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))
I get the following MemoryError:
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError
Is it possible there is a "size limit" for pandas dataframes when merging? I am surprised that this wouldn't work. Maybe this is a bug in a certain version of pandas?
EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
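To see why this happens, consider a small hypothetical illustration (not from the original post): in a merge, every repeated key on the left is paired with every matching key on the right, so duplicated keys make the output grow multiplicatively.

import pandas as pd

# Hypothetical toy frames: 1,000 rows each, all sharing one 'feature' value.
left = pd.DataFrame({'feature': ['feature1'] * 1000, 'x': range(1000)})
right = pd.DataFrame({'feature': ['feature1'] * 1000, 'y': range(1000)})

# Each of the 1,000 left rows pairs with each of the 1,000 right rows.
merged = pd.merge(left, right, on='feature', how='outer')
print(len(merged))  # 1000000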
The question now is: how can we do this merge? It seems the best way would be to partition the DataFrames somehow.
Answered by jezrael
You can try to first filter df1 by its unique values, then merge, and finally concat the output.
If you need only the outer join, I think there will be a memory problem as well. But if you add some other code to filter the output of each loop, it can work.
import pandas as pd

# here df is the first DataFrame (df1 in the question) and df2 the second
dfs = []
for val in df.feature.unique():
    # merge only the rows of df that share this feature value
    df1 = pd.merge(df[df.feature==val], df2, on='feature', how='outer', suffixes=('','_key'))
    # optional per-chunk filtering keeps memory usage down, e.g.:
    # http://stackoverflow.com/a/39786538/2901002
    # df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
    print(df1)
    dfs.append(df1)

df = pd.concat(dfs, ignore_index=True)
print(df)
Another solution is to use dask.dataframe.DataFrame.merge.
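A minimal sketch of that approach, assuming dask is installed and using the question's df1 and df2; the partition count here is an arbitrary choice, not something prescribed by the answer:

import dask.dataframe as dd

# split the in-memory pandas frames into partitions that dask processes chunk by chunk
ddf1 = dd.from_pandas(df1, npartitions=8)
ddf2 = dd.from_pandas(df2, npartitions=8)

# dask mirrors the pandas merge API but evaluates lazily
merged = dd.merge(ddf1, ddf2, on='feature', how='outer', suffixes=('', '_features'))

result = merged.compute()  # materializes the result as a single pandas DataFrame

Note that compute() still brings the full result into memory; if that is the bottleneck, the merged dask frame can instead be written to disk, for example with to_parquet().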
Answered by Greg Miller
Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:
import numpy as np
df[['one','two', 'three']] = df[['one','two', 'three']].astype(np.int32)
This should reduce memory usage significantly and will hopefully let you perform the merge.
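As a quick sanity check (a small sketch, not part of the original answer), you can compare the DataFrame's memory footprint before and after the downcast:

import numpy as np

print(df.memory_usage(deep=True).sum())  # total bytes before downcasting
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
print(df.memory_usage(deep=True).sum())  # total bytes after downcasting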