Outer merge on large pandas DataFrames causes MemoryError---how to do "big data" merges with pandas?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow CC BY-SA and attribute it to the original authors (not me): http://stackoverflow.com/questions/39824952/



Tags: python, pandas, memory, dataframe, out-of-memory

Asked by ShanZhengYang

I have two pandas DataFrames df1 and df2 with a fairly standard format:


   one  two  three   feature
A    1    2      3   feature1
B    4    5      6   feature2  
C    7    8      9   feature3   
D    10   11     12  feature4
E    13   14     15  feature5 
F    16   17     18  feature6 
...

And the same format for df2. The sizes of these DataFrames are around 175 MB and 140 MB.


merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))

I get the following MemoryError:


File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how) 
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
  File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError

Is it possible there is a "size limit" for pandas dataframes when merging? I am surprised that this wouldn't work. Maybe this is a bug in a certain version of pandas?


EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow

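To check whether duplicate keys are the culprit, you can estimate the size of the merge result before running it. This is a minimal sketch (not from the original answers) that multiplies the per-key row counts of the two frames; for an outer join, keys present in only one frame contribute their own rows, which the fill_value=1 handles:

counts1 = df1['feature'].value_counts()
counts2 = df2['feature'].value_counts()
# Matched keys contribute the product of their counts; keys present in
# only one frame contribute their own count (times the fill value of 1).
estimated_rows = counts1.mul(counts2, fill_value=1).sum()
print(int(estimated_rows))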

The question now is, how can we do this merge? It seems the best way would be to partition the dataframe, somehow.


Answered by jezrael

You can try to first filter df1 by its unique values, then merge each piece, and finally concat the outputs.


If you need only the outer join, I think there may be a memory problem as well. But if you add some other code to filter the output of each loop, it can work.


import pandas as pd

dfs = []
# Merge one key group at a time so only a slice of df1 is joined per iteration.
for val in df1.feature.unique():
    merged = pd.merge(df1[df1.feature == val], df2, on='feature', how='outer', suffixes=('', '_key'))
    # Optionally filter each chunk before keeping it, e.g. as in
    # http://stackoverflow.com/a/39786538/2901002:
    # merged = merged[(merged.start <= merged.start_key) & (merged.end <= merged.end_key)]
    dfs.append(merged)

merged_df = pd.concat(dfs, ignore_index=True)
print(merged_df)
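The idea behind splitting the merge by key is that the expensive join-indexing step (where the traceback above fails) only ever sees one key group at a time; the final result is the same, but the intermediate indexer arrays stay small.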


Another solution is to use dask.dataframe.DataFrame.merge.

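For example, a minimal sketch of the dask route (the npartitions value is an arbitrary choice here, and the column names follow the question):

import dask.dataframe as dd

# Split each pandas DataFrame into partitions that are processed piecewise.
ddf1 = dd.from_pandas(df1, npartitions=8)
ddf2 = dd.from_pandas(df2, npartitions=8)

merged = dd.merge(ddf1, ddf2, on='feature', how='outer', suffixes=('', '_features'))

# compute() materializes the lazy result back into a single pandas DataFrame.
merged_df = merged.compute()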

Answered by Greg Miller

Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:


import numpy as np

df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)

This should reduce the memory usage significantly and will hopefully let you perform the merge.

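To verify the effect, you can compare per-column memory before and after the cast; this snippet is an illustration, not part of the original answer:

# deep=True also accounts for the actual size of object (string) columns.
print(df.memory_usage(deep=True))
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
print(df.memory_usage(deep=True))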