Outer merge on large pandas DataFrames causes MemoryError---how to do "big data" merges with pandas?
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/39824952/
Outer merge on large pandas DataFrames causes MemoryError---how to do "big data" merges with pandas?
Asked by ShanZhengYang
I have two pandas DataFrames df1 and df2 with a fairly standard format:
one two three feature
A 1 2 3 feature1
B 4 5 6 feature2
C 7 8 9 feature3
D 10 11 12 feature4
E 13 14 15 feature5
F 16 17 18 feature6
...
And the same format for df2. The sizes of these DataFrames are around 175 MB and 140 MB. When I try to merge these two DataFrames:
merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))
I get the following MemoryError:
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError
Is it possible there is a "size limit" for pandas dataframes when merging? I am surprised that this wouldn't work. Maybe this is a bug in a certain version of pandas?
EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
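To see why this happens, consider a small hypothetical illustration (not from the original post): in a merge, every repeated key on the left is paired with every matching key on the right, so duplicated keys make the output grow multiplicatively.

import pandas as pd

# Hypothetical toy frames: 1,000 rows each, all sharing one 'feature' value.
left = pd.DataFrame({'feature': ['feature1'] * 1000, 'x': range(1000)})
right = pd.DataFrame({'feature': ['feature1'] * 1000, 'y': range(1000)})

# Each of the 1,000 left rows pairs with each of the 1,000 right rows.
merged = pd.merge(left, right, on='feature', how='outer')
print(len(merged))  # 1000000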
The question now is: how can we do this merge? It seems the best way would be to partition the DataFrames somehow.
Answered by jezrael
You can try to first filter df1 by its unique values, then merge, and finally concat the output.
If you need only the outer join, I think there will be a memory problem as well. But if you add some other code to filter the output of each loop, it can work.
import pandas as pd

# here df is the first DataFrame (df1 in the question) and df2 the second
dfs = []
for val in df.feature.unique():
    # merge only the rows of df that share this feature value
    df1 = pd.merge(df[df.feature==val], df2, on='feature', how='outer', suffixes=('','_key'))
    # optional per-chunk filtering keeps memory usage down, e.g.:
    # http://stackoverflow.com/a/39786538/2901002
    # df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
    print(df1)
    dfs.append(df1)

df = pd.concat(dfs, ignore_index=True)
print(df)
Another solution is to use dask.dataframe.DataFrame.merge.
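A minimal sketch of that approach, assuming dask is installed and using the question's df1 and df2; the partition count here is an arbitrary choice, not something prescribed by the answer:

import dask.dataframe as dd

# split the in-memory pandas frames into partitions that dask processes chunk by chunk
ddf1 = dd.from_pandas(df1, npartitions=8)
ddf2 = dd.from_pandas(df2, npartitions=8)

# dask mirrors the pandas merge API but evaluates lazily
merged = dd.merge(ddf1, ddf2, on='feature', how='outer', suffixes=('', '_features'))

result = merged.compute()  # materializes the result as a single pandas DataFrame

Note that compute() still brings the full result into memory; if that is the bottleneck, the merged dask frame can instead be written to disk, for example with to_parquet().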
Answered by Greg Miller
Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:
import numpy as np
df[['one','two', 'three']] = df[['one','two', 'three']].astype(np.int32)
This should reduce memory usage significantly and will hopefully let you perform the merge.
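As a quick sanity check (a small sketch, not part of the original answer), you can compare the DataFrame's memory footprint before and after the downcast:

import numpy as np

print(df.memory_usage(deep=True).sum())  # total bytes before downcasting
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
print(df.memory_usage(deep=True).sum())  # total bytes after downcasting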