Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19085280/

Time: 2020-09-13 21:12:27  Source: igfitidea

Pandas Merge Error: MemoryError

Tags: python, merge, pandas

Asked by agconti

Problem:

I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. I have two datasets of aggregated country trade data that I'm trying to merge on the keys year and country, so the rows need to be matched up by key rather than simply stacked. Unfortunately, this makes it impossible to use concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python.

Here's the setup:

The attempted merge:

df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])

Basic data structure:

i:

   Year  Reporter_Code  Trade_Flow_Code  Partner_Code  Classification  Commodity Code  Quantity Unit Code  Supplementary Quantity  Netweight (kg)   Value  Estimation Code
0  2003            381                2            36              H2          070951                   8                    1274            1274   13810                0
1  2003            381                2            36              H2          070930                   8                   17150           17150   30626                0
2  2003            381                2            36              H2            0709                   8                   20493           20493  635840                0
3  2003            381                1            36              H2            0507                   8                    5200            5200   27619                0
4  2003            381                1            36              H2          050400                   8                   56439           56439  683104                0

df:

   mporter  cod   CC  ComTrade_CC  Distance_miles
0      110  215  215          757         428.989
1      110  215  215          757         428.989
2      110  215  215          757         428.989
3      110  215  215          757         428.989
4      110  215  215          757         428.989

Error Traceback:

 MemoryError                      Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
      1 for i in c_list:
----> 2     df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
    193                                       copy=self.copy)
    194 
--> 195         result_data = join_op.get_result()
    196         result = DataFrame(result_data)
    197 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
    693                 if klass in mapping:
    694                     klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695             res_blk = self._get_merged_block(klass_blocks)
    696 
    697             # if we have a unique result index, need to clear the _ref_locs

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
    706     def _get_merged_block(self, to_merge):
    707         if len(to_merge) > 1:
--> 708             return self._merge_blocks(to_merge)
    709         else:
    710             unit, block = to_merge[0]

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
    728         # Should use Fortran order??
    729         block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730         out = np.empty(out_shape, dtype=block_dtype)
    731 
    732         sofar = 0

MemoryError: 
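
The traceback ends in np.empty(out_shape, ...), which is where the merged output block is allocated, and the loop reassigns df on every pass, so any growth compounds. One possible explanation, not confirmed in the question, is that key values repeated on both sides multiply the row count. The sketch below estimates the size of an inner merge before running it. The helper name estimate_merge_rows and the toy values are assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd


def estimate_merge_rows(left, right, left_on, right_on):
    """Estimate the row count of an inner merge without building it.

    Hypothetical helper, not part of pandas. Rows with missing keys
    are ignored because groupby drops them by default.
    """
    # Count how many rows share each key combination on each side.
    left_counts = left.groupby(left_on).size().reset_index(name="n_left")
    right_counts = right.groupby(right_on).size().reset_index(name="n_right")
    # Merging the two small count tables is cheap: one row per key combination.
    pairs = left_counts.merge(right_counts, left_on=left_on, right_on=right_on)
    # Each matching key contributes n_left * n_right rows to the real merge.
    return int((pairs["n_left"] * pairs["n_right"]).sum())


# Toy frames with made-up values; the key (2003, 36) repeats on both sides.
left = pd.DataFrame({
    "year": [2003, 2003, 2003],
    "ComTrade_CC": [36, 36, 36],
    "Distance_miles": [428.989, 428.989, 428.989],
})
right = pd.DataFrame({
    "Year": [2003, 2003, 2003, 2003, 2004],
    "Partner Code": [36, 36, 36, 36, 36],
    "Value": [1, 2, 3, 4, 5],
})

estimated = estimate_merge_rows(
    left, right,
    left_on=["year", "ComTrade_CC"],
    right_on=["Year", "Partner Code"],
)
print(estimated)
```

If the estimate is far larger than either input, deduplicating or aggregating on the key columns before merging may be worth trying.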

Thanks for your thoughts!

Answered by Gordon Bean

In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.

UPDATE

Example:

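import pandas as pd

df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
# Move the shared keys into a MultiIndex, concatenate column-wise, then restore the keys as columns.
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()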

df1

    A   B   C
0   1   2   3
1   2   3   4

df2

    A   B   D
0   1   2   7
1   2   3   8

both

    A   B   C   D
0   1   2   3   7
1   2   3   4   8

I haven't benchmarked this approach, but it did not raise the MemoryError and it worked for my applications.
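
Applied to the column names from the question, the rename, set_index and concat steps might look like the sketch below. The values are made up, and the sketch assumes each (year, country) pair appears at most once in each frame. That assumption does not hold for the sample rows in the question, where the same year and partner repeat across commodities, so treat this as an illustration of the mechanics rather than a drop-in fix.

```python
import pandas as pd

# Toy frames using the question's column names; the values are made up.
df = pd.DataFrame({
    "year": [2003, 2004],
    "ComTrade_CC": [36, 36],
    "Distance_miles": [428.989, 428.989],
})
i = pd.DataFrame({
    "Year": [2003, 2004],
    "Partner Code": [36, 36],
    "Value": [13810, 30626],
})

keys = ["year", "ComTrade_CC"]

# Step 1: rename the right-hand key columns to match the left-hand ones.
i_renamed = i.rename(columns={"Year": "year", "Partner Code": "ComTrade_CC"})

# Step 2: move the keys into a MultiIndex on both frames.
left = df.set_index(keys)
right = i_renamed.set_index(keys)

# Column-wise concat aligns rows on the index, so this sketch insists on
# unique keys (an assumption made here, not something the answer states).
if not (left.index.is_unique and right.index.is_unique):
    raise ValueError("key columns must be unique in both frames")

# Step 3: concatenate column-wise, then turn the keys back into columns.
both = pd.concat([left, right], axis=1).reset_index()
print(both)
```

When the keys are not unique, aggregating each frame by the key columns first is one way to satisfy the assumption.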
