Python Pandas 合并导致内存溢出

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/32750970/

Date: 2020-09-13 23:55:32  Source: igfitidea

Python Pandas Merge Causing Memory Overflow

Tags: python, memory, pandas, merge, out-of-memory

Asked by pefmath

I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?


The data I load is around 85 Megs or so but I often watch my python session run up close to 10 gigs of memory usage then give a memory error.


I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.


Here's what I've done:


Importing the Main data


import requests, zipfile, StringIO
import numpy as np
import pandas as pd 


STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'

r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))

STAR2013=pd.read_csv(z.open(STAR2013fileName))
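(Note that `StringIO.StringIO` only exists in Python 2; under Python 3 the downloaded bytes would go through `io.BytesIO` instead. A minimal sketch of that pattern, using a small zip built in memory, with made-up file name and columns, in place of the real download:)

```python
import io
import zipfile
import pandas as pd

# Build a small zip in memory to stand in for r.content (the downloaded bytes).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('data.txt', 'Subgroup ID,Score\n1,90\n2,80\n')

# Python 3 equivalent of zipfile.ZipFile(StringIO.StringIO(r.content)):
z = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
df = pd.read_csv(z.open('data.txt'))
print(df.shape)  # (2, 2)
```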

Importing some Cross-Referencing Tables


STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))

STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))

STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))

Renaming a Column ID to Allow for Merge


STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013

Successful Merge


merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')

Try a second merge. This is where the Memory Overflow Happens


merged=pd.merge(merged, STARentityList2013, on='School Code')

I did all of this in ipython notebook, but don't think that changes anything.

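(The likely mechanism is duplicated join keys on both sides: `pd.merge` produces one output row for every matching pair of rows, so keys repeated in both frames multiply. A toy illustration with made-up data:)

```python
import pandas as pd

# Made-up frames: the key 1 appears 3 times on the left, 4 times on the right.
left = pd.DataFrame({'School Code': [1, 1, 1, 2], 'x': range(4)})
right = pd.DataFrame({'School Code': [1, 1, 1, 1, 2], 'y': range(5)})

merged = pd.merge(left, right, on='School Code')
# Each of the 3 left rows with key 1 pairs with each of the 4 right rows,
# so the output has 3*4 + 1*1 = 13 rows from only 4 and 5 input rows.
print(len(merged))  # 13
```

On real data with thousands of repeats per key, this multiplication is what turns an 85 MB input into a multi-gigabyte result.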

Answered by mplf

Although this is an old question, I recently came across the same problem.


In my instance, duplicate keys are required in both dataframes, so I needed a method that could tell, ahead of computation, whether a merge would fit into memory, and if not, switch to a different computation method.


The method I came up with is as follows:


Calculate merge size:


def merge_size(left_frame, right_frame, group_by, how='inner'):
    # Row counts per key on each side.
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN != NaN, so this counts the rows whose key is NaN on each side.
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    # Each shared key contributes the product of its row counts on both sides.
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # Keys present on only one side survive only in left/right/outer joins.
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)

Note:


At present, the key can only be a single label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list, which will be far larger than the actual merge size.


If you are using a list of labels for the group_by, the final row size is:


min([merge_size(df1, df2, label, how) for label in group_by])
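(A toy check of the counting logic the function relies on, using made-up frames: for an inner join, each shared key contributes the product of its per-side row counts.)

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a', 'b']})
right = pd.DataFrame({'key': ['a', 'a', 'a', 'c']})

# Per-key row counts on each side.
left_counts = left.groupby('key').size()    # a: 2, b: 1
right_counts = right.groupby('key').size()  # a: 3, c: 1
shared = set(left_counts.index) & set(right_counts.index)  # {'a'}

# Inner-join size: sum of left_count * right_count over shared keys.
predicted = sum(left_counts[k] * right_counts[k] for k in shared)  # 2 * 3 = 6
actual = len(pd.merge(left, right, on='key'))
print(predicted, actual)  # 6 6
```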

Check if this fits in memory


The merge_size function defined here returns the number of rows that will be created by merging two dataframes together.


By multiplying this by the combined column count of both dataframes, then by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can compute the full merge.


import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    # The key column is shared, so subtract one from the combined column count.
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize

    return required_memory <= psutil.virtual_memory().available
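(The arithmetic behind that estimate, with made-up row and column counts:)

```python
import numpy as np

# Estimated footprint: rows * columns * bytes-per-float64.
rows, cols = 1_000_000, 20
required = rows * cols * np.dtype(np.float64).itemsize
print(required)  # 160000000 bytes, i.e. roughly 160 MB
```

This is only a rough bound, since it assumes every column is stored as an 8-byte float; object (string) columns typically take more.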

The merge_size method has been proposed as an extension of pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.
