How to concatenate multiple pandas.DataFrames without running into MemoryError

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/44715393/



python, pandas, memory, memory-management

Asked by bluprince13

I have three DataFrames that I'm trying to concatenate.


concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?


Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem. I have already read my files into DataFrames. I just can't concatenate that data.

Accepted answer by bluprince13

I'm grateful to the community for their answers. However, in my case, I found out that the problem was actually that I was using 32-bit Python.

There are memory limits defined for 32-bit and 64-bit Windows. For a 32-bit process, the limit is only 2 GB. So even if your machine has more than 2 GB of RAM, and even if you're running a 64-bit OS, a 32-bit process is limited to just 2 GB of RAM - in my case that process was Python.

I upgraded to 64-bit Python and haven't had a memory error since!
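
If you're not sure which build you're running, checking the interpreter's pointer size is a quick way to tell (a minimal sketch, not part of the original answer):

import struct
import sys

# 32 on a 32-bit Python build, 64 on a 64-bit build
print(struct.calcsize("P") * 8)
# True on a 64-bit build, False on a 32-bit one
print(sys.maxsize > 2**32)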

Other relevant questions: "Python 32-bit memory limits on 64bit windows", "Should I use Python 32bit or Python 64bit", and "Why is this numpy array too big to load?".

Answer by B. M.

The problem, as noted in the other answers, is one of memory. A solution is to store the data on disk, then build a single dataframe.

With such huge data, performance is an issue.


CSV solutions are very slow, since conversion to text occurs. HDF5 solutions are shorter, more elegant and faster, since they use binary mode. I propose a third binary-mode approach using pickle, which seems to be even faster, though more technical and needing some more room, and a fourth approach, by hand.

Here is the code:

import os      # used below to clean up the temporary files
import pickle  # used by the bypickle helper

import numpy as np
import pandas as pd

# a DataFrame factory:
dfs=[]
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5,4)),columns=range(4)))

# a csv solution
def bycsv(dfs):
    md,hd='w',True
    for df in dfs:
        df.to_csv('df_all.csv',mode=md,header=hd,index=None)
        md,hd='a',False
    #del dfs
    df_all=pd.read_csv('df_all.csv',index_col=None)
    os.remove('df_all.csv') 
    return df_all    

Better solutions:

def byHDF(dfs):
    store=pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df',df,data_columns=list('0123'))
    #del dfs
    df=store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c=[]
    with open('df_all.pkl','ab') as f:
        for df in dfs:
            pickle.dump(df,f)
            c.append(len(df))    
    #del dfs
    with open('df_all.pkl','rb') as f:
        df_all=pickle.load(f)
        offset=len(df_all)
        # preallocate rows for the remaining frames, then fill them in place below
        # (pd.concat instead of DataFrame.append, which was removed in pandas 2.x)
        df_all=pd.concat([df_all,pd.DataFrame(np.empty(sum(c[1:])*4).reshape(-1,4))])

        for size in c[1:]:
            df=pickle.load(f)
            df_all.iloc[offset:offset+size]=df.values 
            offset+=size
    os.remove('df_all.pkl')
    return df_all

For homogeneous dataframes, we can do even better:

def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype                
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n))) 
    os.remove('df_all.bin')
    return df_all

And some tests on (small, 32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course all of that must be improved and tuned to fit your problem.


For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' in order to be able to run bypickle.
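
As a rough sketch of that idea (the chunk size, file name and plain iloc slicing here are my own illustrative choices, not part of the original answer), the large frame could be pickled in row-wise chunks:

# hypothetical sketch: pickle df3 (from the question) in row chunks that each fit in memory
chunk_rows = 10**5  # choose so that one chunk fits in the available memory
with open('df3_chunks.pkl', 'ab') as f:
    for start in range(0, len(df3), chunk_rows):
        pickle.dump(df3.iloc[start:start + chunk_rows], f)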

I can edit this if you give more information on your data structure and size. Beautiful question!

Answer by glegoux

I advise you to put your dataframes into a single csv file by concatenation, then read that csv file back.

Execute this:

# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv (header=False keeps the header row from being written again)
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)

# free memory
del df1, df2, df3

# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

If this solution isn't performant enough, or if you need to concatenate larger files than usual, do:

df1.to_csv('file.csv', index=False)
df2.to_csv('file1.csv', index=False, header=False)  # no header, so the file can be appended as-is
df3.to_csv('file2.csv', index=False, header=False)

del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the csv files in Python:

def concat(file1, file2):
    # append the contents of file2 onto the end of file1
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Afterwards, read the combined file:

df = pd.read_csv('file.csv')

Answer by JohnE

Kinda taking a guess here, but maybe:


df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

Obviously, you could do that as a loop, but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
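
A minimal sketch of that loop (assuming the three frames already exist; the intermediate list simply lets each piece be released as it is folded in):

frames = [df1, df2, df3]
del df1, df2, df3  # the list now holds the only references
concat_df = frames.pop(0)
while frames:
    # after this line the popped frame has no references left and can be freed
    concat_df = pd.concat([concat_df, frames.pop(0)])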

More generally, if you are reading and concatenating, I'd do something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo'+str(i)+'.csv')
    concat_df = pd.concat( [concat_df, temp_df] )

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.


Answer by Pietro Tortella

Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

df1.to_csv(filename)
df2.to_csv(filename, mode='a', header=False)
df3.to_csv(filename, mode='a', header=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)

Answer by Tanu

Dask might be a good option to try for handling large dataframes - go through the Dask Docs.
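
A minimal sketch of how that could look when the data lives in CSV files (the dask package is assumed to be installed, and the file pattern is a placeholder):

import dask.dataframe as dd

# read the parts lazily; nothing is loaded into memory yet
ddf = dd.read_csv('foo*.csv')
# materialize as a single pandas DataFrame (or keep working on ddf out of core)
df = ddf.compute()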

Answer by NickBraunagel

You can store your individual dataframes in an HDF store, and then query that store just like one big dataframe.

import os

# name of the store file
fname = 'my_store'

with pd.HDFStore(fname) as store:  # pd.get_store was removed in newer pandas; HDFStore works the same way here

    # save individual dfs to the store under the same key
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO','BAR','ETC'])  # data_columns = the columns you want to be able to query on

    # access the store as a single df
    df = store.select('df', where=['A>2'])  # change the where condition as required (see documentation for examples)
    # Do other stuff with df #

# the with-block closed the store; delete the file when you're done with it
os.remove(fname)

Answer by Walt Reed

Another option:


1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

2) Open the .csv file, then append df2:

with open('Big File.csv','a') as f:
    df2.to_csv(f, header=False)

3) Repeat step 2 with df3:

with open('Big File.csv','a') as f:
    df3.to_csv(f, header=False)

Answer by Prakhar Agarwal

I've had similar performance issues while trying to concatenate a large number of DataFrames onto a 'growing' DataFrame.

My workaround was to append all the sub-DataFrames to a list, and then concatenate that list of DataFrames once processing of the sub-DataFrames was complete. This cut the runtime almost in half.
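
A minimal sketch of that pattern (the foo*.csv file names are placeholders reused from an earlier answer):

frames = []
for i in range(3):
    # keep each piece only in the list; don't grow a DataFrame inside the loop
    frames.append(pd.read_csv('foo' + str(i) + '.csv'))
concat_df = pd.concat(frames, ignore_index=True)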