Python: MemoryError when I merge two Pandas data frames
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/47386405/
Asked by Ronit Chidara
I have searched almost everywhere on the internet, and somehow none of the approaches seem to work in my case.
I have two large CSV files (each with a million+ rows and about 300-400 MB in size). They load fine into data frames using the read_csv function, without having to use the chunksize parameter. I even performed some minor operations on this data, such as generating new columns and filtering.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain: the operation takes forever.
Mine is a Windows 7 PC with 8 GB of RAM. The Python version is 2.7.
Thank you.
Edit: I tried chunking methods too. When I do that, I don't get a MemoryError, but the RAM usage explodes and my system crashes.
Answered by T_cat
When you merge data using pandas.merge, it needs memory for df1, df2, and the merged result all at the same time. I believe that is why you get a memory error. You should export df2 to a CSV file, then read it back with the chunksize option and merge the data chunk by chunk.
There may be a better way, but you can try this. For a large data set, you can use the chunksize option in pandas.read_csv:
import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# Create an empty bucket with the combined columns and write the CSV header.
# Use index=False so the header matches the appended chunks below.
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index=False)

# For a left join you would first append the rows that only appear in df1:
# df_result = df1[~df1.Colname1.isin(df2.Colname2)]
# df_result.to_csv("df3.csv", mode="a", header=False, index=False)

# Delete df2 to free memory; it will be re-read in chunks below.
del df2

def preprocess(chunk):
    # Merge one chunk of the second file against the full df1
    # and append the result to the output file.
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

# The right chunksize depends on your row width and available memory.
reader = pd.read_csv("yourdata2.csv", chunksize=1000)
for chunk in reader:
    preprocess(chunk)
This will save the merged data to df3.csv.
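If the merged df3.csv is itself too large to load in one go, the same chunked-reading pattern works for consuming it. A minimal sketch (the row count is just a stand-in for whatever per-chunk processing you need):

import pandas as pd

# Stream the merged file back in chunks instead of loading it whole.
total_rows = 0
for chunk in pd.read_csv("df3.csv", chunksize=100000):
    total_rows += len(chunk)  # replace with your real per-chunk processing
print(total_rows)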
Answered by user3062459
The reason you might be getting MemoryError: Unable to allocate.. could be duplicates or blanks in your data frame. Check the column you are joining on (when using merge) and see whether it contains duplicates or blanks. If so, get rid of them with this command:
df.drop_duplicates(subset='column_name', keep=False, inplace=True)  # keep=False removes every row whose key is duplicated
Then re-run your Python/pandas code. This worked for me.
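To see why duplicates on the join key matter: merge matches every occurrence of a key on the left with every occurrence on the right, so repeated keys multiply the output. A small made-up illustration:

import pandas as pd

# 1,000 rows sharing one key on each side...
left = pd.DataFrame({"key": ["a"] * 1000, "x": range(1000)})
right = pd.DataFrame({"key": ["a"] * 1000, "y": range(1000)})

# ...merge to 1,000 x 1,000 = 1,000,000 rows: a per-key cross product.
print(len(pd.merge(left, right, on="key")))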
Answered by Neil
In general, the chunked version suggested by @T_cat works great.
However, the memory explosion might be caused by joining on columns that contain NaN values, so you may want to exclude those rows from the join.
see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
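A minimal sketch of dropping NaN keys before the merge; Colname1/Colname2 and the file names reuse the hypothetical names from the first answer:

import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# Rows whose join key is NaN can blow up the join (see the issue above),
# so drop them before merging.
merged = pd.merge(
    df1.dropna(subset=["Colname1"]),
    df2.dropna(subset=["Colname2"]),
    left_on="Colname1",
    right_on="Colname2",
)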