Memory error when running medium sized merge function ipython notebook jupyter

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35243057/

Tags: pandas, out-of-memory, ipython

Asked by David Hancock

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using an IPython/Jupyter notebook.

Here is a sample of the data:

    timestamp   Namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003

Each frame is around 1,000 rows long.

Here's the error in detail; I've also included my merge function.

My system is currently using 64% of its memory.

I have searched for similar issues, but it seems most of them involve very large arrays (>1 GB); my data is relatively small by comparison.

EDIT: Something is suspicious. I wrote a beta version of the program earlier to test with 4 dataframes; exporting that through pickle produced a 500 KB file. Now when I try to export the 100-frame version I get a memory error, although it does export a file that is 2 GB. So I suspect that somewhere along the line my code has created some kind of loop, producing a very large file. NB: the 100 frames are stored in a dictionary.

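One quick diagnostic (a sketch, assuming the frames live in a dict called data2, as in the snippets further down) is to print each entry's row count and deep memory usage, largest first; if one entry dwarfs the rest, the blow-up happened in the dataframes themselves rather than in pickle:

# 'data2' is the dict of dataframes; memory_usage(deep=True) also counts
# string contents, which plain object sizes miss.
for name, df in sorted(data2.items(),
                       key=lambda kv: -kv[1].memory_usage(deep=True).sum()):
    mb = df.memory_usage(deep=True).sum() / 1e6
    print('%-20s rows=%-8d mem=%.2f MB' % (name, len(df), mb))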

EDIT2: I have exported the script to a .py file:

http://pastebin.com/GqaHr7xc

There is also a .xlsx that contains the asset names the script needs.

The script fetches data regarding various assets, then cleans it up and saves each asset to a dataframe in a dictionary.

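In outline, the script builds something like the following (a minimal sketch, not the actual pastebin code; fetch_asset_table, the 'assets.xlsx' filename, and the 'Name' column are all hypothetical stand-ins):

import pandas as pd

def fetch_asset_table(asset):
    # Hypothetical placeholder for the real download-and-clean step; the
    # real script returns a frame with 'timestamp' and '<asset>_cap' columns.
    raise NotImplementedError

coins = pd.read_excel('assets.xlsx')['Name'].tolist()  # assumed layout
data2 = {coin: fetch_asset_table(coin) for coin in coins}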

I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Otherwise, please advise on what tests I can run.

EDIT3: I'm finding it really hard to understand why this is happening; the code worked fine in the beta, and all I have done now is add more assets.

EDIT4: I ran a size check on the object (the dict of dataframes) and it is 1,066,793 bytes.

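Worth noting: a size check like sys.getsizeof on a dict measures only the dict object itself, not the dataframes it references, so 1,066,793 bytes is almost certainly an undercount. pandas' memory_usage(deep=True) gives a truer figure; for example:

import sys

print(sys.getsizeof(data2))  # only the dict's own overhead, not the frames

# Total bytes actually held by the dataframes themselves.
total = sum(df.memory_usage(deep=True).sum() for df in data2.values())
print('frames hold %.2f MB' % (total / 1e6))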

EDIT5: The problem is in the merge function at coin 37:

for coin in coins[:37]:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin],
                               left_on='timestamp', right_on='timestamp',
                               how='left')

This is when the error occurs: 'for coin in coins[:36]:' doesn't produce an error, however 'for coin in coins[:37]:' produces the error. Any ideas?

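One pattern that produces exactly this (fine up to frame N, then a sudden blow-up) is duplicated merge keys: pd.merge with how='left' emits one output row per matching pair of keys, so any repeated timestamps multiply the row count of data2['merged'] on every pass through the loop. A quick check, sketched on the assumption that each frame should have one row per date:

# Any frame reporting a nonzero count here will multiply the rows of
# data2['merged'] each time it is merged in.
for coin in coins:
    dupes = data2[coin]['timestamp'].duplicated().sum()
    if dupes:
        print(coin, 'has', dupes, 'duplicated timestamps')

# Printing len(data2['merged']) after each merge also pinpoints the
# culprit: with unique keys the length should stay constant.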

EDIT6: The 36th element is 'Syscoin'. I did coins.remove('Syscoin'), however the memory problem still occurs. So it seems to be a problem with the 36th element of coins, no matter which coin it is.

EDIT7: goCards' suggestions seemed to work, however the next part of the code:

merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp', axis=1).sum(axis=1)

produces a memory error. I'm stumped.

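If data2['merged'] has silently grown to millions of rows, any whole-frame operation will fail in the same way, so the row count is worth checking first. If the frame itself is a sane size, a lighter variant (a sketch, not a guaranteed fix) lets sum skip the timestamp column instead of building a dropped copy of the frame first:

# numeric_only=True ignores 'timestamp' without materialising the
# intermediate copy that .drop() creates.
merged['Total_MC'] = merged.sum(axis=1, numeric_only=True)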

Answered by goCards

In regard to storage, I would recommend using a simple CSV over pickle. CSV is a more generic format: it is human readable, and you can check your data quality more easily, especially as your data grows.

file_template_string = '%s.csv'
for eachKey in dfDict:
    filename = file_template_string % eachKey
    dfDict[eachKey].to_csv(filename)
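
Reading the frames back in is symmetric; assuming to_csv wrote the default integer index, index_col=0 restores it (a hedged example reusing the names above):

import pandas as pd

restored = {eachKey: pd.read_csv(file_template_string % eachKey, index_col=0)
            for eachKey in dfDict}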

If you need to date the files, you can also put a timestamp in the filename.

import time
from datetime import datetime

cur = time.time()
cur = datetime.fromtimestamp(cur)
# Produces names like 'Namecoin_02_06_2016_14_30_00.csv'.
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))

There are some obvious errors in your code.

for coin in coins:  # line 61, 89
for coin in data:   # should be

df = data2['Namecoin']  # line 87
keys = list(data2.keys())  # wrap in list() so .remove() works on Python 3
keys.remove('Namecoin')
for coin in keys:
    df = pd.merge(left=df, right=data2[coin], left_on='timestamp',
                  right_on='timestamp', how='left')
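
For what it's worth, the same merge chain can also be written as a single fold over the corrected key list, which avoids hand-maintaining the loop; a sketch using functools.reduce, under the same assumption of unique timestamps:

from functools import reduce
import pandas as pd

keys = [k for k in data2 if k != 'Namecoin']
df = reduce(lambda left, right: pd.merge(left, right, on='timestamp', how='left'),
            (data2[k] for k in keys),
            data2['Namecoin'])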

Answered by VishnuVardhanA

The same issue happened to me: a "MemoryError:" from the notebook when executing pandas code. I had also screen-printed quite a lot of output before the issue happened.

Reinstalling Anaconda didn't help. Later I realized that I was working with an IPython notebook instead of a Jupyter notebook. I switched to a Jupyter notebook and everything worked fine!
