Memory error when running medium sized merge function ipython notebook jupyter

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35243057/

Tags: pandas, out-of-memory, ipython

Asked by David Hancock

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using an IPython/Jupyter notebook.

Here is a sample of the data:

    timestamp   Namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003

Each frame is around 1,000 rows long.

Here's the error in detail; I've also included my merge function.

My system is currently using 64% of its memory.

I have searched for similar issues, but it seems most of them involve very large arrays (>1 GB); my data is relatively small by comparison.

EDIT: Something is suspicious. I wrote a beta version of the program earlier to test with 4 dataframes; exporting that through pickle produced a 500 KB file. Now when I try to export the 100-frame version I get a memory error, although it does export a file that is 2 GB. So I suspect that somewhere along the line my code has created some kind of loop, producing a very large file. NB: the 100 frames are stored in a dictionary.

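One quick diagnostic (a sketch, assuming the frames live in a dict called data2, as in the snippets further down) is to print each entry's row count and deep memory usage, largest first; if one entry dwarfs the rest, the blow-up happened in the dataframes themselves rather than in pickle:

# 'data2' is the dict of dataframes; memory_usage(deep=True) also counts
# string contents, which plain object sizes miss.
for name, df in sorted(data2.items(),
                       key=lambda kv: -kv[1].memory_usage(deep=True).sum()):
    mb = df.memory_usage(deep=True).sum() / 1e6
    print('%-20s rows=%-8d mem=%.2f MB' % (name, len(df), mb))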

EDIT2: I have exported the script to a .py file:

http://pastebin.com/GqaHr7xc

There is also a .xlsx that contains the asset names the script needs.

The script fetches data regarding various assets, then cleans it up and saves each asset to a dataframe in a dictionary.

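In outline, the script builds something like the following (a minimal sketch, not the actual pastebin code; fetch_asset_table, the 'assets.xlsx' filename, and the 'Name' column are all hypothetical stand-ins):

import pandas as pd

def fetch_asset_table(asset):
    # Hypothetical placeholder for the real download-and-clean step; the
    # real script returns a frame with 'timestamp' and '<asset>_cap' columns.
    raise NotImplementedError

coins = pd.read_excel('assets.xlsx')['Name'].tolist()  # assumed layout
data2 = {coin: fetch_asset_table(coin) for coin in coins}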

I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Otherwise, please advise on what tests I can run.

EDIT3: I'm finding it really hard to understand why this is happening; the code worked fine in the beta, and all I have done now is add more assets.

EDIT4: I ran a size check on the object (the dict of dataframes) and it is 1,066,793 bytes.

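Worth noting: a size check like sys.getsizeof on a dict measures only the dict object itself, not the dataframes it references, so 1,066,793 bytes is almost certainly an undercount. pandas' memory_usage(deep=True) gives a truer figure; for example:

import sys

print(sys.getsizeof(data2))  # only the dict's own overhead, not the frames

# Total bytes actually held by the dataframes themselves.
total = sum(df.memory_usage(deep=True).sum() for df in data2.values())
print('frames hold %.2f MB' % (total / 1e6))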

EDIT5: The problem is in the merge function at coin 37:

for coin in coins[:37]:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin],
                               left_on='timestamp', right_on='timestamp',
                               how='left')

This is when the error occurs: 'for coin in coins[:36]:' doesn't produce an error, however 'for coin in coins[:37]:' produces the error. Any ideas?

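One pattern that produces exactly this (fine up to frame N, then a sudden blow-up) is duplicated merge keys: pd.merge with how='left' emits one output row per matching pair of keys, so any repeated timestamps multiply the row count of data2['merged'] on every pass through the loop. A quick check, sketched on the assumption that each frame should have one row per date:

# Any frame reporting a nonzero count here will multiply the rows of
# data2['merged'] each time it is merged in.
for coin in coins:
    dupes = data2[coin]['timestamp'].duplicated().sum()
    if dupes:
        print(coin, 'has', dupes, 'duplicated timestamps')

# Printing len(data2['merged']) after each merge also pinpoints the
# culprit: with unique keys the length should stay constant.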

EDIT6: The 36th element is 'Syscoin'. I did coins.remove('Syscoin'), however the memory problem still occurs. So it seems to be a problem with the 36th element of coins, no matter which coin it is.

EDIT7: goCards' suggestions seemed to work, however the next part of the code:

merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp', axis=1).sum(axis=1)

produces a memory error. I'm stumped.

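If data2['merged'] has silently grown to millions of rows, any whole-frame operation will fail in the same way, so the row count is worth checking first. If the frame itself is a sane size, a lighter variant (a sketch, not a guaranteed fix) lets sum skip the timestamp column instead of building a dropped copy of the frame first:

# numeric_only=True ignores 'timestamp' without materialising the
# intermediate copy that .drop() creates.
merged['Total_MC'] = merged.sum(axis=1, numeric_only=True)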

Answered by goCards

In regard to storage, I would recommend using a simple CSV over pickle. CSV is a more generic format: it is human readable, and you can check your data quality more easily, especially as your data grows.

file_template_string = '%s.csv'
for eachKey in dfDict:
    filename = file_template_string % eachKey
    dfDict[eachKey].to_csv(filename)
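
Reading the frames back in is symmetric; assuming to_csv wrote the default integer index, index_col=0 restores it (a hedged example reusing the names above):

import pandas as pd

restored = {eachKey: pd.read_csv(file_template_string % eachKey, index_col=0)
            for eachKey in dfDict}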

If you need to date the files, you can also put a timestamp in the filename.

import time
from datetime import datetime

cur = time.time()
cur = datetime.fromtimestamp(cur)
# Produces names like 'Namecoin_02_06_2016_14_30_00.csv'.
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))

There are some obvious errors in your code.

for coin in coins:  # line 61, 89
for coin in data:   # should be

df = data2['Namecoin']  # line 87
keys = list(data2.keys())  # wrap in list() so .remove() works on Python 3
keys.remove('Namecoin')
for coin in keys:
    df = pd.merge(left=df, right=data2[coin], left_on='timestamp',
                  right_on='timestamp', how='left')
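
For what it's worth, the same merge chain can also be written as a single fold over the corrected key list, which avoids hand-maintaining the loop; a sketch using functools.reduce, under the same assumption of unique timestamps:

from functools import reduce
import pandas as pd

keys = [k for k in data2 if k != 'Namecoin']
df = reduce(lambda left, right: pd.merge(left, right, on='timestamp', how='left'),
            (data2[k] for k in keys),
            data2['Namecoin'])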

Answered by VishnuVardhanA

The same issue happened to me: a "MemoryError:" from the notebook when executing pandas code. I had also screen-printed quite a lot of output before the issue happened.

Reinstalling Anaconda didn't help. Later I realized that I was working with an IPython notebook instead of a Jupyter notebook. I switched to a Jupyter notebook and everything worked fine!
