Maximum size of a Python pandas DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/23569771/
Maximum size of pandas dataframe
Asked by Nils Gudat
I'm trying to read in a somewhat large dataset using pandas' read_csv or read_stata functions, but I keep running into MemoryErrors. What is the maximum size of a DataFrame? My understanding is that DataFrames should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as a .dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
Accepted answer by MattR
I'm going to post this answer as was discussed in the comments. I've seen this question come up numerous times without an accepted answer.
The MemoryError is intuitive: you're out of memory. But sometimes debugging this error is frustrating because you have enough memory, yet the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something the os
module that will search your entire computer and put the output in an excel file)
这可能是一个“愚蠢的步骤”,但这就是为什么它是第一步。确保没有无限循环或故意花费很长时间的事情(例如使用os
将搜索整个计算机并将输出放入excel文件的模块)
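As a concrete illustration, here is a hypothetical sketch (not code from the original answer) of the kind of accidentally expensive pattern this step warns about:

# Hypothetical example of code that will knowingly take a very long time:
# walking the entire filesystem and holding every path in memory
# before dumping it to an Excel file (requires openpyxl).
import os
import pandas as pd

paths = []
for root, dirs, files in os.walk("/"):  # scans the whole machine
    for name in files:
        paths.append(os.path.join(root, name))  # this list grows without bound

pd.DataFrame({"path": paths}).to_excel("all_files.xlsx")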
2) Make your code more efficient
This goes along the lines of step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open-source languages!
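For pandas in particular, one common efficiency win (my own example, not part of the original answer) is to tell read_csv exactly what you need instead of letting it guess:

import pandas as pd

# Load only the columns you actually need, and use narrower dtypes than the
# defaults (int64/float64/object) where the data allows it. The file and
# column names below are placeholders.
df = pd.read_csv(
    "scf2007.csv",
    usecols=["income", "networth", "age"],
    dtype={"income": "float32", "networth": "float32", "age": "int16"},
)
print(df.memory_usage(deep=True).sum(), "bytes")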
3) Check the total memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack Overflow about this, so you can search for them. Popular answers are here and here.

To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the CSV in chunks you can see how much memory is being used per chunk.
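A minimal sketch of that chunked approach (the file name and chunksize are placeholders):

import pandas as pd

chunks = []
for chunk in pd.read_csv("scf2007_ascii.csv", chunksize=1000):
    # memory_usage(deep=True) reports the actual per-column memory of this chunk
    print("chunk uses", chunk.memory_usage(deep=True).sum() / 1024 ** 2, "MiB")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)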
4) Check the memory while running
Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is tedious, but it can be done. IPython is good at this; check their documentation.
Use the code below to see the documentation right inside a Jupyter notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler
def lol(x):
return x
%memit lol(500)
#output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
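Since %mprun only profiles functions defined in a file on disk (not directly in the notebook), here is a minimal sketch of its use, assuming memory_profiler is already loaded as in the sample above; the file and function names are placeholders:

%%file mprun_demo.py
def lol(x):
    return [0] * x  # allocate a list so there is something to measure

Then, in a new cell:

from mprun_demo import lol
%mprun -f lol lol(500000)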
5) This one may come first... but check for simple things like the bit version
As in your case, simply switching the version of Python you were running solved the issue.
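A quick way to check whether you are running a 32-bit or 64-bit Python (a 32-bit process can typically only address a few GB, which a 1.2GB CSV can easily exceed once parsed):

import platform
import sys

print(platform.architecture()[0])  # '32bit' or '64bit'
print("64-bit" if sys.maxsize > 2 ** 32 else "32-bit")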
Usually the above steps solve my issues.