Original source: http://stackoverflow.com/questions/18089667/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
How to estimate how much memory a Pandas DataFrame will need?
Asked by Anne
I have been wondering... If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel of data frames and memory...
Answered by Viktor Kerkez
Yes there is. Pandas will store your data in 2-dimensional numpy ndarray structures, grouping them by dtypes. ndarray is basically a raw C array of data with a small header. So you can estimate its size just by multiplying the size of the dtype it contains by the dimensions of the array.
For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2x1000 np.int32 array and one 5x1000 np.float64 array, which is:
4bytes*2*1000 + 8bytes*5*1000 = 48000 bytes
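A minimal sketch of that arithmetic, using np.dtype(...).itemsize to look up the per-element size (the column counts below are just the example numbers from above):

import numpy as np

# Example from above: 1000 rows, 2 int32 columns and 5 float64 columns.
n_rows = 1000
columns_per_dtype = {np.dtype("int32"): 2, np.dtype("float64"): 5}

# Each homogeneous block is roughly itemsize * n_columns * n_rows bytes.
estimated_bytes = sum(dtype.itemsize * n_cols * n_rows
                      for dtype, n_cols in columns_per_dtype.items())

print(estimated_bytes)  # 4*2*1000 + 8*5*1000 = 48000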
Answered by Phillip Cloud
If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data + some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing
nbytes = sum(block.values.nbytes for block in df.blocks.values())
object dtype arrays store 8 bytes per object (object dtype arrays store a pointer to an opaque PyObject), so if you have strings in your csv you need to take into account that read_csv will turn those into object dtype arrays and adjust your calculations accordingly.
EDIT:
See the numpy scalar types page for more details on the object dtype. Since only a reference is stored, you need to take into account the size of the object in the array as well. As that page says, object arrays are somewhat similar to Python list objects.
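As a rough sketch of that accounting (the sample DataFrame below is made up for illustration, and a per-column sum is used instead of the internal blocks): nbytes counts only the 8-byte pointers of an object column, so the Python string objects have to be added on top, e.g. with sys.getsizeof.

import sys
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000, dtype=np.int64),
                   "b": ["some short string %d" % i for i in range(1000)]})

# nbytes covers the numeric column fully, but only the pointers of column "b".
pointer_bytes = sum(df[col].values.nbytes for col in df.columns)

# Rough correction: add the size of each Python object behind those pointers.
object_bytes = sum(sys.getsizeof(v) for v in df["b"])

print(pointer_bytes + object_bytes)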
Answered by Jeff
You have to do this in reverse.
In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')
In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug 6 16:55 test.csv
Technically, memory usage is about this (which includes the indexes)
In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160
So 168MB in memory with a 400MB file, 1M rows of 20 float columns
DataFrame(randn(1000000,20)).to_hdf('test.h5','df')
!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug 6 16:57 test.h5
MUCH more compact when written as a binary HDF5 file
In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')
In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug 6 16:58 test.h5
The data was random, so compression doesn't help too much
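The snippets above are from an interactive session; a self-contained version of the same comparison might look like this (the HDF5 part assumes the optional tables package is installed, and you can shrink the array if you just want to see the mechanics):

import os
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 20))

# csv on disk vs. the same data in memory (values plus both axes).
df.to_csv("test.csv")
print("csv on disk:  ", os.path.getsize("test.csv"))
print("in memory:    ", df.values.nbytes + df.index.nbytes + df.columns.nbytes)

# Binary HDF5, uncompressed and blosc-compressed.
df.to_hdf("test.h5", key="df")
df.to_hdf("test_blosc.h5", key="df", complevel=9, complib="blosc")
print("hdf5 on disk: ", os.path.getsize("test.h5"))
print("blosc hdf5:   ", os.path.getsize("test_blosc.h5"))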
Answered by firelynx
I thought I would bring some more data to the discussion.
I ran a series of tests on this issue.
By using the python resource package I got the memory usage of my process.
And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.
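A minimal sketch of that measurement setup (assuming Python 3, where StringIO lives in io; the resource module is Unix-only, and ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import resource
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10))

# Peak resident memory of the whole process so far.
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Size of the same data as csv text (length ~= bytes for ASCII data).
buf = StringIO()
df.to_csv(buf)
csv_size = len(buf.getvalue())

print(peak_rss, csv_size)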
I ran two experiments, each one creating 20 dataframes of increasing sizes between 10,000 lines and 1,000,000 lines. Both having 10 columns.
In the first experiment I used only floats in my dataset.
This is how the memory increased in comparison to the csv file as a function of the number of lines. (Size in Megabytes)
In the second experiment I used the same approach, but the data in the dataset consisted of only short strings.
It seems that the relation between the size of the csv and the size of the dataframe can vary quite a lot, but the size in memory will always be bigger, by a factor of 2-3 (for the frame sizes in this experiment).
I would love to complete this answer with more experiments, please comment if you want me to try something special.
Answered by Aleksey Sivokon
df.memory_usage() will return how much each column occupies:
>>> df.memory_usage()
Row_ID 20906600
Household_ID 20906600
Vehicle 20906600
Calendar_Year 20906600
Model_Year 20906600
...
To include indexes, pass index=True.
So to get overall memory consumption:
>>> df.memory_usage(index=True).sum()
731731000
Also, passing deep=True will enable a more accurate memory usage report that accounts for the full usage of the contained objects.
This is because memory usage does not include memory consumed by elements that are not components of the array if deep=False (the default case).
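A quick illustration of the difference on a toy frame with a string column (exact numbers will vary by platform and pandas version):

import pandas as pd

df = pd.DataFrame({"ints": range(1000),
                   "strs": ["row %d" % i for i in range(1000)]})

# Shallow: the object column is counted as just 8 bytes per row (the pointers).
print(df.memory_usage(index=True).sum())

# Deep: also counts the Python string objects behind those pointers.
print(df.memory_usage(index=True, deep=True).sum())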
Answered by Zaher Abdul Azeez
I believe this gives the in-memory size of any object in Python. Internals need to be checked with regard to pandas and numpy.
>>> import sys
# assuming the dataframe is named df
>>> sys.getsizeof(df)
59542497
Answered by Brian Burns
Here's a comparison of the different methods - sys.getsizeof(df) is simplest.
For this example, df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects) - read from a 427kb shapefile
sys.getsizeof(df)
>>> import sys
>>> sys.getsizeof(df)
(gives results in bytes)
462456
df.memory_usage()
>>> df.memory_usage()
...
(lists each column at 8 bytes/row)

>>> df.memory_usage().sum()
71712
(roughly rows * cols * 8 bytes)

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
(gives results in bytes)
462432
df.info()
Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes - as the docstring says, "Memory usage is shown in human-readable units (base-2 representation)." So to get bytes you would multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.
>>> df.info()
...
memory usage: 70.0+ KB

>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB