Python: how to estimate how much memory a Pandas DataFrame will need?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/18089667/

How to estimate how much memory a Pandas' DataFrame will need?

python, pandas

Asked by Anne

I have been wondering... If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel of data frames and memory...

Answered by Viktor Kerkez

Yes, there is. Pandas will store your data in 2-dimensional numpy ndarray structures, grouping them by dtype. ndarray is basically a raw C array of data with a small header. So you can estimate its size just by multiplying the size of the dtype it contains by the dimensions of the array.

For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2x1000 np.int32 array and one 5x1000 np.float64 array, which is:

4bytes*2*1000 + 8bytes*5*1000 = 48000 bytes

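To make the arithmetic concrete, here is a minimal sketch that builds the example frame and sums each column's dtype itemsize times the row count; the column names are made up for illustration, and the estimate covers the data blocks only (no index, no headers):

import numpy as np
import pandas as pd

# the example frame: 2 int32 columns and 5 float64 columns, 1000 rows each
df = pd.DataFrame({
    **{f"i{k}": np.zeros(1000, dtype=np.int32) for k in range(2)},
    **{f"f{k}": np.zeros(1000, dtype=np.float64) for k in range(5)},
})

# itemsize of each column's dtype times the number of rows
estimate = sum(df[col].dtype.itemsize * len(df) for col in df.columns)
print(estimate)  # 4*2*1000 + 8*5*1000 = 48000 bytes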

Answered by Phillip Cloud

If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data, plus some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing

nbytes = sum(block.values.nbytes for block in df.blocks.values())

object dtype arrays store 8 bytes per object (object dtype arrays store a pointer to an opaque PyObject), so if you have strings in your csv you need to take into account that read_csv will turn those into object dtype arrays and adjust your calculations accordingly.

EDIT:

See the numpy scalar types page for more details on the object dtype. Since only a reference is stored, you need to take into account the size of the objects in the array as well. As that page says, object arrays are somewhat similar to Python list objects.

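As a rough sketch of that accounting: for object columns, nbytes only counts the 8-byte pointers, so adding sys.getsizeof of each element gives a closer figure. Newer pandas versions no longer expose DataFrame.blocks, so this sums per column instead; shared objects are counted once per reference, so treat the result as an approximation:

import sys
import pandas as pd

def estimated_nbytes(df):
    """Rough byte count per column, including the objects behind object-dtype columns."""
    total = 0
    for col in df.columns:
        values = df[col].values
        total += values.nbytes  # for object dtype this is just the pointers (8 bytes each)
        if values.dtype == object:
            # add the size of every referenced Python object
            total += sum(sys.getsizeof(x) for x in values)
    return total

df = pd.DataFrame({"a": [1, 2, 3], "b": ["foo", "bar", "baz"]})
print(estimated_nbytes(df))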

Answered by Jeff

You have to do this in reverse.

In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

Technically memory is about this (which includes the indexes)

In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160

So roughly 168MB in memory for a 400MB csv file: 1M rows of 20 float columns.

DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file

In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data was random, so compression doesn't help too much

Answered by firelynx

I thought I would bring some more data to the discussion.

I ran a series of tests on this issue.

By using the python resource package I got the memory usage of my process.

And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.

I ran two experiments, each one creating 20 dataframes of increasing size, between 10,000 lines and 1,000,000 lines, both with 10 columns.

In the first experiment I used only floats in my dataset.

This is how the memory usage increased in comparison to the csv file size, as a function of the number of lines. (Size in megabytes.)

[Figure: Memory and CSV size in megabytes as a function of the number of rows with float entries]

In the second experiment I took the same approach, but the data in the dataset consisted only of short strings.

[Figure: Memory and CSV size in megabytes as a function of the number of rows with string entries]

It seems that the relation between the size of the csv and the size of the dataframe can vary quite a lot, but the size in memory will always be bigger, by a factor of 2-3 (for the frame sizes in this experiment).

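A minimal sketch of the measurement approach described above, assuming a Linux system (where resource reports ru_maxrss in kilobytes; macOS reports bytes) and with a made-up helper name:

import resource
from io import StringIO

import numpy as np
import pandas as pd

def measure(n_rows, n_cols=10):
    """Return (csv size in MB, peak process memory in MB) for a random float frame."""
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))

    buf = StringIO()
    df.to_csv(buf)
    csv_mb = len(buf.getvalue().encode("utf-8")) / 1e6

    # ru_maxrss is the peak resident set size of the whole process, not just this frame
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e3
    return csv_mb, peak_mb

for n in (10_000, 100_000, 1_000_000):
    print(n, measure(n))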

I would love to complete this answer with more experiments; please comment if you want me to try something special.

Answered by Aleksey Sivokon

df.memory_usage() will return how much each column occupies:

>>> df.memory_usage()

Row_ID            20906600
Household_ID      20906600
Vehicle           20906600
Calendar_Year     20906600
Model_Year        20906600
...

To include indexes, pass index=True.

So to get overall memory consumption:

>>> df.memory_usage(index=True).sum()
731731000

Also, passing deep=True will enable a more accurate memory usage report that accounts for the full usage of the contained objects.

This is because memory usage does not include memory consumed by elements that are not components of the array if deep=False (default case).

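A quick illustration of the difference; the example frame is made up and the exact numbers depend on platform and pandas version:

import pandas as pd

df = pd.DataFrame({"n": range(1000), "s": ["some string"] * 1000})

print(df.memory_usage(index=True).sum())             # shallow: 8 bytes per object pointer
print(df.memory_usage(index=True, deep=True).sum())  # deep: includes the string objects themselves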

Answered by Zaher Abdul Azeez

I believe this gives the in-memory size of any object in Python. The internals need to be checked with regard to pandas and numpy.

>>> import sys
#assuming the dataframe to be df 
>>> sys.getsizeof(df) 
59542497
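
In recent pandas versions, DataFrame.__sizeof__ is implemented in terms of memory_usage(deep=True), so the two figures should be close; a quick check, assuming a DataFrame named df already exists:

import sys

print(sys.getsizeof(df))                             # adds a small per-object/GC overhead
print(df.memory_usage(index=True, deep=True).sum())  # per-column deep usage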

Answered by Brian Burns

Here's a comparison of the different methods - sys.getsizeof(df) is simplest.

For this example, df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects) - read from a 427kb shapefile

sys.getsizeof(df)

>>> import sys
>>> sys.getsizeof(df)
(gives results in bytes)
462456

df.memory_usage()

>>> df.memory_usage()
...
(lists each column at 8 bytes/row)

>>> df.memory_usage().sum()
71712
(roughly rows * cols * 8 bytes)

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
(gives results in bytes)
462432

df.info()

Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes - as the docstring says, "Memory usage is shown in human-readable units (base-2 representation)." So to get bytes, multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.

>>> df.info()
...
memory usage: 70.0+ KB

>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB
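
To tie the methods together, here is a small sketch (the helper name is made up) that prints the different estimates side by side for any DataFrame:

import sys

def report_memory(df):
    """Print the memory estimates discussed above, in bytes."""
    print("sys.getsizeof:         ", sys.getsizeof(df))
    print("memory_usage (shallow):", df.memory_usage(index=True).sum())
    print("memory_usage (deep):   ", df.memory_usage(index=True, deep=True).sum())
    df.info(memory_usage="deep")  # human-readable summary printed to stdout

# usage: report_memory(df)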