How do I release memory used by a pandas dataframe?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/39100971/

Tags: python, pandas, memory

Asked by b10hazard

I have a really large csv file that I opened in pandas as follows....

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran....

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

Answered by Wilfred Hughes

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info().rss / float(2 ** 20)  # resident set size in MiB
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True, so you don't create copies.
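
For example, a minimal sketch (df and the column name 'baz' here are hypothetical):

# modifies df in place; no second dataframe is allocated
df.drop(columns=['baz'], inplace=True)

# compare: this builds a whole new dataframe alongside the old one
df2 = df.drop(columns=['baz'])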

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (the default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.
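
If an object column actually holds numbers, one fix (a sketch; the column name 'bar' is hypothetical) is to downcast it with pd.to_numeric:

# strings like '42' become a compact integer dtype instead of boxed objects
df['bar'] = pd.to_numeric(df['bar'], downcast='integer')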

Whilst numpy supports fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.
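
The answer doesn't mention it, but pandas' category dtype is one common way to do this: each distinct string is stored once and every row keeps only a small integer code. A minimal sketch:

import pandas as pd

s = pd.Series(['low', 'high', 'low', 'low'] * 250000)
print(s.memory_usage(deep=True))                     # object dtype: one Python str per row
print(s.astype('category').memory_usage(deep=True))  # integer codes plus a tiny lookup table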

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB
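
Note that this output is from an older pandas; to_sparse() has since been removed (pandas 1.0). On modern versions the rough equivalent, assuming a float column like foo above, is a sparse dtype:

df1['foo'] = df1['foo'].astype(pd.SparseDtype('float64'))  # NaN is the default fill value for floats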

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
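
Relatedly, df.memory_usage(deep=True) gives a per-column breakdown in bytes, which helps find the offending columns:

df.memory_usage(deep=True)  # one row per column (plus the index); deep=True counts boxed objects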

Answered by Ami Tavory

As noted in the comments, there are some things to try: gc.collect() (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't.
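
In code, that suggestion looks like this (df being whatever frame you are done with):

import gc

del df        # drop the reference first
gc.collect()  # then ask the garbage collector to run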

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then if you do something like

import multiprocessing

result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

Then the function is executed in a different process. When that process completes, the OS reclaims all the resources it used. There's really nothing Python, pandas, or the garbage collector could do to stop that.
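
A self-contained sketch of the same pattern (the dataframe contents and size here are made up for illustration):

import multiprocessing
import pandas as pd

def huge_intermediate_calc(n):
    # the big intermediate dataframe lives only inside the child process
    huge_df = pd.DataFrame({'x': range(n)})
    return huge_df['x'].sum()  # only the small aggregate is pickled back

if __name__ == '__main__':
    with multiprocessing.Pool(1) as pool:
        result = pool.map(huge_intermediate_calc, [10_000_000])[0]
    # the child has exited by now, so the OS has reclaimed all its memory
    print(result)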

Answered by hardi

This solves the problem of releasing the memory for me!!!

import gc

del df_1, df_2   # drop the names so the objects can be collected
gc.collect()

df_1 = pd.DataFrame()   # rebind to fresh, empty dataframes
df_2 = pd.DataFrame()

the dataframes are then explicitly re-bound to empty DataFrames

Answered by Marlon Abeykoon

The dataframe will not be deleted by del df if there are other references to it at the time of deletion. So you need to delete all the references to it for del df to release the memory.

So all the names bound to the dataframe should be deleted to trigger garbage collection.

Use objgraph to check which references are holding onto the objects.
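
A minimal sketch of the reference problem (df2 is a hypothetical second name bound to the same object):

import pandas as pd

df = pd.DataFrame({'a': range(1000000)})
df2 = df   # a second reference to the same object
del df     # not freed: df2 still refers to the dataframe
del df2    # the refcount reaches zero and CPython frees it immediately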

Answered by parvez khan

import matplotlib.pyplot as plt
import datetime as dt

import pandas as pd
import psycopg2
import pandas.io.sql as psql

# read a table straight into a dataframe over a postgres connection
conn = psycopg2.connect("dbname='postgres' user='user' host='100.10.20.600' password='Password'")
dataframe = psql.read_sql("""SELECT * FROM "schema"."dataset_name" """, conn)



Reading and Selecting Data

pd.read_table(filename) - From a delimited text file (like TSV)
pd.read_excel(filename) - From an Excel file
pd.read_sql(query, connection_object) - Reads from a SQL table/database
pd.read_json(json_string) - Reads from a JSON-formatted string, URL or file
df[col] or df.col - Returns the column with label col as a Series
df[[col1, col2]] - Returns columns as a new DataFrame
s.iloc[0] - Selection by position (integer-based indexing)
s.loc[0] - Selection by label (index-based indexing)
df.loc[:, :] and df.iloc[:, :] - The first argument selects rows, the second selects columns
df.ix[0:a, 0:b] - Same argument notation, but returns a rows and (b-1) columns [deprecated in pandas]
df.loc[0:4, ['App','Category']]

Data Cleaning

df.drop([col1, col2, col3], inplace = True, axis=1) - Remove set of column(s)
df.columns = ['a','b','c'] - Renames columns
df.isnull() - Checks for null Values, Returns Boolean DataFrame
df.isnull().any() - Returns boolean value for each column, gives True if any null value detected corresponding to that column
df.dropna() - Drops all rows that contain null values
df.dropna(axis=1) - Drops all columns that contain null values
df.fillna(x) - Replaces all null values with x
s.replace(1,'one') - Replaces all values equal to 1 with 'one'
s.replace([1,3], ['one','three']) - Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns = lambda x: x + '_1') - Mass renaming of columns
df.rename(columns = {'old_name': 'new_name'}) - Selective renaming
df.rename(index = lambda x: x + 1) - Mass renaming of index
df[new_col] = df.col1 + ', ' + df.col2 - Add two columns to create a new column in the same DataFrame

Filter and Sort

df[df[col] > 0.5] - Rows where the values in col > 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
df.sort_values(col1) - Sorts values by col1 in ascending order
df.sort_values(col2,ascending=False) - Sorts values by col2 in descending order
df.sort_values([col1,col2],ascending=[True,False]) - Sorts values by col1 in ascending order then col2 in descending order

df.groupby(col) - Returns a groupby object for values from one column

df.groupby([col1,col2]) - Returns a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() - (Aggregation) Returns the mean of the values in col2, grouped by the values in col1
df.pivot_table(index=col1, values=[col2,col3], aggfunc='mean') - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.apply(np.mean) - Applies a function across each column
df.apply(np.max, axis=1) - Applies a function across each row
df.applymap(lambda x: expression) - Applies the expression to each value of the DataFrame
df[col].map(lambda x: expression) - Applies the expression to each value of the column col


String Methods (via the Series.str accessor)

swapcase() - Swaps the case lower/upper.
lower() / upper() - Converts strings in the Series/Index to lower / upper case.
len() - Computes String length.
strip() - Helps strip whitespace(including newline) from each string in the Series/index from both the sides.
split(' ') - Splits each string with the given pattern.
cat(sep=' ') - Concatenates the series/index elements with given separator.
get_dummies() - Returns the DataFrame with One-Hot Encoded values.
contains(pattern) - Returns Boolean True for each element if the substring contains in the element, else False.
replace(a,b) - Replaces the value a with the value b.
repeat(value) - Repeats each element with specified number of times.
count(pattern) - Returns count of appearance of pattern in each element.
startswith(pattern) / endswith(pattern) - Returns true if the element in the Series/Index starts / ends with the pattern.
find(pattern) - Returns the first position of the first occurrence of the pattern. Returns -1 if not found.
findall(pattern) - Returns a list of all occurrence of the pattern.
islower() / isupper() / isnumeric() - Checks whether all characters in each string in the Series/Index in lower / upper case / numeric or not. Returns Boolean.

Combining DataFrames

df1.append(df2) OR pd.concat([df1, df2], axis=0) - Adds the rows of df2 to the end of df1 (columns should be identical)
pd.concat([df1, df2], axis=1) - Adds the columns of df2 to the end of df1 (rows should be identical)
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)

Stats

df.mean() - Returns the mean of all columns
df.corr() - Returns the correlation between columns in a DataFrame
df.count() - Returns the number of non-null values in each DataFrame column
df.max() - Returns the highest value in each column
df.min() - Returns the lowest value in each column
df.median() - Returns the median of each column
df.std() - Returns the standard deviation of each column

Date


dataframe['start_time'] = pd.to_datetime(dataframe['start_time'])
dataframe['end_time'] = pd.to_datetime(dataframe['end_time'])

dataframe['month'] = dataframe['start_time'].dt.month_name()
dataframe['start_hour'] = dataframe['start_time'].dt.hour

print(pd.merge(left, right, on='subject_id', how='left'))

# drop all rows where Sport == 'Swimming'
dd = dd.drop(dd[dd.Sport == 'Swimming'].index)

# count events and medals per (Name, Team); sort by Event desc, then Medal asc
df_merge3 = df_merge4.groupby(['Name', 'Team']).agg({'Event': 'count', 'Medal': 'count'}).reset_index().sort_values(by=['Event', 'Medal'], ascending=[False, True])
df_merge3=df_merge4.groupby(['Name','Team']).agg({'Event':'count','Medal':'count'}).reset_index().sort_values(by=['Event','Medal'],ascending=[False,True])