Pandas HDFStore unload dataframe from memory
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/18201042/
Asked by smartexpert
OK, I am experimenting with pandas to load a roughly 30GB csv file, with 40 million+ rows and 150+ columns, into HDFStore. The majority of the columns are strings, followed by numeric and date columns.
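For context, a minimal sketch of what such a chunked load might look like, so the full CSV never has to sit in memory at once (the file and table names here are hypothetical):

import pandas as pd

store = pd.HDFStore('myfile.h5')
# read the CSV in manageable chunks and append each one to a single table
for chunk in pd.read_csv('myfile.csv', chunksize=500000):
    store.append('df', chunk)
store.close()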
I have never really used numpy, pandas or pytables before but have played around with data frames in R.
I am currently just storing a sample file of around 20,000 rows into HDFStore. When I try to read the table from HDFStore, the table is loaded into memory and memory usage goes up by ~100MB:
from pandas import HDFStore

f = HDFStore('myfile.h5')   # open the HDF5 store
g = f['df']                 # read the stored table into a DataFrame
Then I delete the variable containing the DataFrame:
del g
At that point the memory usage decreases by only about 5MB.
If I again load the data into g using g=f['df'], the memory usage shoots up another 100MB.
Cleanup only happens when I actually close the window.
Given the way the data is organized, I am probably going to divide it into individual tables, each with a max size of around 1GB so it can fit into memory, and then use them one at a time. However, this approach will not work if I am not able to clear memory.
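A hedged sketch of that one-table-at-a-time pattern (the table names and the process() step are hypothetical, and the gc.collect() call anticipates the answer below):

import gc
import pandas as pd

store = pd.HDFStore('myfile.h5')
for key in store.keys():    # e.g. '/part_0', '/part_1', ...
    df = store[key]         # load one ~1GB table into memory
    process(df)             # hypothetical per-table processing
    del df
    gc.collect()            # actually release the memory (see the answer below)
store.close()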
Any ideas on how I can achieve this?
Answered by Pythonic
To answer the second point of the OP's question ("how to free memory"):
Short answer
Closing the store and deleting the selected dataframe does not work; however, I found that a call to gc.collect() clears up memory well after you delete the dataframe.
Example
In the example below, memory is cleaned automatically as expected:
import gc
import numpy
import pandas

data = numpy.random.rand(10000, 1000)      # memory up by 78MB
df = pandas.DataFrame(data)                # memory up by 1 MB
store = pandas.HDFStore('test.h5')         # memory up by 3 MB
store.append('df', df)                     # memory up by 9 MB (why?!?!)
del data                                   # no change in memory
del df                                     # memory down by 78 MB
store.close()                              # no change in memory
gc.collect()                               # no change in memory (1)
(1) the store is still in memory, albeit closed
Now suppose we continue from above and reopen store as per below. Memory is cleaned only after gc.collect() is called:
store = pandas.HDFStore('test.h5')         # no change in memory (2) 
df = store.select('df')                    # memory up by 158MB ?! (3)
del df                                     # no change in memory
store.close()                              # no change in memory
gc.collect()                               # memory down by 158 MB (4)
(2) the store never left, (3) I have read that selecting a table might take up as much as 3x the size of the table, (4) the store is still there
Finally, I also tried to do a .copy() of the df on open (df = store.select('df')). Do not do this: it creates a monster in memory that cannot be garbage-collected afterwards.
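In other words, a pattern like this (reconstructed from the description above):

df = store.select('df').copy()   # avoid: this copy cannot be garbage-collected later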
Final question: if a DF in memory is 100MB, I understand it might occupy 2-3x its size in memory while loading, but why does it stay at 200MB in memory after I select it from an HDFStore and close the store?
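For anyone who wants to reproduce memory deltas like the ones in the comments above, one hedged way to measure them is the process RSS via the third-party psutil package (an assumption on my part, not something used in the original answer):

import gc
import os
import pandas
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

store = pandas.HDFStore('test.h5')
before = rss_mb()
df = store.select('df')
print('after select: +%.0f MB' % (rss_mb() - before))
del df
store.close()
gc.collect()
print('after gc.collect: +%.0f MB' % (rss_mb() - before))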

