当使用“pandas.read_hdf()”读取巨大的 HDF5 文件时，为什么即使我通过指定 chunksize 分块读取，仍然会收到 MemoryError？
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/30587026/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
When reading huge HDF5 file with "pandas.read_hdf() ", why do I still get MemoryError even though I read in chunks by specifying chunksize?
提问 by Ewan
Problem description:
问题描述:
I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10 GB. The problem happens when reading it back: even though I try to read it in chunks, I still get a MemoryError.
我使用 Python pandas 读取几个大型 CSV 文件并将它们存储到一个 HDF5 文件中，生成的 HDF5 文件大约为 10 GB。问题出现在回读时：即使我尝试分块读回，仍然会收到 MemoryError。
Here is how I create the HDF5 file:
我是这样创建 HDF5 文件的：
import glob, os
import numpy as np
import pandas as pd

hdf = pd.HDFStore('raw_sample_storage2.h5')
os.chdir("C:/RawDataCollection/raw_samples/PLB_Gate")
for filename in glob.glob("RD_*.txt"):
    raw_df = pd.read_csv(filename,
                         sep=' ',
                         header=None,
                         names=['time', 'GW_time', 'node_id', 'X', 'Y', 'Z', 'status', 'seq', 'rssi', 'lqi'],
                         dtype={'GW_time': np.uint32, 'node_id': np.uint8, 'X': np.uint16, 'Y': np.uint16,
                                'Z': np.uint16, 'status': np.uint8, 'seq': np.uint8, 'rssi': np.int8, 'lqi': np.uint8},
                         parse_dates=['time'],
                         date_parser=dateparse,   # dateparse is a user-defined parser, not shown in the question
                         chunksize=50000,
                         skip_blank_lines=True)
    for chunk in raw_df:
        # 'complib' is the pandas keyword for the compression library (the original post passed compression='blosc')
        hdf.append('raw_sample_all', chunk, format='table', data_columns=True, index=True,
                   complib='blosc', complevel=9)
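Note that the dateparse function passed to read_csv above is not shown in the question. Purely as a hypothetical sketch (the function name is taken from the call above, and the timestamp format is guessed from the sample output further down), it might look something like this:
注意：上面传给 read_csv 的 dateparse 函数在问题中并未给出。下面只是一个假设性的示例（函数名取自上面的调用，时间格式是根据后文的示例输出推测的）：

import pandas as pd

def dateparse(s):
    # hypothetical parser for timestamps like '2013-10-22 17:20:58'
    return pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')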
Here is how I try to read it back in chunks:
这是我尝试分块读回的方式：
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', chunksize=300000):
    print(df.head(1))
Here is the error message I got:
这是我收到的错误消息:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-7-ef278566a16b> in <module>()
----> 1 for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
2 print(df.head(1))
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_hdf(path_or_buf, key, **kwargs)
321 store = HDFStore(path_or_buf, **kwargs)
322 try:
--> 323 return f(store, True)
324 except:
325
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store, auto_close)
303
304 f = lambda store, auto_close: store.select(
--> 305 key, auto_close=auto_close, **kwargs)
306
307 if isinstance(path_or_buf, string_types):
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
663 auto_close=auto_close)
664
--> 665 return it.get_result()
666
667 def select_as_coordinates(
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in get_result(self, coordinates)
1346 "can only use an iterator or chunksize on a table")
1347
-> 1348 self.coordinates = self.s.read_coordinates(where=self.where)
1349
1350 return self
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_coordinates(self, where, start, stop, **kwargs)
3545 self.selection = Selection(
3546 self, where=where, start=start, stop=stop, **kwargs)
-> 3547 coords = self.selection.select_coords()
3548 if self.selection.filter is not None:
3549 for field, op, filt in self.selection.filter.format():
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select_coords(self)
4507 return self.coordinates
4508
-> 4509 return np.arange(start, stop)
4510
4511 # utilities ###
MemoryError:
My python environment:
我的python环境:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Edit 1:
编辑1:
It took about half an hour for the MemoryError to happen after executing read_hdf(). In the meanwhile I checked taskmgr: there was little CPU activity, and the total memory used never exceeded 2.2 GB (it was about 2.1 GB before I executed the code). So whatever pandas read_hdf() loaded into RAM was less than 100 MB. (I have 4 GB of RAM, my 32-bit Windows system can only use 2.7 GB of it, and I used the rest for a RAM disk.)
执行 read_hdf() 后，大约过了半小时才出现 MemoryError。在此期间我查看了任务管理器：CPU 几乎没有活动，总内存占用从未超过 2.2 GB（执行代码前大约是 2.1 GB）。也就是说，不管 pandas read_hdf() 往 RAM 里加载了什么，都不到 100 MB。（我有 4 GB 内存，32 位 Windows 系统只能使用其中的 2.7 GB，其余部分被我用作 RAM 盘。）
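Since what matters here is the address space of the 32-bit Python process itself (not how much physical RAM is free), a quick way to confirm which build is running is the following small check (not part of the original post):
这里起决定作用的是 32 位 Python 进程本身的地址空间（而不是还剩多少物理内存）。下面这个小检查可以确认当前运行的是 32 位还是 64 位 Python（非原帖内容）：

import platform, sys

print(platform.architecture()[0])   # '32bit' or '64bit'
print(sys.maxsize > 2**32)          # False on a 32-bit build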
Here's the hdf file info:
这是hdf文件信息:
In [2]:
hdf = pd.HDFStore('raw_sample_storage2.h5')
hdf
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: C:/RawDataCollection/raw_samples/PLB_Gate/raw_sample_storage2.h5
/raw_sample_all frame_table (typ->appendable,nrows->308581091,ncols->10,indexers->[index],dc->[time,GW_time,node_id,X,Y,Z,status,seq,rssi,lqi])
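For reference, the row count shown above (nrows->308581091) can also be read programmatically; this is the same call the accepted answer relies on below (a small sketch, assuming the store is still open):
顺带一提，上面显示的行数（nrows->308581091）也可以用代码直接读出来，这正是下面采纳答案用到的调用（小示例，假设 store 仍处于打开状态）：

nrows = hdf.get_storer('raw_sample_all').nrows   # 308581091 for this file
print(nrows)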
Moreover, I can read a portion of the HDF file by specifying 'start' and 'stop' instead of 'chunksize':
此外，我可以通过指定 'start' 和 'stop'（而不是 'chunksize'）来读取 HDF 文件的一部分：
%%time
df = pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000)
print df.info()
print(df.head(5))
The execution only took 4 seconds, and the output is:
执行只用了4秒,输出为:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 49999
Data columns (total 10 columns):
time 300000 non-null datetime64[ns]
GW_time 300000 non-null uint32
node_id 300000 non-null uint8
X 300000 non-null uint16
Y 300000 non-null uint16
Z 300000 non-null uint16
status 300000 non-null uint8
seq 300000 non-null uint8
rssi 300000 non-null int8
lqi 300000 non-null uint8
dtypes: datetime64[ns](1), int8(1), uint16(3), uint32(1), uint8(4)
memory usage: 8.9 MB
None
time GW_time node_id X Y Z status seq \
0 2013-10-22 17:20:58 39821761 3 20010 21716 22668 0 33
1 2013-10-22 17:20:58 39821824 4 19654 19647 19241 0 33
2 2013-10-22 17:20:58 39821888 1 16927 21438 22722 0 34
3 2013-10-22 17:20:58 39821952 2 17420 22882 20440 0 34
4 2013-10-22 17:20:58 39822017 3 20010 21716 22668 0 34
rssi lqi
0 -43 49
1 -72 47
2 -46 48
3 -57 46
4 -42 50
Wall time: 4.26 s
Noticing that 300000 rows only took 8.9 MB of RAM, I tried to use chunksize together with start and stop:
注意到 300000 行只占用了 8.9 MB 内存，我尝试将 chunksize 与 start 和 stop 一起使用：
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000, chunksize=3000):
    print df.info()
    print(df.head(5))
The same MemoryError happens.
发生了相同的 MemoryError。
I don't understand what's happening here. If the internal mechanism somehow ignores chunksize/start/stop and tries to load the whole thing into RAM, how come there is almost no increase in RAM usage (only about 100 MB) when the MemoryError happens? And why does the execution take half an hour just to reach an error at the very beginning of the process, without noticeable CPU usage?
我不明白这里发生了什么：如果内部机制以某种方式忽略了 chunksize/start/stop 并试图把整个文件加载进 RAM，那为什么 MemoryError 发生时 RAM 占用几乎没有增加（只有约 100 MB）？为什么在没有明显 CPU 占用的情况下，程序要运行半个小时才在流程一开始就报错？
采纳答案 by Jeff
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True. These are row numbers. In this case there is no where clause, but we still use the indexer, which in this case is simply np.arange on the list of rows.
迭代器主要是为处理 where 子句而构建的。PyTables 会返回子句为 True 的索引列表，这些都是行号。在这个例子中没有 where 子句，但我们仍然使用索引器，此时它只是对整个行号列表做一次 np.arange。
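For context, a where clause is the filter expression you can pass to select on a table written with data_columns=True, as this one was. A purely illustrative sketch (column name taken from the question, not code from the answer):
补充说明：where 子句就是可以传给 select 的过滤表达式，前提是表以 data_columns=True 写入（本例正是如此）。下面只是一个示意性的例子（列名取自问题，并非答案中的代码）：

store = pd.HDFStore('raw_sample_storage2.h5')
# PyTables evaluates the condition and returns the matching row numbers
subset = store.select('raw_sample_all', where='node_id == 3', start=0, stop=300000)
store.close()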
300 million rows take 2.2 GB, which is too much for 32-bit Windows (a process there generally maxes out around 1 GB). On 64-bit this would be no problem.
3 亿行大约需要 2.2 GB，这对 32 位 Windows（单个进程通常最多只能用到 1 GB 左右）来说太多了。在 64 位系统上这不成问题。
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
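Plugging in the actual row count from the store info above (308,581,091 rows, 8 bytes per int64 index) gives essentially the same figure:
代入前面 store 信息中的实际行数（308,581,091 行，每个 int64 索引 8 字节），得到的数字基本相同：

308581091 * 8 / (1024.0**3)   # ≈ 2.30 GB for the index array alone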
So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.
所以这应该通过切片语义来处理，这样只会占用很少的内存。相关 issue 已在这里提交。
So I would suggest this approach. Here the indexer is computed directly, and this provides iterator semantics.
所以我建议采用下面的做法：直接计算索引位置，同样可以得到迭代器式的语义。
In [1]: import numpy as np, pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))

In [3]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

In [4]: store = pd.HDFStore('test.h5')

In [5]: nrows = store.get_storer('df').nrows

In [6]: chunksize = 100

In [7]: for i in xrange(nrows // chunksize + 1):    # xrange: Python 2 (use range on Python 3)
   ...:     chunk = store.select('df',
   ...:                          start=i*chunksize,
   ...:                          stop=(i+1)*chunksize)
   ...:     # work on the chunk

In [8]: store.close()
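Applied to the file from the question, the same pattern would look roughly like this (a sketch only; the file name, key, and chunk size are taken from the question, and each slice should load only about as much as the 8.9 MB measured above):
套用到问题中的文件上，同样的写法大致如下（仅为示意；文件名、key 和块大小取自问题本身，每个分片占用的内存应与前面测得的 8.9 MB 相当）：

import pandas as pd

filename = 'raw_sample_storage2.h5'
key = 'raw_sample_all'

store = pd.HDFStore(filename)
nrows = store.get_storer(key).nrows        # 308581091 for the file in the question
store.close()

chunksize = 300000
for i in range(nrows // chunksize + 1):
    chunk = pd.read_hdf(filename, key,
                        start=i * chunksize,
                        stop=(i + 1) * chunksize)
    # work on the chunk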

