当使用“pandas.read_hdf()”读取巨大的 HDF5 文件时，为什么即使我通过指定 chunksize 分块读取，仍然会收到 MemoryError？
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/30587026/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
When reading huge HDF5 file with "pandas.read_hdf() ", why do I still get MemoryError even though I read in chunks by specifying chunksize?
提问 by Ewan
Problem description:
问题描述:
I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10 GB. The problem happens when reading it back: even though I try to read it in chunks, I still get a MemoryError.
我使用 Python pandas 读取几个大型 CSV 文件并将它们存储到一个 HDF5 文件中，生成的 HDF5 文件大约为 10 GB。问题出现在回读时：即使我尝试分块读回，仍然会收到 MemoryError。
Here is how I create the HDF5 file:
我是这样创建 HDF5 文件的：
import glob, os
import numpy as np
import pandas as pd

hdf = pd.HDFStore('raw_sample_storage2.h5')
os.chdir("C:/RawDataCollection/raw_samples/PLB_Gate")
for filename in glob.glob("RD_*.txt"):
    raw_df = pd.read_csv(filename,
                         sep=' ',
                         header=None,
                         names=['time', 'GW_time', 'node_id', 'X', 'Y', 'Z', 'status', 'seq', 'rssi', 'lqi'],
                         dtype={'GW_time': np.uint32, 'node_id': np.uint8, 'X': np.uint16, 'Y': np.uint16,
                                'Z': np.uint16, 'status': np.uint8, 'seq': np.uint8, 'rssi': np.int8, 'lqi': np.uint8},
                         parse_dates=['time'],
                         date_parser=dateparse,   # dateparse is a user-defined parser, not shown in the question
                         chunksize=50000,
                         skip_blank_lines=True)
    for chunk in raw_df:
        # 'complib' is the pandas keyword for the compression library (the original post passed compression='blosc')
        hdf.append('raw_sample_all', chunk, format='table', data_columns=True, index=True,
                   complib='blosc', complevel=9)
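Note that the dateparse function passed to read_csv above is not shown in the question. Purely as a hypothetical sketch (the function name is taken from the call above, and the timestamp format is guessed from the sample output further down), it might look something like this:
注意：上面传给 read_csv 的 dateparse 函数在问题中并未给出。下面只是一个假设性的示例（函数名取自上面的调用，时间格式是根据后文的示例输出推测的）：

import pandas as pd

def dateparse(s):
    # hypothetical parser for timestamps like '2013-10-22 17:20:58'
    return pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')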
Here is how I try to read it back in chunks:
这是我尝试分块读回的方式：
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', chunksize=300000):
    print(df.head(1))
Here is the error message I got:
这是我收到的错误消息:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-7-ef278566a16b> in <module>()
----> 1 for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
2 print(df.head(1))
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_hdf(path_or_buf, key, **kwargs)
321 store = HDFStore(path_or_buf, **kwargs)
322 try:
--> 323 return f(store, True)
324 except:
325
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store, auto_close)
303
304 f = lambda store, auto_close: store.select(
--> 305 key, auto_close=auto_close, **kwargs)
306
307 if isinstance(path_or_buf, string_types):
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
663 auto_close=auto_close)
664
--> 665 return it.get_result()
666
667 def select_as_coordinates(
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in get_result(self, coordinates)
1346 "can only use an iterator or chunksize on a table")
1347
-> 1348 self.coordinates = self.s.read_coordinates(where=self.where)
1349
1350 return self
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_coordinates(self, where, start, stop, **kwargs)
3545 self.selection = Selection(
3546 self, where=where, start=start, stop=stop, **kwargs)
-> 3547 coords = self.selection.select_coords()
3548 if self.selection.filter is not None:
3549 for field, op, filt in self.selection.filter.format():
C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select_coords(self)
4507 return self.coordinates
4508
-> 4509 return np.arange(start, stop)
4510
4511 # utilities ###
MemoryError:
My python environment:
我的python环境:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Edit 1:
编辑1:
It took about half an hour for the MemoryError to happen after executing read_hdf(). In the meanwhile I checked taskmgr: there was little CPU activity, and the total memory used never exceeded 2.2 GB (it was about 2.1 GB before I executed the code). So whatever pandas read_hdf() loaded into RAM was less than 100 MB. (I have 4 GB of RAM, my 32-bit Windows system can only use 2.7 GB of it, and I used the rest for a RAM disk.)
执行 read_hdf() 后，大约过了半小时才出现 MemoryError。在此期间我查看了任务管理器：CPU 几乎没有活动，总内存占用从未超过 2.2 GB（执行代码前大约是 2.1 GB）。也就是说，不管 pandas read_hdf() 往 RAM 里加载了什么，都不到 100 MB。（我有 4 GB 内存，32 位 Windows 系统只能使用其中的 2.7 GB，其余部分被我用作 RAM 盘。）
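Since what matters here is the address space of the 32-bit Python process itself (not how much physical RAM is free), a quick way to confirm which build is running is the following small check (not part of the original post):
这里起决定作用的是 32 位 Python 进程本身的地址空间（而不是还剩多少物理内存）。下面这个小检查可以确认当前运行的是 32 位还是 64 位 Python（非原帖内容）：

import platform, sys

print(platform.architecture()[0])   # '32bit' or '64bit'
print(sys.maxsize > 2**32)          # False on a 32-bit build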
Here's the hdf file info:
这是hdf文件信息:
In [2]:
hdf = pd.HDFStore('raw_sample_storage2.h5')
hdf
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: C:/RawDataCollection/raw_samples/PLB_Gate/raw_sample_storage2.h5
/raw_sample_all frame_table (typ->appendable,nrows->308581091,ncols->10,indexers->[index],dc->[time,GW_time,node_id,X,Y,Z,status,seq,rssi,lqi])
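For reference, the row count shown above (nrows->308581091) can also be read programmatically; this is the same call the accepted answer relies on below (a small sketch, assuming the store is still open):
顺带一提，上面显示的行数（nrows->308581091）也可以用代码直接读出来，这正是下面采纳答案用到的调用（小示例，假设 store 仍处于打开状态）：

nrows = hdf.get_storer('raw_sample_all').nrows   # 308581091 for this file
print(nrows)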
Moreover, I can read a portion of the HDF file by specifying 'start' and 'stop' instead of 'chunksize':
此外，我可以通过指定 'start' 和 'stop'（而不是 'chunksize'）来读取 HDF 文件的一部分：
%%time
df = pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000)
print df.info()
print(df.head(5))
The execution only took 4 seconds, and the output is:
执行只用了4秒,输出为:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 49999
Data columns (total 10 columns):
time 300000 non-null datetime64[ns]
GW_time 300000 non-null uint32
node_id 300000 non-null uint8
X 300000 non-null uint16
Y 300000 non-null uint16
Z 300000 non-null uint16
status 300000 non-null uint8
seq 300000 non-null uint8
rssi 300000 non-null int8
lqi 300000 non-null uint8
dtypes: datetime64[ns](1), int8(1), uint16(3), uint32(1), uint8(4)
memory usage: 8.9 MB
None
time GW_time node_id X Y Z status seq \
0 2013-10-22 17:20:58 39821761 3 20010 21716 22668 0 33
1 2013-10-22 17:20:58 39821824 4 19654 19647 19241 0 33
2 2013-10-22 17:20:58 39821888 1 16927 21438 22722 0 34
3 2013-10-22 17:20:58 39821952 2 17420 22882 20440 0 34
4 2013-10-22 17:20:58 39822017 3 20010 21716 22668 0 34
rssi lqi
0 -43 49
1 -72 47
2 -46 48
3 -57 46
4 -42 50
Wall time: 4.26 s
Noticing that 300000 rows only took 8.9 MB of RAM, I tried to use chunksize together with start and stop:
注意到 300000 行只占用了 8.9 MB 内存，我尝试将 chunksize 与 start 和 stop 一起使用：
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000, chunksize=3000):
    print df.info()
    print(df.head(5))
The same MemoryError happens.
发生了相同的 MemoryError。
I don't understand what's happening here. If the internal mechanism somehow ignores chunksize/start/stop and tries to load the whole thing into RAM, how come there is almost no increase in RAM usage (only about 100 MB) when the MemoryError happens? And why does the execution take half an hour just to reach an error at the very beginning of the process, without noticeable CPU usage?
我不明白这里发生了什么：如果内部机制以某种方式忽略了 chunksize/start/stop 并试图把整个文件加载进 RAM，那为什么 MemoryError 发生时 RAM 占用几乎没有增加（只有约 100 MB）？为什么在没有明显 CPU 占用的情况下，程序要运行半个小时才在流程一开始就报错？
采纳答案 by Jeff
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True. These are row numbers. In this case there is no where clause, but we still use the indexer, which in this case is simply np.arange on the list of rows.
迭代器主要是为处理 where 子句而构建的。PyTables 会返回子句为 True 的索引列表，这些都是行号。在这个例子中没有 where 子句，但我们仍然使用索引器，此时它只是对整个行号列表做一次 np.arange。
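For context, a where clause is the filter expression you can pass to select on a table written with data_columns=True, as this one was. A purely illustrative sketch (column name taken from the question, not code from the answer):
补充说明：where 子句就是可以传给 select 的过滤表达式，前提是表以 data_columns=True 写入（本例正是如此）。下面只是一个示意性的例子（列名取自问题，并非答案中的代码）：

store = pd.HDFStore('raw_sample_storage2.h5')
# PyTables evaluates the condition and returns the matching row numbers
subset = store.select('raw_sample_all', where='node_id == 3', start=0, stop=300000)
store.close()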
300 million rows take 2.2 GB, which is too much for 32-bit Windows (a process there generally maxes out around 1 GB). On 64-bit this would be no problem.
3 亿行大约需要 2.2 GB，这对 32 位 Windows（单个进程通常最多只能用到 1 GB 左右）来说太多了。在 64 位系统上这不成问题。
In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484
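Plugging in the actual row count from the store info above (308,581,091 rows, 8 bytes per int64 index) gives essentially the same figure:
代入前面 store 信息中的实际行数（308,581,091 行，每个 int64 索引 8 字节），得到的数字基本相同：

308581091 * 8 / (1024.0**3)   # ≈ 2.30 GB for the index array alone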
So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.
所以这应该通过切片语义来处理，这样只会占用很少的内存。相关 issue 已在这里提交。
So I would suggest this approach. Here the indexer is computed directly, and this provides iterator semantics.
所以我建议采用下面的做法：直接计算索引位置，同样可以得到迭代器式的语义。
In [1]: import numpy as np, pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))

In [3]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

In [4]: store = pd.HDFStore('test.h5')

In [5]: nrows = store.get_storer('df').nrows

In [6]: chunksize = 100

In [7]: for i in xrange(nrows // chunksize + 1):    # xrange: Python 2 (use range on Python 3)
   ...:     chunk = store.select('df',
   ...:                          start=i*chunksize,
   ...:                          stop=(i+1)*chunksize)
   ...:     # work on the chunk

In [8]: store.close()
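Applied to the file from the question, the same pattern would look roughly like this (a sketch only; the file name, key, and chunk size are taken from the question, and each slice should load only about as much as the 8.9 MB measured above):
套用到问题中的文件上，同样的写法大致如下（仅为示意；文件名、key 和块大小取自问题本身，每个分片占用的内存应与前面测得的 8.9 MB 相当）：

import pandas as pd

filename = 'raw_sample_storage2.h5'
key = 'raw_sample_all'

store = pd.HDFStore(filename)
nrows = store.get_storer(key).nrows        # 308581091 for the file in the question
store.close()

chunksize = 300000
for i in range(nrows // chunksize + 1):
    chunk = pd.read_hdf(filename, key,
                        start=i * chunksize,
                        stop=(i + 1) * chunksize)
    # work on the chunk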

