Iteratively writing to HDF5 stores in Pandas

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/16637271/

Date: 2020-09-13 20:50:22  Source: igfitidea

Iteratively writing to HDF5 Stores in Pandas

python, io, pandas, hdf5, pytables

Asked by Amelio Vazquez-Reina

Pandas has the following examples for how to store Series, DataFrames and Panels in HDF5 files:


Prepare some data:


In [1142]: store = HDFStore('store.h5')

In [1143]: index = date_range('1/1/2000', periods=8)

In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [1145]: df = DataFrame(randn(8, 3), index=index,
   ......:                columns=['A', 'B', 'C'])
   ......:

In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ......:            major_axis=date_range('1/1/2000', periods=5),
   ......:            minor_axis=['A', 'B', 'C', 'D'])
   ......:

Save it in a store:


In [1147]: store['s'] = s

In [1148]: store['df'] = df

In [1149]: store['wp'] = wp

Inspect what's in the store:


In [1150]: store
Out[1150]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[8,3])  
/s             series       (shape->[5])    
/wp            wide         (shape->[2,5,4])

Close the store:


In [1151]: store.close()
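A side note for current readers: Panel was removed in pandas 0.25, but the Series and DataFrame parts of the example translate directly. A minimal sketch with current pandas imports (assumes the PyTables backend, installable as the `tables` package, is available):

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])

store = pd.HDFStore('store.h5')
store['s'] = s    # each assignment writes the object into the HDF5 file
store['df'] = df
store.close()     # close() only closes the file handle
```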

Questions:


  1. In the code above, when is the data actually written to disk?

  2. Say I want to add thousands of large dataframes living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one, since I cannot afford to have them all in memory at once as they would take too much memory. Is this possible with HDF5? What would be the correct way to do it?

  3. The Pandas documentation says the following:

    "These stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."

    What does it mean by not appendable nor queryable? Also, shouldn't it say once closed instead of once written?


Accepted answer by Jeff

  1. As soon as the statement is executed, e.g. store['df'] = df. close just closes the actual file (which will be closed for you if the process exits, but will print a warning message)

  2. Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format

    It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append and create a smaller number of nodes.

    You can just iterate through your .csv files and store/append them one by one. Something like:

    for f in files:
      df = pd.read_csv(f)
      df.to_hdf('file.h5', key=f)  # key names the target node; the stray third argument was a bug
    

    Would be one way (creating a separate node for each file)

  3. Not appendable - once you write it, you can only retrieve it all at once, e.g. you cannot select a sub-section

    If you have a table, then you can do things like:

    pd.read_hdf('my_store.h5', 'a_table_node', where=['index>100'])
    

    which is like a database query, only getting part of the data

    Thus, a store is not appendable, nor queryable, while a table is both.
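The fixed-vs-table distinction above can be sketched end to end (the file name and column names here are illustrative; assumes the PyTables backend is installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10), 'B': np.arange(10) * 2})

with pd.HDFStore('demo.h5', mode='w') as store:
    # Fixed format: fast whole-object storage, but not appendable or queryable.
    store.put('fixed_node', df)

    # Table format with a data column makes the node both appendable and queryable.
    store.put('table_node', df, format='table', data_columns=['A'])
    store.append('table_node', df)                       # appending works
    subset = store.select('table_node', where='A > 7')   # partial retrieval works
```

Here `subset` holds only the rows with A > 7 from both appended copies, while the same `append` or `select` on `fixed_node` would raise an error.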


Answered by Pablo

Answering question 2, with pandas 0.18.0 you can do:


store = pd.HDFStore('compiled_measurements.h5')
for filepath in file_iterator:
    raw = pd.read_csv(filepath)
    store.append('measurements', raw, index=False)

store.create_table_index('measurements', columns=['a', 'b', 'c'], optlevel=9, kind='full')
store.close()

Based on this part of the docs.


Depending on how much data you have, the index creation can consume enormous amounts of memory. The PyTables docs describe the possible values of optlevel.

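On the loading side, memory can also be bounded by reading each large CSV in chunks rather than whole; each chunk could then be appended to the store before the next one is read. A sketch (the in-memory CSV and chunk size are stand-ins for real files):

```python
import io
import pandas as pd

# Stand-in for one large .csv file on disk.
csv_file = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

total_rows = 0
for chunk in pd.read_csv(csv_file, chunksize=2):
    # In real use: store.append('measurements', chunk, index=False)
    total_rows += len(chunk)
```

Each `chunk` is an ordinary DataFrame, so only `chunksize` rows of any one file are in memory at a time.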