Iteratively writing to HDF5 stores in Pandas

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/16637271/

Date: 2020-09-13 20:50:22  Source: igfitidea

Iteratively writing to HDF5 Stores in Pandas

python, io, pandas, hdf5, pytables

Asked by Amelio Vazquez-Reina

Pandas has the following examples for how to store Series, DataFrames and Panels in HDF5 files:


Prepare some data:


In [1142]: store = HDFStore('store.h5')

In [1143]: index = date_range('1/1/2000', periods=8)

In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [1145]: df = DataFrame(randn(8, 3), index=index,
   ......:                columns=['A', 'B', 'C'])
   ......:

In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ......:            major_axis=date_range('1/1/2000', periods=5),
   ......:            minor_axis=['A', 'B', 'C', 'D'])
   ......:

Save it in a store:


In [1147]: store['s'] = s

In [1148]: store['df'] = df

In [1149]: store['wp'] = wp

Inspect what's in the store:


In [1150]: store
Out[1150]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[8,3])  
/s             series       (shape->[5])    
/wp            wide         (shape->[2,5,4])

Close the store:


In [1151]: store.close()
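A side note for current readers: Panel was removed in pandas 0.25, but the Series and DataFrame parts of the example translate directly. A minimal sketch with current pandas imports (assumes the PyTables backend, installable as the `tables` package, is available):

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])

store = pd.HDFStore('store.h5')
store['s'] = s    # each assignment writes the object into the HDF5 file
store['df'] = df
store.close()     # close() only closes the file handle
```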

Questions:


  1. In the code above, when is the data actually written to disk?

  2. Say I want to add thousands of large dataframes living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one, since I cannot afford to have them all in memory at once as they would take too much memory. Is this possible with HDF5? What would be the correct way to do it?

  3. The Pandas documentation says the following:

    "These stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."

    What does it mean by not appendable nor queryable? Also, shouldn't it say once closed instead of once written?


Accepted answer by Jeff

  1. As soon as the statement is executed, e.g. store['df'] = df. close just closes the actual file (which will be closed for you if the process exits, but will print a warning message)

  2. Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format

    It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append and create a smaller number of nodes.

    You can just iterate through your .csv files and store/append them one by one. Something like:

    for f in files:
      df = pd.read_csv(f)
      df.to_hdf('file.h5', key=f)  # key names the target node; the stray third argument was a bug
    

    Would be one way (creating a separate node for each file)

  3. Not appendable - once you write it, you can only retrieve it all at once, e.g. you cannot select a sub-section

    If you have a table, then you can do things like:

    pd.read_hdf('my_store.h5', 'a_table_node', where=['index>100'])
    

    which is like a database query, only getting part of the data

    Thus, a store is not appendable, nor queryable, while a table is both.
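The fixed-vs-table distinction above can be sketched end to end (the file name and column names here are illustrative; assumes the PyTables backend is installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10), 'B': np.arange(10) * 2})

with pd.HDFStore('demo.h5', mode='w') as store:
    # Fixed format: fast whole-object storage, but not appendable or queryable.
    store.put('fixed_node', df)

    # Table format with a data column makes the node both appendable and queryable.
    store.put('table_node', df, format='table', data_columns=['A'])
    store.append('table_node', df)                       # appending works
    subset = store.select('table_node', where='A > 7')   # partial retrieval works
```

Here `subset` holds only the rows with A > 7 from both appended copies, while the same `append` or `select` on `fixed_node` would raise an error.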


Answered by Pablo

Answering question 2, with pandas 0.18.0 you can do:


store = pd.HDFStore('compiled_measurements.h5')
for filepath in file_iterator:
    raw = pd.read_csv(filepath)
    store.append('measurements', raw, index=False)

store.create_table_index('measurements', columns=['a', 'b', 'c'], optlevel=9, kind='full')
store.close()

Based on this part of the docs.


Depending on how much data you have, the index creation can consume enormous amounts of memory. The PyTables docs describe the possible values of optlevel.

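On the loading side, memory can also be bounded by reading each large CSV in chunks rather than whole; each chunk could then be appended to the store before the next one is read. A sketch (the in-memory CSV and chunk size are stand-ins for real files):

```python
import io
import pandas as pd

# Stand-in for one large .csv file on disk.
csv_file = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

total_rows = 0
for chunk in pd.read_csv(csv_file, chunksize=2):
    # In real use: store.append('measurements', chunk, index=False)
    total_rows += len(chunk)
```

Each `chunk` is an ordinary DataFrame, so only `chunksize` rows of any one file are in memory at a time.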