How to write a Pandas Dataframe into a HDF5 dataset

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/47165911/

Asked by AleVis
I'm trying to write data from a Pandas dataframe into a nested HDF5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file that will grow daily. I've had a go with the following code, which shows the structure of what I'd like to achieve:
import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5', 'w')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

groups = ['A', 'B', 'C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1', '2', '3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)

file.close()
This way the nested structure is created, but it loses the index and columns. I've tried

df.to_hdf('database.h5', ds, table=True, mode='a')

but it didn't work; I get this error:
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light, please? Many thanks.
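For reference, the index and columns disappear because `ds[...] = data` copies only the DataFrame's numeric values into the dataset. Staying with plain h5py, one possible workaround (a sketch, not from the original thread, writing to a temporary file) is to store the values, index, and columns as separate datasets in the group and reassemble them on read:

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

path = os.path.join(tempfile.mkdtemp(), 'database.h5')

with h5py.File(path, 'w') as f:
    grp = f.create_group('A')
    # the numeric block, plus the row/column labels stored as byte strings
    grp.create_dataset('values', data=df.to_numpy())
    grp.create_dataset('index', data=np.array(df.index, dtype='S'))
    grp.create_dataset('columns', data=np.array(df.columns, dtype='S'))

with h5py.File(path, 'r') as f:
    restored = pd.DataFrame(f['A/values'][...],
                            index=f['A/index'][...].astype(str),
                            columns=f['A/columns'][...].astype(str))
```

This keeps everything in one HDF5 file, at the cost of hand-rolling what pandas' own HDF support (below) does for you.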
Answered by MaxU
df.to_hdf() expects a string as the key parameter (the second parameter):

key : string
    identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (the key name):
In [26]: ds.name
Out[26]: '/A1'
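Putting the fix together, a minimal round-trip sketch (writing to a temporary file; note that newer pandas versions spell the table=True keyword used above as format='table'):

```python
import os
import tempfile

import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

path = os.path.join(tempfile.mkdtemp(), 'database.h5')

# the key is a plain string path into the store, not an h5py Dataset object
df.to_hdf(path, key='A/A1', format='table', mode='a')
roundtrip = pd.read_hdf(path, key='A/A1')
```

Unlike the raw h5py approach, the index and columns survive the round trip.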
Answered by AleVis
I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd

db = pd.HDFStore('Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
for m in groups:
    subgroups = ['d', 'e', 'f']
    for n in subgroups:
        db.put(m + '/' + n, df, format='table', data_columns=True)
It works: nine groups (groups in PyTables rather than datasets in h5py?) are created, from A/d to C/f. Columns and indexes are preserved, and I can do the dataframe operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group once it becomes huge in the future, i.e. operations like
db['A/d'].Col1[4:]
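On the efficiency question: because the store was written with format='table', HDFStore.select can push a where filter down to PyTables, so only the matching rows are read from disk, whereas db['A/d'] loads the whole group into memory first. A sketch assuming the same layout as above (written to a temporary file):

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

with pd.HDFStore(path) as db:
    db.put('A/d', df, format='table', data_columns=True)
    # only rows matching the where clause are read from disk
    subset = db.select('A/d', where='index >= "2000-01-05"', columns=['Col1'])
```

For a store that grows daily, this kind of query scales much better than slicing the fully loaded frame.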