How to write a Pandas Dataframe into a HDF5 dataset

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/47165911/

Asked by AleVis
I'm trying to write data from a Pandas dataframe into a nested HDF5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file that will grow daily. I've had a go with the following code, which shows the structure of what I'd like to achieve:
import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5', 'w')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

groups = ['A', 'B', 'C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1', '2', '3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)

file.close()
This way the nested structure is created, but it loses the index and columns. I've tried

df.to_hdf('database.h5', ds, table=True, mode='a')

but it didn't work; I get this error:
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light, please? Many thanks.
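For reference, the index and columns disappear because `ds[...] = data` copies only the DataFrame's numeric values into the dataset. Staying with plain h5py, one possible workaround (a sketch, not from the original thread, writing to a temporary file) is to store the values, index, and columns as separate datasets in the group and reassemble them on read:

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

path = os.path.join(tempfile.mkdtemp(), 'database.h5')

with h5py.File(path, 'w') as f:
    grp = f.create_group('A')
    # the numeric block, plus the row/column labels stored as byte strings
    grp.create_dataset('values', data=df.to_numpy())
    grp.create_dataset('index', data=np.array(df.index, dtype='S'))
    grp.create_dataset('columns', data=np.array(df.columns, dtype='S'))

with h5py.File(path, 'r') as f:
    restored = pd.DataFrame(f['A/values'][...],
                            index=f['A/index'][...].astype(str),
                            columns=f['A/columns'][...].astype(str))
```

This keeps everything in one HDF5 file, at the cost of hand-rolling what pandas' own HDF support (below) does for you.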
Answered by MaxU
df.to_hdf() expects a string as the key parameter (the second parameter):

key : string
    identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (the key name):
In [26]: ds.name
Out[26]: '/A1'
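Putting the fix together, a minimal round-trip sketch (writing to a temporary file; note that newer pandas versions spell the table=True keyword used above as format='table'):

```python
import os
import tempfile

import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

path = os.path.join(tempfile.mkdtemp(), 'database.h5')

# the key is a plain string path into the store, not an h5py Dataset object
df.to_hdf(path, key='A/A1', format='table', mode='a')
roundtrip = pd.read_hdf(path, key='A/A1')
```

Unlike the raw h5py approach, the index and columns survive the round trip.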
Answered by AleVis
I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd

db = pd.HDFStore('Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
for m in groups:
    subgroups = ['d', 'e', 'f']
    for n in subgroups:
        db.put(m + '/' + n, df, format='table', data_columns=True)
It works: nine groups (groups in PyTables rather than datasets in h5py?) are created, from A/d to C/f. Columns and indexes are preserved, and I can do the dataframe operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group once it becomes huge in the future, i.e. operations like
db['A/d'].Col1[4:]
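On the efficiency question: because the store was written with format='table', HDFStore.select can push a where filter down to PyTables, so only the matching rows are read from disk, whereas db['A/d'] loads the whole group into memory first. A sketch assuming the same layout as above (written to a temporary file):

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

with pd.HDFStore(path) as db:
    db.put('A/d', df, format='table', data_columns=True)
    # only rows matching the where clause are read from disk
    subset = db.select('A/d', where='index >= "2000-01-05"', columns=['Col1'])
```

For a store that grows daily, this kind of query scales much better than slicing the fully loaded frame.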