Save a pandas DataFrame with h5py for interoperability with other HDF5 readers

Question

Asked by Phil

Here is a sample data frame:

import pandas as pd

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I know that pandas has the PyTables-based HDFStore, which is an easy way to efficiently serialize and deserialize a data frame. But those datasets are not very easy to load directly with another reader such as h5py or MATLAB. How can I save a data frame using h5py so that I can easily load it back with another HDF5 reader?

Answer 1

Accepted answer by Jeff

The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the metadata. The format is covered in the pandas HDFStore documentation.

In [54]: df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)

In [55]: h = h5py.File('test.h5', 'r')

In [56]: h['df']['table']
Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">

In [64]: h['df']['table'][:]
Out[64]: 
array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])


In [57]: h['df']['table'].attrs.items()
Out[57]: 
[(u'CLASS', 'TABLE'),
 (u'VERSION', '2.7'),
 (u'TITLE', ''),
 (u'FIELD_0_NAME', 'index'),
 (u'FIELD_1_NAME', 'A'),
 (u'FIELD_2_NAME', 'B'),
 (u'FIELD_3_NAME', 'C'),
 (u'FIELD_0_FILL', 0),
 (u'FIELD_1_FILL', 0.0),
 (u'FIELD_2_FILL', 0.0),
 (u'FIELD_3_FILL', 0.0),
 (u'index_kind', 'integer'),
 (u'A_kind', "(lp1\nS'A'\na."),
 (u'A_meta', 'N.'),
 (u'A_dtype', 'float64'),
 (u'B_kind', "(lp1\nS'B'\na."),
 (u'B_meta', 'N.'),
 (u'B_dtype', 'float64'),
 (u'C_kind', "(lp1\nS'C'\na."),
 (u'C_meta', 'N.'),
 (u'C_dtype', 'float64'),
 (u'NROWS', 7)]

In [58]: h.close()

The data will be completely readable in any HDF5 reader. Some of the meta-data is pickled, so care must be taken.
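
As a rough sketch of the last step on the reading side: once h['df']['table'][:] has returned a structured array like the one in Out[64], it can be turned back into a DataFrame without pandas' HDFStore. The helper name table_to_frame is hypothetical, and the records array below is typed in by hand to mimic that output rather than read from a file.

```python
import numpy as np
import pandas as pd


def table_to_frame(records, index_field="index"):
    """Build a DataFrame from a structured array, using one field as the index.

    Hypothetical helper: 'records' stands in for h['df']['table'][:].
    """
    return pd.DataFrame.from_records(records, index=index_field)


nan = float("nan")
# Hand-written stand-in for the first rows of the array shown in Out[64]
records = np.array(
    [(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5)],
    dtype=[("index", "<i8"), ("A", "<f8"), ("B", "<f8"), ("C", "<f8")],
)

frame = table_to_frame(records)
print(frame)
```

The pickled attributes (A_kind and so on) are not needed for this; the field names and dtypes travel with the compound dataset itself.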

Answer 2

Answered by Phil

Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.

First, define a function that makes a numpy structured array (not a record array) from a pandas DataFrame.

import numpy as np

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally similar to np.asarray(df.to_records(index=False)),
    but builds a plain structured array directly.

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    # One (name, type) pair per column; field names must be str in Python 3
    types = [(str(k), df[k].dtype.type) for k in df.columns]
    dtype = np.dtype(types)
    z = np.zeros(len(df), dtype)
    # Copy each column into the field of the same name
    for k in df.columns:
        z[str(k)] = df[k].to_numpy()
    return z
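
The distinction between a structured array and a record array mentioned above can be seen in a small sketch. The field names and values here are made up for illustration; the only difference shown is that a record array adds attribute-style access on top of the same data.

```python
import numpy as np

# A plain structured array: fields are accessed by name with [] only
sa = np.zeros(2, dtype=[("ID", "<i8"), ("A", "<f8")])
sa["ID"] = [1, 2]
sa["A"] = [0.1, 0.2]

# A record array is a view of the same memory that also allows ra.ID
ra = sa.view(np.recarray)

print(type(sa).__name__, sa["ID"])
print(type(ra).__name__, ra.ID)
```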

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.

sa = df_to_sarray(df.reset_index())
sa

array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Save that structured array to an hdf5 file.

import h5py
with h5py.File('mydata.h5', 'w') as hf:
    hf['df'] = sa

Load the h5 dataset

with h5py.File('mydata.h5', 'r') as hf:
    sa2 = hf['df'][:]

Extract the ID column and delete it from sa2

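import numpy.lib.recfunctions as nprec

ID = sa2['ID']
# usemask=False returns a plain ndarray instead of a masked array
sa2 = nprec.drop_fields(sa2, 'ID', usemask=False)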

Make data frame with index ID using sa2

df2 = pd.DataFrame(sa2, index=ID)
df2.index.name = 'ID'

print(df2)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
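
The steps above can be folded into a pair of helpers, sketched here under a few assumptions: save_frame and load_frame are hypothetical names, the columns are all numeric, and the file goes to a temporary directory. Loading uses pd.DataFrame.from_records with an index field instead of drop_fields, which is a substitution for the approach shown above, not a restatement of it.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def save_frame(df, path, name="df"):
    """Write df (index included) as one compound HDF5 dataset."""
    flat = df.reset_index()
    dtype = np.dtype([(str(c), flat[c].dtype) for c in flat.columns])
    sa = np.zeros(len(flat), dtype)
    for c in flat.columns:
        sa[str(c)] = flat[c].to_numpy()
    with h5py.File(path, "w") as hf:
        hf.create_dataset(name, data=sa)


def load_frame(path, index, name="df"):
    """Read the compound dataset back and restore the index column."""
    with h5py.File(path, "r") as hf:
        sa = hf[name][:]
    return pd.DataFrame.from_records(sa, index=index)


nan = float("nan")
df = pd.DataFrame(
    {"A": [nan, 0.1], "B": [0.2, nan]},
    index=pd.Index([1, 2], name="ID"),
)
path = os.path.join(tempfile.mkdtemp(), "roundtrip.h5")
save_frame(df, path)
df2 = load_frame(path, index="ID")
print(df2)
```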

Answer 3

Answered by iipr

In case it is helpful to anyone, I took this post from Guillaume and Phil, and changed it a bit for my needs with the help of ankostis. We read the pandas DataFrame from a CSV file.

Mainly I adapted it for strings, because you cannot store an object in an HDF5 file (I believe). First, check which column types are numpy objects. Then find the longest string in each such column, and fix that column to be a string of that length. The rest is quite similar to the other post.

import numpy as np

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    Also, for every column of a str type, convert it into
    a 'bytes' str literal of length = max(len(col)).

    :param df: the data frame to convert
    :return: a tuple of (structured array representation of df, its dtype)
    """

    def make_col_type(col_type, col):
        try:
            if 'numpy.object_' in str(col_type.type):
                maxlens = col.dropna().str.len()
                if maxlens.any():
                    # Fixed-width bytes field sized to the longest string
                    maxlen = int(maxlens.max())
                    col_type = 'S%d' % maxlen
                else:
                    # No non-empty strings in the column: fall back to float16
                    col_type = 'f2'
            return str(col.name), col_type
        except Exception:
            print(col.name, col_type, type(col))
            raise

    v = df.values
    types = df.dtypes
    numpy_struct_types = [make_col_type(types[col], df.loc[:, col]) for col in df.columns]
    dtype = np.dtype(numpy_struct_types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        # This is in case you have problems with the encoding, remove the if branch if not
        try:
            if dtype[i].str.startswith('|S'):
                z[k] = df[k].str.encode('latin').astype('S')
            else:
                z[k] = v[:, i]
        except Exception:
            print(k, v[:, i])
            raise

    return z, dtype
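
The string handling in make_col_type can be looked at in isolation. This is a minimal sketch with a made-up column: it measures the longest string, builds the matching fixed-width bytes type, and encodes the values. Replacing missing values with an empty string is an assumption of this sketch, not something the function above does.

```python
import numpy as np
import pandas as pd

# Made-up example column containing one missing value
names = pd.Series(["alice", "bob", None, "charlotte"], name="name")

# The longest non-missing string decides the width of the bytes field
maxlen = int(names.dropna().str.len().max())
field_type = "S%d" % maxlen

# Assumption: missing values become empty strings before encoding
encoded = names.fillna("").str.encode("latin-1").to_numpy().astype(field_type)

print(field_type)
print(encoded)
```

Because latin-1 uses one byte per character, the character count is also the byte count; with a multi-byte encoding such as UTF-8 the width would have to be measured after encoding.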

So the workflow would be:

import h5py
import pandas as pd

# Read a CSV file
# Here we assume col_dtypes is a dictionary that contains the dtypes of the columns
df = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)
# Transform the DataFrame into a structured numpy array and get the dtype
sa, saType = df_to_sarray(df)

# Open/create the HDF5 file
f = h5py.File('test.hdf5', 'a')
# Save the structured array
f.create_dataset('someData', data=sa, dtype=saType)
# Retrieve it and check it is ok when you transform it into a pandas DataFrame
sa2 = f['someData'][:]
df2 = pd.DataFrame(sa2)
print(df2.head())
f.close()

Also, this way you are able to view the data in HDFView, even when using gzip compression, for instance.
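
For illustration, here is a small sketch of writing a structured array with gzip compression through h5py. The array contents, the dataset name someData, the compression level, and the temporary file path are all assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up structured array with an integer, a fixed-width bytes and a float field
sa = np.zeros(1000, dtype=[("ID", "<i8"), ("name", "S8"), ("value", "<f8")])
sa["ID"] = np.arange(1000)
sa["name"] = b"sample"

path = os.path.join(tempfile.mkdtemp(), "compressed.hdf5")

# gzip is built into HDF5, so other readers can decompress it transparently
with h5py.File(path, "w") as f:
    f.create_dataset("someData", data=sa, compression="gzip", compression_opts=4)

with h5py.File(path, "r") as f:
    dset = f["someData"]
    compression = dset.compression
    restored = dset[:]

print(compression, restored.dtype)
```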