Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23220513/

Time: 2020-08-19 02:31:55  Source: igfitidea

Storing a list of strings to a HDF5 Dataset from Python

Tags: python, hdf5, h5py

Asked by gman

I am trying to store a variable-length list of strings in an HDF5 dataset. The code for this is:

import h5py
h5File = h5py.File('xxx.h5', 'w')
strList = ['asas', 'asas', 'asas']
# This is the call that raised the TypeError for the asker.
h5File.create_dataset('xxx', (len(strList), 1), 'S10', strList)
h5File.flush()
h5File.close()

I am getting the error "TypeError: No conversion path for dtype: dtype('<U3')".
How can I solve this problem?

Accepted answer by SlightlyCuban

You're reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.
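
To make the mismatch concrete, here is a minimal sketch (assuming NumPy is installed; the variable names are illustrative and not from the original answer). It shows that a list of Python 3 `str` objects becomes a Unicode array, while the `'S10'` dtype requested in the question describes byte strings:

```python
import numpy as np

str_list = ['asas', 'asas', 'asas']

# Python 3 str objects become a NumPy Unicode array (dtype kind 'U').
unicode_arr = np.array(str_list)

# Encoding each element first yields a byte-string array (dtype kind 'S'),
# which is the kind of data a fixed-length dtype such as 'S10' describes.
bytes_arr = np.array([s.encode('ascii') for s in str_list])

print(unicode_arr.dtype.kind, unicode_arr.dtype.itemsize)  # U 16
print(bytes_arr.dtype.kind, bytes_arr.dtype.itemsize)      # S 4
```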

You'll need to encode the strings in a format h5py handles:

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList)

Note: not everything encoded in UTF-8 can be encoded in ASCII!
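
The following standard-library sketch illustrates that caveat. The helper `encode_ascii` and the sample strings are hypothetical, introduced only for this illustration. With `errors="ignore"` any non-ASCII character is silently dropped, whereas the default strict mode raises an exception:

```python
def encode_ascii(strings, errors="strict"):
    """Encode every str in `strings` to ASCII bytes (hypothetical helper)."""
    return [s.encode("ascii", errors) for s in strings]


names = ["asas", "caf\u00e9"]  # the second entry ends in a non-ASCII 'é'

# "ignore" silently drops characters that ASCII cannot represent.
lossy = encode_ascii(names, errors="ignore")
print(lossy)  # [b'asas', b'caf']

# The default strict mode refuses to lose data and raises instead.
try:
    encode_ascii(names)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True

# UTF-8 keeps every character, at the cost of more bytes per character.
utf8 = [s.encode("utf-8") for s in names]
print(utf8)  # [b'asas', b'caf\xc3\xa9']
```

If silent data loss is not acceptable, it may be safer to let the strict mode fail and handle the offending strings explicitly. Keep in mind that a fixed-length dtype such as 'S10' limits the number of bytes, not characters.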

Answer by yardstick17

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str)

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)

In [29]: dset[0] = 'the change of water into water vapour'

In [30]: dset[0]
Out[30]: 'the change of water into water vapour'
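
As a self-contained sketch of the same idea applied to a whole list, the script below writes a list of strings as a variable-length string dataset and reads it back. The temporary file path, the dataset name, and the helper `as_text` are assumptions made for this illustration. Depending on the h5py version, the values read back may be `bytes` rather than `str`, so the helper normalises both cases:

```python
import os
import tempfile

import h5py
import numpy as np


def as_text(value):
    """Return a str whether h5py handed back bytes or str (hypothetical helper)."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


str_list = ["asas", "a considerably longer string", "x"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vlen_demo.h5")

    # Same variable-length string dtype as in the session above.
    dt = h5py.special_dtype(vlen=str)

    with h5py.File(path, "w") as f:
        # An object array of Python strings is stored as VL strings.
        f.create_dataset("vlen_str", data=np.array(str_list, dtype=object), dtype=dt)

    with h5py.File(path, "r") as f:
        restored = [as_text(v) for v in f["vlen_str"][:]]

print(restored == str_list)
```

Because each element has its own length, no string is truncated or padded, unlike with a fixed-length dtype such as 'S10'.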

Answer by Rajendra Koppula

I was in a similar situation: I wanted to store the column names of a DataFrame as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works:

import h5py
h5File = h5py.File('my_file.h5', 'w')
# astype('S') converts the names to fixed-length byte strings
h5File['col_names'] = df.columns.values.astype('S')

This assumes the column names are 'simple' strings that can be encoded in ASCII.
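
A round-trip sketch of this approach follows. To keep it free of a pandas dependency, a plain list stands in for df.columns; the column names and the temporary file location are assumptions for this illustration. It writes the names as fixed-length byte strings and decodes them again after reading:

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for df.columns (hypothetical column names).
col_names = ["open", "close", "volume"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_file.h5")

    with h5py.File(path, "w") as f:
        # astype('S') turns the Unicode array into fixed-length byte strings.
        f["col_names"] = np.array(col_names).astype("S")

    with h5py.File(path, "r") as f:
        stored = f["col_names"][:]
        stored_kind = stored.dtype.kind
        # Decode the byte strings back to Python str.
        restored = [b.decode("ascii") for b in stored]

print(stored_kind)  # S
print(restored)     # ['open', 'close', 'volume']
```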