pandas: Storing a pandas DataFrame with mixed data and categories in HDF5

Question

Asked by AnnetteC

I want to store a DataFrame with columns of different types in an HDF5 file (an excerpt with the data types is shown below).

In  [1]: mydf
Out [1]:
endTime             uint32
distance           float16
signature         category
anchorName        category
stationList         object

Before converting some columns to category (signature and anchorName in the excerpt above), I used code like the following to store it, which worked fine:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')

But this does not work with category columns, so I tried the following:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')

This works fine if I remove the column stationList, in which each entry is a list of strings. With this column, I get the following exception:

Cannot serialize the column [stationList] because
its data contents are [mixed] object dtype

How do I need to change my code to get the DataFrame stored?

pandas version: 0.17.1
python version: 2.7.6 (cannot change it due to compatibility reasons)

edit1 (some sample code):

import pandas as pd

mydf = pd.DataFrame({'endTime' : pd.Series([1443525810,1443540836,1443609470]),
                    'distance' : pd.Series([454.75,477.25,242.12]),
                    'signature' : pd.Series(['ab','cd','ab']),
                    'anchorName' : pd.Series(['tec','ing','pol']),
                    'stationList' : pd.Series([['t1','t2','t3'],['4','t2','t3'],['t3','t2','t4']])
                    })

# this works fine (no category)
mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# this crashes now because of category data
# mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

# switching to format='t'   
# this caused problems because of "mixed data" in column stationList
mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

mydf.pop('stationList')

# this again works fine
mydf.to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

edit2: In the meantime I tried different things to get rid of this problem. One of them was to convert the entries of column stationList to tuples (possible since they will not be changed) and then to convert the column to category as well. But it did not change anything. Here are the lines I added after the conversion loop (just for completeness):

mydf.stationList = [tuple(x) for x in mydf.stationList.values]
mydf.stationList.astype('category')

Answer 1

Answered by Christian Hudon

You have two problems:

1. You want to store categorical data in an HDF5 file;
2. You're trying to store arbitrary objects (i.e. stationList) in an HDF5 file.

As you discovered, categorical data is (currently?) only supported in the "table" format for HDF5.

However, storing arbitrary objects (lists of strings, etc.) is really not something that the HDF5 format itself supports. Pandas works around that for you by serializing these objects using pickle and then storing the pickle as an arbitrary-length string (which is not supported by all HDF5 formats, I think). But that will be slow and inefficient, and will never be well supported by HDF5.
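To see why the column is reported as "mixed", it can help to inspect what pandas infers for it. The following is a minimal sketch, assuming a reasonably recent pandas where `pandas.api.types.infer_dtype` is available (it is not in 0.17.1); the variable names are only illustrative.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Same shape of data as the stationList column in the question.
station_lists = pd.Series([['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']])
plain_strings = pd.Series(['ab', 'cd', 'ab'])

# A column of Python lists is held as generic objects.
list_kind = infer_dtype(station_lists)

# A column of plain strings is recognised as string data.
string_kind = infer_dtype(plain_strings)

print(station_lists.dtype, list_kind, string_kind)
```

The inferred kind for the list column corresponds to the `[mixed] object dtype` wording in the exception: pandas cannot map the entries to a single storable type.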

In my mind, you have two options:

1. Pivot your data so you have one row of data per station name. Then you can store everything in a table-format HDF5 file. (This is a good practice in general; see Hadley Wickham on Tidy Data.)
2. If you really want to keep this format, then you might as well save the whole dataframe using to_pickle(). This will have no problem dealing with any kind of object (e.g. list of strings, etc.) you throw at it.
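The pivot from the first option could be sketched roughly as follows, using the sample data from the question. This is only one possible layout, not necessarily what the answer had in mind: the column names `journey`, `position` and `station`, the helper `store_long` and the file name `tmp_long.hdf5` are assumptions made for illustration. Plain loops are used so the reshaping does not rely on newer helpers such as `DataFrame.explode`.

```python
import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})

# One output row per (journey, station) pair.
rows = []
for journey, stations in zip(mydf.index, mydf['stationList']):
    for position, station in enumerate(stations):
        rows.append({'journey': journey, 'position': position, 'station': station})
stations_df = pd.DataFrame(rows, columns=['journey', 'position', 'station'])

# Attach the per-journey columns to every station row.
journeys = mydf.drop('stationList', axis=1)
journeys['journey'] = journeys.index
long_df = stations_df.merge(journeys, on='journey')

# Every column is now a scalar type, so categories can be used freely.
for col in ['signature', 'anchorName', 'station']:
    long_df[col] = long_df[col].astype('category')


def store_long(df, path='tmp_long.hdf5', key='journeys'):
    """Hypothetical helper: write the long-form frame in table format (needs PyTables)."""
    df.to_hdf(path, key=key, mode='w', format='table', complevel=9, complib='bzip2')
```

Calling `store_long(long_df)` would then write the file. The original per-journey lists can be rebuilt after loading by grouping on `journey` and sorting by `position`.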

Personally, I would recommend option 1. You get to use a fast, binary file format. And the pivot will also make other operations with your data easier.
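If the original layout has to be kept instead (the second option), a pickle round trip might look like the sketch below. The temporary directory and the file name `journeys.pkl` are assumptions for illustration. Pickle files should only be loaded from trusted sources, and they are not guaranteed to be readable across very different pandas versions.

```python
import os
import tempfile

import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'journeys.pkl')

# Categories and list-valued cells are both pickled as-is.
mydf.to_pickle(path)
restored = pd.read_pickle(path)
```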
