Reading an HDF5 file into a pandas DataFrame with conditions

Question

Asked by codeKiller

I have a huge HDF5 file. I want to load part of it into a pandas DataFrame to perform some operations, but only the rows that pass a filter.

I can explain better with an example:

The original HDF5 file looks something like:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A == 1 or 3 or 4.

So far I can only load the whole HDF5 file using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

Answer 1

Accepted answer by unutbu

The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value of column A is 1, 3 or 4. For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
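
When every column is written as a data column (data_columns=True), conditions on several columns can be combined in one where string. The following is a minimal sketch, assuming PyTables is installed; the temporary file location and the extra condition B == 1 are chosen purely for illustration and are not part of the original question.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

with tempfile.TemporaryDirectory() as tmpdir:
    # Illustrative path inside a throwaway directory.
    path = os.path.join(tmpdir, 'out.h5')

    # data_columns=True makes every column usable in a where expression.
    df.to_hdf(path, key='results_table', mode='w', data_columns=True,
              format='table')

    # Parenthesise each condition and join them with & (and) or | (or).
    subset = pd.read_hdf(path, key='results_table',
                         where='(A in [1, 3, 4]) & (B == 1)')

print(subset)
```

Only the matching rows are read from disk, which is the point of filtering at read time for a file too large to load whole.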

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
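
As a sketch of how that formatted string might be built and used end to end: the values in vals below are hypothetical, and the example assumes they start out as a NumPy array. The array is converted with .tolist() first, because formatting the array directly produces '[1 3 4]' (no commas), which is not a valid list in a where expression.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Hypothetical values to keep; .tolist() yields plain Python ints so the
# formatted text is a comma-separated list literal.
vals = np.array([1, 3, 4]).tolist()
where = 'A in {}'.format(vals)
print(where)

with tempfile.TemporaryDirectory() as tmpdir:
    # Illustrative path inside a throwaway directory.
    path = os.path.join(tmpdir, 'out.h5')
    df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
              format='table')
    selected = pd.read_hdf(path, key='results_table', where=where)

print(selected)
```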

Answer 2

Answer by Dean Fenster

You can do this using pandas.read_hdf, with its optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where=['index>2'])
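
A runnable sketch of an index-based filter like this one, assuming PyTables is installed. The file name and key are reused from the line above, but the file is written to a temporary directory here, and the sample data is borrowed from the question for illustration. The index of a table-format store can be queried without declaring any data columns.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

with tempfile.TemporaryDirectory() as tmpdir:
    # Illustrative location; the file still has to be in table format.
    path = os.path.join(tmpdir, 'store_tl.h5')
    df.to_hdf(path, key='table', mode='w', format='table')

    # Keep only the rows whose index label is greater than 2.
    tail = pd.read_hdf(path, key='table', where='index > 2')

print(tail)
```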

