Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/33451926/

Time: 2020-09-14 00:08:03  Source: igfitidea

Read an HDF5 file into a pandas DataFrame with conditions

python pandas hdf5

Asked by codeKiller

I have a huge HDF5 file. I want to load part of it into a pandas DataFrame to perform some operations, keeping only the rows that match a condition.

I can explain better with an example:

The original HDF5 file looks something like this:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3 or 4.

So far I can only load the whole HDF5 file using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

Accepted answer by unutbu

The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value of column A is 1, 3 or 4. For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
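
To illustrate the format requirement stated at the top of this answer, here is a minimal sketch that writes the same small frame once in fixed format and once in table format, then applies the same where query to both. The temporary file names and the variable fixed_ok are illustrative assumptions, and the exact exception type raised for a fixed-format store may differ between pandas versions, so the sketch catches both TypeError and ValueError.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [0, 0, 1, 1]})

# Illustrative temporary paths, not part of the original answer.
tmpdir = tempfile.mkdtemp()
fixed_path = os.path.join(tmpdir, 'fixed.h5')
table_path = os.path.join(tmpdir, 'table.h5')

# Same data, two storage formats.
df.to_hdf(fixed_path, key='results_table', mode='w', format='fixed')
df.to_hdf(table_path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# A fixed-format store has to be read in its entirety, so a where
# query is expected to be rejected.
try:
    pd.read_hdf(fixed_path, key='results_table', where='A in [1,3,4]')
    fixed_ok = True
except (TypeError, ValueError) as exc:
    fixed_ok = False
    print('fixed format rejected the query:', exc)

# The table-format store with A as a data column accepts the query.
subset = pd.read_hdf(table_path, key='results_table', where='A in [1,3,4]')
print(subset)
```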


If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
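
One detail worth checking with this approach: the formatted string needs to look like a Python list with commas. If vals happens to be a NumPy array, formatting it directly gives something like '[1 3 4]', so converting it with tolist() first is probably safer. A minimal sketch, assuming a small table-format file at an illustrative temporary path:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 1], 'B': [0, 0, 1, 1, 1, 0]})

# Illustrative temporary path, not part of the original answer.
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# vals could come from anywhere; here it is a NumPy array on purpose.
vals = np.array([1, 3, 4])

# tolist() yields plain Python ints, so the formatted string is a
# comma-separated list literal that the query parser can read.
where = 'A in {}'.format(vals.tolist())

subset = pd.read_hdf(path, key='results_table', where=where)
print(where)
print(subset)
```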

Answer by Dean Fenster

You can do this using pandas.read_hdf with the optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where = ['index>2'])
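
Since the question opens the file with pd.HDFStore, the same kind of condition can also be passed to HDFStore.select. The following is a minimal sketch, assuming the file was written in table format with A as a data column; the temporary path and the variable names by_value and by_index are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 1],
                   'C': [34, 32, 35, 34, 31, 34]})

# Illustrative temporary path, not part of the original answers.
path = os.path.join(tempfile.mkdtemp(), 'results.h5')
df.to_hdf(path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# The context manager closes the store when the block exits.
with pd.HDFStore(path, mode='r') as store:
    # Filter on a data column.
    by_value = store.select('results_table', where='A in [1,3,4]')
    # Filter on the index, which is queryable in table format.
    by_index = store.select('results_table', where='index>2')

print(by_value)
print(by_index)
```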