Original question: http://stackoverflow.com/questions/33451926/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
read HDF5 file to pandas DataFrame with conditions
Asked by codeKiller
I have a huge HDF5 file. I want to load part of it into a pandas DataFrame to perform some operations, keeping only the rows that match a condition.
An example explains it better. The original HDF5 file looks something like this:
A B C D
1 0 34 11
2 0 32 15
3 1 35 22
4 1 34 15
5 1 31 9
1 0 34 15
2 1 29 11
3 0 34 15
4 1 12 14
5 0 34 15
1 0 32 13
2 1 34 15
etc etc etc etc
What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3, or 4.
So far I can only load the whole HDF5 file, using:
store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])
I do not see how to include a where condition here.
Accepted answer by unutbu
The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with the where argument of pd.read_hdf.
Furthermore, column A must be declared as a data column (via data_columns):
# key is passed by keyword; recent pandas versions deprecate passing it positionally
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')
Or, to declare all columns as (queryable) data columns:
# data_columns=True makes every column queryable
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')
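Since the question opens the file through pd.HDFStore, the same requirement can also be met from the store side. The following is a minimal sketch, not the answerer's code: it assumes the data arrives in pieces (the two small chunk frames and the temporary path are made up for illustration) and relies on HDFStore.append, which writes in table format.

```python
import os
import tempfile

import pandas as pd

# Hypothetical location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'out.h5')

# Two made-up pieces of a larger table, with a continuing index.
chunks = [
    pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 1]}, index=[0, 1, 2]),
    pd.DataFrame({'A': [4, 5, 1], 'B': [1, 1, 0]}, index=[3, 4, 5]),
]

with pd.HDFStore(path, mode='w') as store:
    for chunk in chunks:
        # append() stores the data in table format; data_columns makes A queryable.
        store.append('results_table', chunk, data_columns=['A'])
    # Query only the rows where A is 1, 3 or 4.
    result = store.select('results_table', where='A in [1, 3, 4]')

print(result)
```

Writing with append also means a large table does not have to be held in memory as a single DataFrame before it is saved.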
Then you could use
pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')
to select the rows where the value of column A is 1, 3 or 4. For example,
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})
# Write in table format with A as a queryable data column
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')
# Read back only the rows where A is 1, 3 or 4
print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))
yields
    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
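The same query can be run through pd.HDFStore, which is how the question opens the file. This is a sketch under assumptions rather than part of the original answer: the temporary path and the chunk size of 3 are arbitrary choices for illustration. HDFStore.select accepts the same where string, and passing chunksize makes it return an iterator of DataFrames instead of one large frame, which may help when even the filtered result is big.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Hypothetical location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
          format='table')

with pd.HDFStore(path, mode='r') as store:
    # One-shot query: same where string as with pd.read_hdf.
    subset = store.select('results_table', where='A in [1, 3, 4]')

    # Chunked query: iterate over the matching rows a few at a time.
    pieces = []
    for piece in store.select('results_table', where='A in [1, 3, 4]',
                              chunksize=3):
        pieces.append(piece)  # in practice, process each piece here

combined = pd.concat(pieces)
print(subset)
```

The chunks must be consumed while the store is still open, which is why the loop sits inside the with block.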
If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:
where='A in {}'.format(vals)
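To make the string formatting concrete, here is a small sketch of what the composed expression looks like. The contents of vals are made up for illustration, and vals is assumed to be a plain Python list.

```python
# Hypothetical list of wanted values; in practice this could be much longer.
vals = [1, 3, 4]

# str.format inserts the list's repr, producing a valid query expression.
where = 'A in {}'.format(vals)
print(where)  # A in [1, 3, 4]
```

The resulting string is what gets passed as the where argument of pd.read_hdf. If vals is a NumPy array rather than a list, converting it first with vals.tolist() is likely needed, because an array's string form omits the commas between elements.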