pandas - How to save only selected columns of a DataFrame to HDF5

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/27878780/

Tags: python, pandas, hdf5, hdfstore

Asked by Fabio Lamanna

I'm reading a sample CSV file and storing it in an .h5 database. The .csv is structured as follows:

User_ID;Longitude;Latitude;Year;Month;String
267261661;-3.86580025;40.32170825;2013;12;hello world
171255468;-3.83879575;40.05035005;2013;12;hello world
343588169;-3.70759531;40.4055946;2014;2;hello world
908779052;-3.8356385;40.1249459;2013;8;hello world
289540518;-3.6723114;40.3801642;2013;11;hello world
635876313;-3.8323166;40.3379393;2012;10;hello world
175160914;-3.53687933;40.35101274;2013;12;hello world 
155029860;-3.68555076;40.47688417;2013;11;hello world

I'm writing it to an .h5 store with pandas to_hdf, intending to pass only a couple of columns to the .h5 file:

import pandas as pd

# filename holds the path to the sample CSV, without the .csv extension
df = pd.read_csv(filename + '.csv', sep=';')

df.to_hdf('test.h5', key='key1', format='table', data_columns=['User_ID', 'Year'])

Inspecting the columns stored in the .h5 file with HDFStore and with read_hdf gives different results. In particular:

>>> store = pd.HDFStore('test.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: /test.h5
/key1            frame_table  (typ->appendable,nrows->8,ncols->6,indexers->[index],dc->[User_ID,Year])

This is what I expect (only the 'User_ID' and 'Year' columns stored in the database), although ncols->6 means that all the columns have actually been stored in the .h5 file.

If I try reading the file with pd.read_hdf:

hdf = pd.read_hdf('test.h5','key1')

and ask for the keys:

>>> hdf.keys()
Index([u'User_ID', u'Longitude', u'Latitude', u'Year', u'Month', u'String'], dtype='object')

This is not what I expected, since all columns of the original .csv file are still in the .h5 database. How can I store only a selection of columns in the .h5 file in order to reduce the size of the database?

Thanks for your help.

Answered by Paul H

Just select the columns you want as you write to the file.

cols_to_keep = ['User_ID', 'Year']
df[cols_to_keep].to_hdf(...)
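
For anyone who wants to try this end to end, here is a minimal sketch of the same idea, assuming PyTables (the tables package) is installed alongside pandas. The helper names save_selected_columns and stored_columns and the SAMPLE_CSV constant are made up for illustration and are not part of pandas or of the original answer. The sketch subsets the frame before calling to_hdf, then reads the file back to check which columns were stored. As far as I understand it, data_columns only marks columns as queryable on disk and does not filter what gets written, which would explain the ncols->6 in the question.

```python
import io
import os
import tempfile

import pandas as pd

# First rows of the sample CSV from the question.
SAMPLE_CSV = """User_ID;Longitude;Latitude;Year;Month;String
267261661;-3.86580025;40.32170825;2013;12;hello world
171255468;-3.83879575;40.05035005;2013;12;hello world
343588169;-3.70759531;40.4055946;2014;2;hello world
"""


def save_selected_columns(df, path, key, columns):
    """Write only `columns` of `df` to the HDF5 file at `path` (hypothetical helper)."""
    # Subset first: to_hdf stores every column of the frame it is called on.
    # data_columns makes these columns queryable; it does not drop the others.
    df[columns].to_hdf(path, key=key, mode="w", format="table", data_columns=columns)


def stored_columns(path, key):
    """Return the column names stored under `key` (hypothetical helper)."""
    return list(pd.read_hdf(path, key=key).columns)


if __name__ == "__main__":
    df = pd.read_csv(io.StringIO(SAMPLE_CSV), sep=";")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.h5")
        save_selected_columns(df, path, "key1", ["User_ID", "Year"])
        # Only the selected columns should come back from the file.
        print(stored_columns(path, "key1"))
        # Year is a data column, so it can be used in an on-disk query.
        print(pd.read_hdf(path, key="key1", where="Year == 2013"))
```

If the unwanted columns are never needed in memory either, passing usecols=['User_ID', 'Year'] to pd.read_csv is another option, since it skips them at load time.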