Notice: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/46206125/

Time: 2020-09-14 04:26:38  Source: igfitidea

Append data to HDF5 file with Pandas, Python

python pandas dataframe hdf5

Asked by Karl

I have large pandas DataFrames with financial data. I have no problem appending and concatenating additional columns and DataFrames to my .h5 file.

The financial data is updated every minute, so I need to append a row of data to all of the existing tables inside my .h5 file every minute.

Here is what I have tried so far, but no matter what I do, it overwrites the .h5 file instead of appending to it.

HDFStore way:

#we open the hdf5 file
save_hdf = HDFStore('test.h5') 

ohlcv_candle.to_hdf('test.h5')

#we give the dataframe a key value
#format=table so we can append data
save_hdf.put('name_of_frame',ohlcv_candle, format='table',  data_columns=True)

#we print our dataframe by calling the hdf file with the key
#just doing this as a test
print(save_hdf['name_of_frame'])    


The other way I have tried it, to_hdf:

#format=t so we can append data , mode=r+ to specify the file exists and
#we want to append to it
tohlcv_candle.to_hdf('test.h5',key='this_is_a_key', mode='r+', format='t')

#again just printing to check if it worked 
print(pd.read_hdf('test.h5', key='this_is_a_key'))


Here is what one of the DataFrames looks like after being read_hdf:

           time     open     high      low    close     volume           PP  
0    1505305260  3137.89  3147.15  3121.17  3146.94   6.205397  3138.420000   
1    1505305320  3146.86  3159.99  3130.00  3159.88   8.935962  3149.956667   
2    1505305380  3159.96  3160.00  3159.37  3159.66   4.524017  3159.676667   
3    1505305440  3159.66  3175.51  3151.08  3175.51   8.717610  3167.366667   
4    1505305500  3175.25  3175.53  3170.44  3175.53   3.187453  3173.833333  

The next time I get data (every minute), I would like one row of it added at index 5 of all my columns, then at 6, then 7, and so on, without having to read and manipulate the entire file in memory, as that would defeat the point of doing this. If there is a better way of solving this, do not be shy to recommend it.

P.S. Sorry for the formatting of the table here.

Answered by MaxU

pandas.HDFStore.put() has a parameter append, which defaults to False; that default instructs pandas to overwrite instead of appending.

So try this:

store = pd.HDFStore('test.h5')

store.append('name_of_frame', ohlcv_candle, format='t',  data_columns=True)

We can also use store.put(..., append=True), but the existing data in the file must also have been created in table format:

store.put('name_of_frame', ohlcv_candle, format='t', append=True, data_columns=True)

NOTE: appending works only for the table format (format='t' is an alias for format='table').
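
The append call above can be wrapped into a small helper that is run once per minute. This is only a sketch under stated assumptions: the function names append_candle and read_last_rows, and the sample values in the usage note, are hypothetical and not part of the original answer. PyTables must be installed, and every appended frame is assumed to have the same columns and dtypes as the stored table. The second helper reads back only the newest rows, so the whole file never has to be loaded.

```python
import pandas as pd


def append_candle(path, key, candle):
    """Append the rows of `candle` to the table stored under `key`.

    The table is created on the first call; later calls add rows to it.
    (Hypothetical helper, not part of pandas.)
    """
    # The context manager closes the file even if the append raises.
    with pd.HDFStore(path, mode="a") as store:
        store.append(key, candle, format="table", data_columns=True)


def read_last_rows(path, key, n=1):
    """Read only the last `n` rows of the table instead of the whole file."""
    with pd.HDFStore(path, mode="r") as store:
        nrows = int(store.get_storer(key).nrows)
        return store.select(key, start=max(nrows - n, 0))
```

A call such as append_candle('test.h5', 'name_of_frame', new_row), where new_row is a one-row DataFrame with the same columns as the stored table, adds that row; read_last_rows('test.h5', 'name_of_frame') then returns just that row.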

Answered by Nikhil VJ

tohlcv_candle.to_hdf('test.h5',key='this_is_a_key', append=True, mode='r+', format='t')

You need to pass another argument, append=True, to specify that the data is to be appended to any existing data found under that key instead of overwriting it.

Without this, the default is False, and if it encounters an existing table under 'this_is_a_key' then it overwrites it.

The mode= argument applies only at file level: it tells whether the file as a whole is to be overwritten or appended to.

One file can have any number of keys, so a mode='a', append=False setting means that only one key gets overwritten while the other keys stay.
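
The difference between the file-level mode= and the key-level append= can be sketched as follows. This is an illustration only: the helper name write_key is hypothetical, PyTables must be installed, and the frames written under one key are assumed to share the same columns and dtypes.

```python
import pandas as pd


def write_key(path, key, frame, append):
    """Write `frame` under `key` without touching the file's other keys.

    mode="a" opens the file for appending and creates it if it is missing.
    append=True adds rows to an existing table under `key`;
    append=False replaces whatever is stored under `key`.
    (Hypothetical helper, not part of pandas.)
    """
    frame.to_hdf(path, key=key, mode="a", format="table", append=append)
```

Calling write_key with append=True twice on the same key grows that table, while a later call with append=False on a second key replaces only that second key and leaves the first one as it was.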

I had a similar experience to yours and found the additional append argument in the reference doc. After setting it, appending now works properly for me.

Ref: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html

Note: HDF5 won't do anything with the DataFrame's indexes; each appended frame is stored with whatever index it arrives with. We need to iron those out before putting the data in or when we take it out.
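
One way to iron out the index before putting the data in is to relabel each new batch so that it continues from the number of rows already stored. This is a minimal sketch, assuming a plain integer index as in the question's table; the helper name append_with_running_index is hypothetical and PyTables must be installed. The alternative, fixing the index when taking the data out, is to call reset_index(drop=True) on the frame after reading it.

```python
import pandas as pd


def append_with_running_index(path, key, rows):
    """Append `rows` under `key`, relabelled to continue the stored index.

    (Hypothetical helper, not part of pandas.)
    """
    with pd.HDFStore(path, mode="a") as store:
        # Number of rows already stored under `key` (0 if the key is new).
        start = int(store.get_storer(key).nrows) if key in store else 0
        rows = rows.copy()  # leave the caller's frame untouched
        rows.index = range(start, start + len(rows))
        store.append(key, rows, format="table")
```

With this helper, a freshly built one-row DataFrame (whose own index would be 0) is stored at the next free position, so the table read back has the index 0, 1, 2, ... without duplicates.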