Original URL: http://stackoverflow.com/questions/37928794/
Warning: this is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Which is faster to load in Python: pickle or HDF5?
Asked by denvar
Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data in Python: pickle (via cPickle), HDF5, or something else?
- I only care about the fastest speed to load the data into memory.
- I don't care about dumping the data; it's slow, but I only do this once.
- I don't care about file size on disk.
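Since only load time matters here, one minimal sketch is to dump the whole list once with the highest pickle protocol and time only the load. The file name and the small list of frames below are hypothetical stand-ins for the real 1.5 GB list:

```python
import pickle
import time

import numpy as np
import pandas as pd

# Hypothetical small list of frames standing in for the 1.5 GB one.
frames = [pd.DataFrame(np.random.rand(1000, 6)) for _ in range(3)]

# Dump once; per the question, write speed doesn't matter.
with open("frames.pkl", "wb") as f:
    pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)

# Time only the load into memory.
t0 = time.perf_counter()
with open("frames.pkl", "rb") as f:
    loaded = pickle.load(f)
print(f"load took {time.perf_counter() - t0:.4f}s for {len(loaded)} frames")
```

Using `pickle.HIGHEST_PROTOCOL` matters: the default protocol in Python 2 (where cPickle lives) is a slow text-based format.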
Answered by MaxU
I would consider only two storage formats: HDF5 (PyTables) and Feather.
Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory 183.1 MB, size of uncompressed CSV - 492 MB).
Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression settings):
read_s write_s size_ratio_to_CSV
storage
CSV 17.900 69.00 1.000
CSV.gzip 18.900 186.00 0.047
Pickle 0.173 1.77 0.374
HDF_fixed 0.196 2.03 0.435
HDF_tab 0.230 2.60 0.437
HDF_tab_zlib_c5 0.845 5.44 0.035
HDF_tab_zlib_c9 0.860 5.95 0.035
HDF_tab_bzip2_c5 2.500 36.50 0.011
HDF_tab_bzip2_c9 2.500 36.50 0.011
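A table like the one above can be produced with a small timing loop. The sketch below covers only the formats that need no optional dependencies (CSV, CSV.gzip, Pickle); the file names are hypothetical, and the frame is much smaller than the 4000000 x 6 one benchmarked above:

```python
import time

import numpy as np
import pandas as pd

# Small test frame with a datetime column, echoing the answer's data.
df = pd.DataFrame({
    "x": pd.to_datetime(np.arange(50_000), unit="s"),
    "y": np.random.rand(50_000),
})

# (writer, reader) pairs for the dependency-free subset of the formats.
formats = {
    "CSV": (lambda d: d.to_csv("bench.csv", index=False),
            lambda: pd.read_csv("bench.csv", parse_dates=["x"])),
    "CSV.gzip": (lambda d: d.to_csv("bench.csv.gz", index=False, compression="gzip"),
                 lambda: pd.read_csv("bench.csv.gz", parse_dates=["x"])),
    "Pickle": (lambda d: d.to_pickle("bench.pkl"),
               lambda: pd.read_pickle("bench.pkl")),
}

results = {}
for name, (write, read) in formats.items():
    t0 = time.perf_counter(); write(df); w = time.perf_counter() - t0
    t0 = time.perf_counter(); out = read(); r = time.perf_counter() - t0
    results[name] = (r, w)
    print(f"{name:10s} read {r:.3f}s  write {w:.3f}s")
```

Adding the HDF5 rows only requires the `tables` package and calls like `df.to_hdf("bench.h5", key="df", format="table", complib="zlib", complevel=5)` in the same loop.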
But it might be different for you, because all my data was of the datetime dtype, so it's always better to make such a comparison with your real data, or at least with similar data.