Which is faster to load: pickle or HDF5 in Python

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/37928794/

Date: 2020-08-19 20:06:57  Source: igfitidea

which is faster for load: pickle or hdf5 in python

Tags: python, pandas, numpy, dataframe, hdf5

Asked by denvar

Given a 1.5 Gb list of pandas dataframes, which format is fastest for loading compressed data: pickle (via cPickle), hdf5, or something else in Python?


  • I only care about fastest speed to load the data into memory
  • I don't care about dumping the data; it's slow, but I only do it once.
  • I don't care about file size on disk
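The question includes no code, so here is a minimal timing sketch of how such a load comparison can be run (an assumption on my part, not the asker's setup). It times pickle against CSV on a small synthetic frame; HDF5 and Feather can be added the same way, but they need the optional PyTables / pyarrow dependencies.

```python
# Minimal load-time comparison sketch (hypothetical harness, not from the
# question). Swap the synthetic frame for your real data before drawing
# conclusions.
import pathlib
import tempfile
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 6), columns=list("abcdef"))
tmp = pathlib.Path(tempfile.mkdtemp())

# Dump once in each format (dump speed is irrelevant per the question).
df.to_pickle(tmp / "df.pkl")
df.to_csv(tmp / "df.csv", index=False)

for label, loader in [
    ("pickle", lambda: pd.read_pickle(tmp / "df.pkl")),
    ("csv",    lambda: pd.read_csv(tmp / "df.csv")),
]:
    start = time.perf_counter()
    loaded = loader()
    print(f"{label}: {time.perf_counter() - start:.4f}s, shape={loaded.shape}")
```

To extend this, add `("hdf5", lambda: pd.read_hdf(tmp / "df.h5", "df"))` after a `df.to_hdf(...)` dump, assuming PyTables is installed.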

Answered by MaxU

I would consider only two storage formats: HDF5 (PyTables) and Feather


Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory 183.1 MB, size of uncompressed CSV - 492 MB).


Comparison of the following storage formats (CSV, CSV.gzip, Pickle, HDF5 with various compression settings):


                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011
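The answer does not show the code behind the table, but the row labels presumably correspond to `pandas.DataFrame.to_hdf` keyword arguments roughly as follows (an assumption on my part):

```python
# Hypothetical mapping from the benchmark labels above to to_hdf kwargs
# (the original answer does not publish its benchmark code).
HDF_VARIANTS = {
    "HDF_fixed":        dict(format="fixed"),   # fastest; not queryable/appendable
    "HDF_tab":          dict(format="table"),   # queryable and appendable
    "HDF_tab_zlib_c5":  dict(format="table", complib="zlib",  complevel=5),
    "HDF_tab_zlib_c9":  dict(format="table", complib="zlib",  complevel=9),
    "HDF_tab_bzip2_c5": dict(format="table", complib="bzip2", complevel=5),
    "HDF_tab_bzip2_c9": dict(format="table", complib="bzip2", complevel=9),
}

# Usage (requires the optional PyTables dependency):
# df.to_hdf("data.h5", key="df", **HDF_VARIANTS["HDF_tab_zlib_c5"])
```

The table suggests the usual trade-off: heavier compression (bzip2, higher `complevel`) shrinks the file but slows both reads and writes.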

But it might be different for you, because all my data was of the datetime dtype, so it's always better to make such a comparison with your real data, or at least with similar data...

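One detail worth noting: the question is about a *list* of DataFrames. A single pickle can hold the whole list in one call, while an HDF5 store keeps each frame under its own key. A sketch of both (the HDF5 part is commented out because it needs the optional PyTables dependency; the structure is an assumption, not code from the thread):

```python
# Persisting a list of DataFrames, as in the question.
import pathlib
import tempfile

import numpy as np
import pandas as pd

dfs = [pd.DataFrame(np.random.randn(1_000, 6)) for _ in range(3)]
path = pathlib.Path(tempfile.mkdtemp()) / "dfs.pkl"

pd.to_pickle(dfs, path)        # one file, the whole list at once
loaded = pd.read_pickle(path)  # returns the list back

# HDF5 equivalent (requires PyTables): one store, one key per frame.
# with pd.HDFStore("dfs.h5") as store:
#     for i, df in enumerate(dfs):
#         store.put(f"df_{i}", df, format="fixed")
```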