pandas: HDF5 taking more space than CSV?

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and retain the license. Original question: http://stackoverflow.com/questions/16639877/


HDF5 taking more space than CSV?

python pandas hdf5 pytables

Asked by Amelio Vazquez-Reina

Consider the following example:


Prepare the data:


import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:


store = pd.HDFStore('myfile.h5',complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Save also to CSV:


mydf.to_csv('myfile.csv', sep=':')

The result is:


  • myfile.csv is 5.6 MB big
  • myfile.h5 is 11 MB big
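For reference, a quick way to double-check the on-disk sizes (a small sketch, assuming both files were written to the current working directory as above):

import os

for fname in ('myfile.csv', 'myfile.h5'):
    print('%s: %.1f MB' % (fname, os.path.getsize(fname) / 1e6))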

The difference grows bigger as the datasets get larger.


I have tried other compression methods and levels. Is this a bug? (I am using Pandas 0.11 with the latest stable versions of HDF5 and Python.)


Answered by Jeff

Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651


Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are really more efficiently represented in binary (than as a text representation).

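(Not part of the original answer.) A small sketch illustrating the binary-vs-text point: a float64 always occupies 8 bytes in binary, whereas its full-precision text form is typically 17-19 characters plus a delimiter:

import numpy as np

arr = np.random.random(1000)

# Binary: every float64 occupies exactly 8 bytes.
binary_bytes = arr.nbytes  # 8000

# Text: the full-precision repr plus a delimiter, roughly 2-2.5x larger.
text_bytes = sum(len(repr(float(x))) + 1 for x in arr)

print(binary_bytes, text_bytes)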

In addition, HDF5 is row based. You get MUCH more efficiency by having tables that are not too wide but are fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case, as sketched below.)

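A minimal sketch of what "store it transposed" could look like for the example above (illustrative only, not from the original answer; the file name is made up, and the string column is dropped first because transposing a mixed-dtype frame would upcast everything to object):

# Keep only the numeric block; transposing with the 'something' column
# included would force the whole frame to object dtype.
numeric = mydf.drop('something', axis=1)

# 3000 rows x 100 columns instead of 100 x 3000: long and narrow,
# the shape HDF5 handles more efficiently.
store = pd.HDFStore('myfile_transposed.h5', complevel=9, complib='bzip2')
store['mydf_t'] = numeric.T
store.close()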

I routinely have tables that are 10M+ rows and query times can be in the ms range. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy folks for whom 10 GB+ is a few seconds' worth of data!)


-rw-rw-r--  1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r--  1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop

I really wouldn't worry about the size (I suspect you are not, but are merely interested, which is fine). The point of HDF5 is that disk is cheap and CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.

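A rough sketch of the chunked, queryable workflow this refers to (not from the original answer; it assumes format='table', which is what enables on-disk queries, and the file and column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10),
                  columns=['c%d' % i for i in range(10)])

# format='table' writes a chunked, queryable PyTables table;
# data_columns makes 'c0' usable in where-clauses.
df.to_hdf('test_table.h5', key='df', mode='w', format='table',
          data_columns=['c0'])

# Only the matching rows are read into memory, not the whole file.
subset = pd.read_hdf('test_table.h5', key='df', where='c0 > 0.5')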