Converting CSV file to HDF5 using pandas

Note: this page is a Chinese-English translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/23334211/

Tags: python, pandas, hdf5

Asked by boojum

When I use pandas to convert CSV files to HDF5 files, the resulting file is extremely large. For example, a 170 MB test CSV file (23 columns, 1.3 million rows) results in an HDF5 file of 2 GB. However, if pandas is bypassed and the HDF5 file is written directly (using PyTables), it is only 20 MB. In the following code (used to do the conversion in pandas), the values of the object columns in the dataframe are explicitly converted to string objects (to prevent pickling):

import pandas as pd

# Open the csv file as a pandas DataFrame
data = pd.read_csv(csvfilepath, sep=delimiter, low_memory=False)

# Write the resulting data frame to the hdf5 file
data.to_hdf(hdf5_file_path, table_name, format='table', complevel=9,
            complib='lzo')
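
The explicit object-to-string conversion described above is not shown in the snippet; a minimal sketch of what such a step might look like (an assumption for illustration, not code from the original question):

# This would run between read_csv and to_hdf above: force object-dtype
# columns to plain Python strings so that to_hdf() stores them as
# fixed-width strings rather than pickled objects.
for col in data.columns:
    if data[col].dtype == object:
        data[col] = data[col].astype(str)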

This is the resulting HDF5 file inspected with ViTables (the screenshot from the original question is not reproduced here):

What seems odd to me is that the values are represented as a (Python?) list per data type (values_block0: int, values_block1: float and values_block2: string) instead of one dedicated column for every column in the CSV file. I'm wondering whether this causes the large file size and what the impact will be on query times.

Given that about 1 TB has to be converted, I would like to know what can be done to reduce the size of the resulting HDF5 file.

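For a conversion of that size, one common approach (a sketch reusing the placeholder names from the snippet above; not part of the original question) is to stream the CSV in chunks and append each chunk to a single compressed table, so the full file never has to fit in memory:

import pandas as pd

# Stream the CSV in chunks and append to one compressed table.
with pd.HDFStore(hdf5_file_path, mode='w', complevel=9, complib='lzo') as store:
    for chunk in pd.read_csv(csvfilepath, sep=delimiter, chunksize=500000):
        store.append(table_name, chunk, format='table')

Note that when appending, string column widths are fixed by the first chunk, so min_itemsize may need to be passed explicitly (see the answer below).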

P.S. I'm aware of this question, but it states that the large HDF5 file size is caused by the HDF5 format itself, which can't be the cause here, given that the HDF5 file resulting from bypassing pandas is much smaller.

P.P.S. Using data.iloc instead of data.loc, as suggested by joris, doesn't make any difference. I've also removed the 'conversion'; it doesn't make a difference either. The info on the dataframe as read, as requested by Jeff:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1303331 entries, 0 to 1303330
Columns: 23 entries, _PlanId to ACTIVITY_Gratis
dtypes: float64(1), int64(5), object(17)

Answered by Jeff

Here's an informal comparison of times/sizes for various IO methods.

Using pandas 0.13.1 on 64-bit Linux.

Setup

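The interactive session below assumes numpy and pandas have already been imported; a sketch of the assumed setup (not shown in the original answer):

import numpy as np
from pandas import DataFrame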

In [3]: N = 1000000

In [4]: df = DataFrame(dict([ ("int{0}".format(i),np.random.randint(0,10,size=N)) for i in range(5) ]))

In [5]: df['float'] = np.random.randn(N)

In [6]: from random import randrange

In [8]: for i in range(10):
   ...:     df["object_1_{0}".format(i)] = ['%08x'%randrange(16**8) for _ in range(N)]
   ...:     

In [9]: for i in range(7):
   ...:     df["object_2_{0}".format(i)] = ['%15x'%randrange(16**15) for _ in range(N)]
   ...:     

 In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 23 columns):
int0          1000000 non-null int64
int1          1000000 non-null int64
int2          1000000 non-null int64
int3          1000000 non-null int64
int4          1000000 non-null int64
float         1000000 non-null float64
object_1_0    1000000 non-null object
object_1_1    1000000 non-null object
object_1_2    1000000 non-null object
object_1_3    1000000 non-null object
object_1_4    1000000 non-null object
object_1_5    1000000 non-null object
object_1_6    1000000 non-null object
object_1_7    1000000 non-null object
object_1_8    1000000 non-null object
object_1_9    1000000 non-null object
object_2_0    1000000 non-null object
object_2_1    1000000 non-null object
object_2_2    1000000 non-null object
object_2_3    1000000 non-null object
object_2_4    1000000 non-null object
object_2_5    1000000 non-null object
object_2_6    1000000 non-null object
dtypes: float64(1), int64(5), object(17)

Saving with various methods

In [12]: df.to_hdf('test_fixed.h5','data',format='fixed')

In [13]: df.to_hdf('test_table_no_dc.h5','data',format='table')

In [14]: df.to_hdf('test_table_dc.h5','data',format='table',data_columns=True)

In [15]: df.to_hdf('test_fixed_compressed.h5','data',format='fixed',complib='blosc',complevel=9)

In [16]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 361093304 Apr 28 09:20 test_fixed.h5
-rw-rw-r-- 1 jreback users 311475690 Apr 28 09:21 test_table_no_dc.h5
-rw-rw-r-- 1 jreback users 351316525 Apr 28 09:22 test_table_dc.h5
-rw-rw-r-- 1 jreback users 317467870 Apr 28  2014 test_fixed_compressed.h5

The size on disk is going to be a function of the string size selected for each column; if you use NO data_columns, then it's the longest size of ANY string in the block. So writing with data_columns can potentially reduce the size here (balanced by the fact that you have more columns, so it takes more space per column). You probably want to specify min_itemsize to control this; see here.

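As an illustration (a sketch, not from the original answer), min_itemsize can be passed through to_hdf to fix the stored string width of individual columns up front, which matters most when appending in chunks; the widths below match the 8- and 15-character hex strings generated in the setup:

# min_itemsize sets the minimum stored width of each string column;
# the widths here match the 8- and 15-character hex strings built above.
df.to_hdf('test_table_min.h5', 'data', format='table', data_columns=True,
          min_itemsize={'object_1_0': 8, 'object_2_0': 15})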

Here is an example of the on-disk structure:

In [8]: DataFrame(dict(A = ['foo','bar','bah'], B = [1,2,3], C = [1.0,2.0,3.0], D=[4.0,5.0,6.0])).to_hdf('test.h5','data',mode='w',format='table')

In [9]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/data (Group) ''
  /data._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B', 'C', 'D'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/data/table (Table(3,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=3, shape=(1,), dflt='', pos=3)}
  byteorder := 'little'
  chunkshape := (1872,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /data/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := '',
    FIELD_3_NAME := 'values_block_2',
    NROWS := 3,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['C', 'D'],
    values_block_1_dtype := 'int64',
    values_block_1_kind := ['B'],
    values_block_2_dtype := 'string24',
    values_block_2_kind := ['A']]
  Data dump:
[0] (0, [1.0, 4.0], [1], ['foo'])
[1] (1, [2.0, 5.0], [2], ['bar'])
[2] (2, [3.0, 6.0], [3], ['bah'])

Dtypes are grouped into blocks (if you have data_columns, then they are separate). These are just printed this way; they are stored array-like.

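For comparison (a sketch, not part of the original answer), writing the same small frame with data_columns=True makes each column its own field in the table description rather than packing the columns into values_block_* by dtype:

from pandas import DataFrame

df = DataFrame(dict(A=['foo', 'bar', 'bah'], B=[1, 2, 3],
                    C=[1.0, 2.0, 3.0], D=[4.0, 5.0, 6.0]))

# Each column now appears individually (and is queryable) in ptdump,
# e.g. "A", "B", "C" and "D" instead of values_block_0/1/2.
df.to_hdf('test_dc.h5', 'data', mode='w', format='table', data_columns=True)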