Python: convert large csv to hdf5

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27203161/


Convert large csv to hdf5

Tags: python, csv, pandas, hdf5, pytables

Asked by jmilloy

I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.

Accepted answer by unutbu

Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which cannot be appended to.

Thus, you can process the CSV files one at a time, using append=True to build up the hdf5 file. Then overwrite the DataFrame, or use del df to allow the old DataFrame to be garbage collected.

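For the original use case (many large CSV files), a minimal sketch of this approach might look like the following. It assumes a hypothetical csv_dir directory containing the files, that they all share the same columns, and it uses read_csv's chunksize so that only one chunk is in memory at a time:

import os
import pandas as pd

filename = '/tmp/test.h5'
csv_dir = 'csv_dir'      # hypothetical directory containing the CSV files
first = True

for name in sorted(os.listdir(csv_dir)):
    # read each CSV in chunks so the whole 84GB never sits in memory
    for chunk in pd.read_csv(os.path.join(csv_dir, name), chunksize=1000000):
        if first:
            # the first chunk creates the file with an appendable table
            chunk.to_hdf(filename, 'data', mode='w', format='table')
            first = False
        else:
            chunk.to_hdf(filename, 'data', append=True)

As in the output above, the default integer index restarts at 0 for every appended chunk, which is harmless if you only ever read the dataset back in full.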



Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Answered by senderle

This should be possible with PyTables. You'll need to use the EArray class though.

As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.

import numpy
import tables
import os

training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')   # fast compressor at a moderate setting

training_input =  training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                             (0, 27), 'Training Output',
                                             bl_filter, 4000000)

for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)

for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))

training_data.close()   # flush everything to disk

Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.

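In PyTables 3.x these positional arguments correspond to keyword parameters, so the first create_earray call above could equivalently be written with the mapping made explicit:

training_input = training_data.create_earray(
    training_data.root, 'X',     # 1) parent node and 2) array name
    atom=a,                      # 3) datatype atom
    shape=(0, 1323),             # 4) 0 in the dimension that will grow
    title='Training Input',      # 5) verbose descriptor
    filters=bl_filter,           # 6) compression filter
    expectedrows=4000000)        # 7) expected rows along that dimension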

Once the array is created, you can use its append method in the expected way.

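To adapt this approach to the CSV files from the question, a rough sketch along the same lines might be as follows; it assumes a hypothetical csv_dir of headerless, comma-separated float files that all have the same, known number of columns (n_cols):

import os
import numpy
import tables

n_cols = 1323                          # assumed column count of the CSV files
h5 = tables.open_file('data.h5', mode='w')
atom = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')

# a 0 in the first dimension marks the axis that grows as rows are appended
dataset = h5.create_earray(h5.root, 'data', atom, (0, n_cols),
                           'CSV data', bl_filter, expectedrows=100000000)

for filename in sorted(os.listdir('csv_dir')):
    # load one CSV at a time, so only a single file needs to fit in memory
    a = numpy.loadtxt(os.path.join('csv_dir', filename), delimiter=',')
    dataset.append(a.reshape(-1, n_cols))

h5.close()

If an individual CSV is itself too large to load at once, numpy.loadtxt could be replaced with a chunked reader (for example pandas.read_csv with chunksize), appending each chunk's values in turn.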