Python 将大 csv 转换为 hdf5
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/27203161/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Convert large csv to hdf5
提问 by jmilloy
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
我有一个 100M(1 亿)行的 csv 文件(实际上是许多单独的 csv 文件),总共 84GB。我需要将其转换为包含单个浮点数据集的 HDF5 文件。我在测试中使用 h5py 没有遇到任何问题,但现在在不耗尽内存的情况下无法完成最终的数据集。
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
如何在不必将整个数据集存储在内存中的情况下写入 HDF5?我期待这里有实际的代码,因为它应该很简单。
I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
我刚刚研究了 pytables,但看起来它的数组类(对应 HDF5 数据集)并不能迭代写入。同样,pandas 的 io_tools 中有 read_csv 和 to_hdf 方法,但我无法一次加载整个数据集,所以这条路行不通。也许你可以帮我用 pytables 或 pandas 中的其他工具正确解决这个问题。
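(For reference, since h5py is already mentioned in the question: a minimal sketch of incremental writes via a resizable dataset. The column count, file name and the random chunks below are placeholders standing in for the real CSV reader.)

import numpy as np
import h5py

ncols = 4                          # hypothetical column count; use the real CSV width
with h5py.File('/tmp/test_h5py.h5', 'w') as f:
    # a chunked, resizable dataset can grow along axis 0 as chunks arrive
    dset = f.create_dataset('data', shape=(0, ncols), maxshape=(None, ncols),
                            dtype='float64', chunks=True)
    for _ in range(3):
        chunk = np.random.rand(1000, ncols)         # stand-in for one parsed CSV chunk
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[-chunk.shape[0]:] = chunk               # write only this chunk, never the whole dataset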
采纳答案 by unutbu
Use append=True in the call to to_hdf:
在调用 to_hdf 时使用 append=True:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
输出
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which cannot be appended to.
请注意,需要在第一次调用 df.to_hdf 时使用 format='table',这样表才可以追加。否则默认格式为 'fixed',它的读写速度更快,但创建的表无法追加。
Thus, you can process each CSV one at a time, using append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
因此,你可以一次处理一个 CSV,用 append=True 来构建 hdf5 文件,然后覆盖 DataFrame 或使用 del df 让旧的 DataFrame 被垃圾回收。
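A rough sketch of that loop, assuming the file names and chunksize below are placeholders for the real ones, reads each CSV in pieces with read_csv(chunksize=...) and appends each piece:

import pandas as pd

filename = '/tmp/big.h5'
csv_files = ['part1.csv', 'part2.csv']   # hypothetical list of the source CSVs

for path in csv_files:
    # read each CSV in manageable pieces instead of loading it whole
    for chunk in pd.read_csv(path, chunksize=1000000):
        chunk.to_hdf(filename, 'data', format='table', append=True)

If the HDF5 file already exists from an earlier run, delete it first (or write the very first chunk with mode='w') so old rows aren't appended to.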
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
或者,你也可以不调用 df.to_hdf,而是直接向 HDFStore 追加:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
输出
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
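Because the store was written in table format, it can also be read back in pieces later rather than all at once; a small sketch (the chunksize value here is arbitrary):

import pandas as pd

# iterate over the table in pieces instead of materializing it in memory
for chunk in pd.read_hdf('/tmp/test.h5', 'data', chunksize=5):
    print(chunk)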
回答 by senderle
This should be possible with PyTables. You'll need to use the EArray class though.
这应该可以通过 PyTables 实现。不过,您需要使用EArray类。
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
例如,以下是我编写的脚本,用于将存储为 .npy 文件的分块训练数据导入到单个 .h5 文件中。
import numpy
import tables
import os

training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')    # fast compressor at a moderate setting

training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)

for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)

for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
查看文档可以获取详细说明,简单来说,create_earray 函数接受:1) 数据根节点或父节点;2) 数组名;3) 数据类型 atom;4) 在要扩展的维度上取 0 的形状;5) 一个描述性标题;6) 压缩过滤器;7) 沿可扩展维度的预期行数。只有前两个是必需的,但实际使用中你很可能会用到全部七个。该函数还接受一些其他可选参数;详情同样请参阅文档。
Once the array is created, you can use its append method in the expected way.
创建数组后,就可以按预期方式使用它的 append 方法。