Fastest file format for read/write operations with Pandas and/or Numpy
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/22941147/
Asked by c_david
I've been working for a while with very large DataFrames and I've been using the csv format to store input data and results. I've noticed that a lot of time goes into reading and writing these files which, for example, dramatically slows down batch processing of data. I was wondering if the file format itself is of relevance. Is there a preferred file format for faster reading/writing Pandas DataFrames and/or Numpy arrays?
Accepted answer by Jeff
Use HDF5. Beats writing flat files hands down. And you can query. Docs are here
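For instance, a minimal sketch of a queryable store with pandas (requires PyTables; the file name, key, and column names are illustrative):

import pandas as pd

# format='table' makes the store queryable; data_columns lets where-clauses
# filter on those columns without loading the whole file
df.to_hdf('store.h5', 'data', mode='w', format='table', data_columns=['price'])

subset = pd.read_hdf('store.h5', 'data', where='price > 100')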
Here's a perf comparison vs SQL. Updated to show SQL/HDF_fixed/HDF_table/CSV write and read perfs.
Docs now include a performance section:
See here
Answer by Aryeh Leib Taurog
It's always a good idea to run some benchmarks for your use case. I've had good results storing raw structs via numpy:
# write the raw records to disk; mytype is the structured dtype you keep track of
df.to_records().astype(mytype).tofile('mydata')
# reload by reading the raw bytes back with the same dtype
df = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype))
It is quite fast and takes up less space on disk. But: you'll need to keep track of the dtype to reload the data, it's not portable between architectures, and it doesn't support the advanced features of HDF5. (numpy has a more advanced binary format which is designed to overcome the first two limitations, but I haven't had much success getting it to work.)
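The snippet above doesn't show where mytype comes from; for a purely numeric frame it can be taken from the records themselves, while string/object columns would have to be mapped to fixed-width types (e.g. 'S16') by hand. A sketch under that assumption:

# keep (or persist) the structured dtype so the raw file can be reloaded later;
# this only round-trips cleanly when every field has a fixed-width numpy type
mytype = df.to_records().dtype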
Update: Thanks for pressing me for numbers. My benchmark indicates that indeed HDF5 wins, at least in my case. It's both faster and smaller on disk! Here's what I see with a dataframe of about 280k rows, 7 float columns, and a string index:
In [15]: %timeit df.to_hdf('test_fixed.hdf', 'test', mode='w')
10 loops, best of 3: 172 ms per loop
In [17]: %timeit df.to_records().astype(mytype).tofile('raw_data')
1 loops, best of 3: 283 ms per loop
In [20]: %timeit pd.read_hdf('test_fixed.hdf', 'test')
10 loops, best of 3: 36.9 ms per loop
In [22]: %timeit pd.DataFrame.from_records(np.fromfile('raw_data', dtype=mytype))
10 loops, best of 3: 40.7 ms per loop
In [23]: ls -l raw_data test_fixed.hdf
-rw-r----- 1 altaurog altaurog 18167232 Apr 8 12:42 raw_data
-rw-r----- 1 altaurog altaurog 15537704 Apr 8 12:41 test_fixed.hdf
Answer by Rafael S. Calsaverini
Recently pandas added support for the parquet format, using the pyarrow library as a backend (written by Wes McKinney himself, with his usual obsession for performance).
You only need to install the pyarrow library and use the methods read_parquet and to_parquet. Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or more) and it also keeps track of dtype metadata, so you won't lose data type information when writing to and reading from disk. It can also store some data types more efficiently than HDF5, which is not very performant with them (like strings and timestamps: HDF5 doesn't have a native data type for those, so it uses pickle to serialize them, which is slow for big datasets).
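For example, a basic round trip that preserves the frame's dtypes (the file name is illustrative and pyarrow is assumed to be installed):

import pandas as pd

df.to_parquet('data.parquet')           # column dtypes go into the file's metadata
df2 = pd.read_parquet('data.parquet')   # comes back with the same dtypes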
Parquet is also a columnar format, which makes it very easy to do two things:
Quickly filter out columns that you're not interested in. With CSV you have to read the whole file, and only afterwards can you throw away the columns you don't want; with parquet you can actually read only the columns you're interested in.
Run queries that filter out rows and read only what you care about (a short sketch follows).
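A minimal sketch of such selective reads (column names and the filter are illustrative; passing filters through read_parquet depends on the pandas/pyarrow versions in use):

import pandas as pd

df.to_parquet('data.parquet', engine='pyarrow')

# read just two columns instead of the whole file
subset = pd.read_parquet('data.parquet', columns=['price', 'volume'])

# push a row filter down to the reader (pyarrow engine)
recent = pd.read_parquet('data.parquet', engine='pyarrow',
                         filters=[('volume', '>', 0)])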
Another interesting recent development is the Feather file format, which is also developed by Wes McKinney. It's essentially just an uncompressed arrow format written directly to disk, so it is potentially faster to write than the Parquet format. The disadvantage is files that are 2-3x larger.
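A minimal sketch (requires pyarrow; the file name is illustrative, and feather expects a default integer index, so reset any custom index first):

import pandas as pd

df.to_feather('data.feather')         # Arrow memory layout written straight to disk
df2 = pd.read_feather('data.feather')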
Answer by rahenri
HDF is indeed a very good choice; you can also use npy/npz, with some caveats.
Here is a benchmark using a data frame of 25k rows and 1000 columns filled with random floats:
Saving to HDF took 0.49s
Saving to npy took 0.40s
Loading from HDF took 0.10s
Loading from npy took 0.061s
npy is about 20% faster to write and about 40% faster to read if you don't compress data.
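For comparison, the compressed variant is a one-word change (a sketch, with f being the frame from the code below; file names are illustrative):

import numpy as np

np.savez('frame.npz', f.index, f.values)               # uncompressed, as benchmarked here
np.savez_compressed('frame_c.npz', f.index, f.values)  # smaller on disk, but slower to save and load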
Code used to generate the output above:
#!/usr/bin/python3
import pandas as pd
import numpy as np
import time

# build a 25k x 1000 frame of random floats
start = time.time()
f = pd.DataFrame()
for i in range(1000):
    f['col_{}'.format(i)] = np.random.rand(25000)
print('Generating data took {}s'.format(time.time() - start))

start = time.time()
f.to_hdf('frame.hdf', 'main', format='fixed')
print('Saving to HDF took {}s'.format(time.time() - start))

start = time.time()
np.savez('frame.npz', f.index, f.values)
print('Saving to npy took {}s'.format(time.time() - start))

start = time.time()
pd.read_hdf('frame.hdf')
print('Loading from HDF took {}s'.format(time.time() - start))

start = time.time()
# np.load on an .npz returns a mapping of arrays; pull the index and values out by key
with np.load('frame.npz') as npz:
    index, values = npz['arr_0'], npz['arr_1']
pd.DataFrame(values, index=index)
print('Loading from npy took {}s'.format(time.time() - start))

