Original URL: http://stackoverflow.com/questions/20083098/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) on StackOverFlow.
Improve pandas (PyTables?) HDF5 table write performance
Asked by Peter Gaultney
I've been using pandas for research now for about two months to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of allowing me to process heterogeneous data using all the Python tools I know and love.
Generally speaking, I use the Fixed (formerly "Storer") format in PyTables, as my workflow is write-once, read-many, and many of my datasets are sized such that I can load 50-100 of them into memory at a time with no serious disadvantages. (NB: I do much of my work on Opteron server-class machines with 128GB+ system memory.)
However, for large datasets (500MB and greater), I would like to be able to use the more scalable random-access and query abilities of the PyTables "Tables" format, so that I can perform my queries out-of-memory and then load the much smaller result set into memory for processing. The big hurdle here, however, is the write performance. Yes, as I said, my workflow is write-once, read-many, but the relative times are still unacceptable.
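(For context, the kind of out-of-memory query the Table format enables looks roughly like the sketch below; the file name, key, and where clause are illustrative, and the queried columns would need to be data_columns or the index for this to work.)

import pandas as pd

# illustrative on-disk query against a Table-format store; only the matching rows
# are loaded into memory, not the whole dataset
subset = pd.read_hdf('trace.h5', 'events', where='(node_id == 3) & (begin > 1000000)')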
As an example, I recently ran a large Cholesky factorization that took 3 minutes, 8 seconds (188 seconds) on my 48 core machine. This generated a trace file of ~2.2 GB - the trace is generated in parallel with the program, so there is no additional "trace creation time."
The initial conversion of my binary trace file into the pandas/PyTables format takes a decent chunk of time, but largely because the binary format is deliberately out-of-order in order to reduce the performance impact of the trace generator itself. This is also irrelevant to the performance loss when moving from the Storer format to the Table format.
My tests were initially run with pandas 0.12, numpy 1.7.1, PyTables 2.4.0, and numexpr 0.20.1. My 48 core machine runs at 2.8GHz per core, and I am writing to an ext3 filesystem which is probably (but not certainly) on a SSD.
I can write the entire dataset to a Storer format HDF5 file (resulting filesize: 3.3GB) in 7.1 seconds. The same dataset, written to the Table format (resulting file size is also 3.3GB), takes 178.7 seconds to write.
The code is as follows:
import pandas as pd

# events_dataset is the 28.8M-row trace DataFrame described below
with Timer() as t:  # Timer: a simple timing context manager (sketch below); elapsed seconds end up in t.interval
    store = pd.HDFStore('test_storer.h5', 'w')
    store.put('events', events_dataset, table=False, append=False)
print('Fixed format write took ' + str(t.interval))

with Timer() as t:
    store = pd.HDFStore('test_table.h5', 'w')
    store.put('events', events_dataset, table=True, append=False)
print('Table format write took ' + str(t.interval))
and the output is simply
Fixed format write took 7.1
Table format write took 178.7
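The Timer context manager isn't defined in the post; a minimal sketch consistent with its usage (recording elapsed wall-clock time in .interval) would be:

import time

class Timer(object):
    # simple wall-clock timing context manager; elapsed seconds end up in .interval
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.time() - self.start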
My dataset has 28,880,943 rows, and the columns are basic datatypes:
node_id int64
thread_id int64
handle_id int64
type int64
begin int64
end int64
duration int64
flags int64
unique_id int64
id int64
DSTL_LS_FULL float64
L2_DMISS float64
L3_MISS float64
kernel_type float64
dtype: object
...so I don't think there should be any data-specific issues with the write speed.
I've also tried adding BLOSC compression, to rule out any strange I/O issues that might affect one scenario or the other, but compression seems to decrease the performance of both equally.
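(For reference, BLOSC compression can be enabled at the store level roughly as follows; the file name and complevel here are illustrative rather than the exact settings used.)

store = pd.HDFStore('test_table_blosc.h5', 'w', complib='blosc', complevel=9)
store.put('events', events_dataset, table=True, append=False)
store.close()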
Now, I realize that the pandas documentation says that the Storer format offers significantly faster writes, and slightly faster reads. (I do experience the faster reads, as a read of the Storer format seems to take around 2.5 seconds, while a read of the Table format takes around 10 seconds.) But it really seems excessive that the Table format write should take 25 times as long as the Storer format write.
Can any of the folks involved with PyTables or pandas explain the architectural (or otherwise) reasons why writing to the queryable format (which clearly requires very little extra data) should take an order of magnitude longer? And is there any hope for improving this in the future? I'd love to jump in to contributing to one project or the other, as my field is high performance computing and I see a significant use case for both projects in this domain.... but it would be helpful to get some clarification on the issues involved first, and/or some advice on how to speed things up from those who know how the system is built.
EDIT:
Running the former tests with %prun in IPython gives the following (somewhat reduced for readability) profile output for the Storer/Fixed format:
%prun -l 20 profile.events.to_hdf('test.h5', 'events', table=False, append=False)
3223 function calls (3222 primitive calls) in 7.385 seconds
Ordered by: internal time
List reduced from 208 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
6 7.127 1.188 7.128 1.188 {method '_createArray' of 'tables.hdf5Extension.Array' objects}
1 0.242 0.242 0.242 0.242 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
1 0.003 0.003 0.003 0.003 {method '_g_new' of 'tables.hdf5Extension.File' objects}
46 0.001 0.000 0.001 0.000 {method 'reduce' of 'numpy.ufunc' objects}
and the following for the Tables format:
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499082 function calls (499040 primitive calls) in 188.981 seconds
Ordered by: internal time
List reduced from 526 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
29 92.018 3.173 92.018 3.173 {pandas.lib.create_hdf_rows_2d}
640 20.987 0.033 20.987 0.033 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 19.256 0.664 19.256 0.664 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.182 0.047 19.182 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.646 0.001 10.646 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.359 0.022 10.359 0.022 {method 'copy' of 'numpy.ndarray' objects}
80 3.409 0.043 3.409 0.043 {tables.indexesExtension.keysort}
2 3.023 1.512 3.023 1.512 common.py:134(_isnull_ndarraylike)
41 2.489 0.061 2.533 0.062 {method '_fillCol' of 'tables.tableExtension.Row' objects}
87 2.401 0.028 2.401 0.028 {method 'astype' of 'numpy.ndarray' objects}
30 1.880 0.063 1.880 0.063 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
282 0.824 0.003 0.824 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.537 0.013 0.668 0.016 index.py:607(final_idx32)
14490 0.385 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.279 0.007 19.635 0.503 index.py:1219(reorder_slice)
2 0.256 0.128 10.063 5.031 index.py:1099(get_neworder)
1 0.090 0.090 119.392 119.392 pytables.py:3016(write_data)
57842 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
28570 0.062 0.000 0.107 0.000 utils.py:42(is_idx)
14164 0.062 0.000 7.181 0.001 array.py:711(_readSlice)
EDIT 2:
Running again with a pre-release copy of pandas 0.13 (pulled Nov 20 2013 at about 11:00 EST), write times for the Tables format improve significantly but still don't compare "reasonably" to the write speeds of the Storer/Fixed format.
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499748 function calls (499720 primitive calls) in 117.187 seconds
Ordered by: internal time
List reduced from 539 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
640 22.010 0.034 22.010 0.034 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 20.782 0.717 20.782 0.717 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.248 0.047 19.248 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.685 0.001 10.685 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.439 0.022 10.439 0.022 {method 'copy' of 'numpy.ndarray' objects}
30 7.356 0.245 7.356 0.245 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
29 7.161 0.247 37.609 1.297 pytables.py:3498(write_data_chunk)
2 3.888 1.944 3.888 1.944 common.py:197(_isnull_ndarraylike)
80 3.581 0.045 3.581 0.045 {tables.indexesExtension.keysort}
41 3.248 0.079 3.294 0.080 {method '_fillCol' of 'tables.tableExtension.Row' objects}
34 2.744 0.081 2.744 0.081 {method 'ravel' of 'numpy.ndarray' objects}
115 2.591 0.023 2.591 0.023 {method 'astype' of 'numpy.ndarray' objects}
270 0.875 0.003 0.875 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.560 0.014 0.732 0.018 index.py:607(final_idx32)
14490 0.387 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.303 0.008 19.617 0.503 index.py:1219(reorder_slice)
2 0.288 0.144 10.299 5.149 index.py:1099(get_neworder)
57871 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
1 0.084 0.084 45.266 45.266 pytables.py:3424(write_data)
1 0.080 0.080 55.542 55.542 pytables.py:3385(write)
I noticed while running these tests that there are long periods where writing seems to "pause" (the file on disk is not actively growing), and yet there is also low CPU usage during some of these periods.
I begin to suspect that some known ext3 limitations may interact badly with either pandas or PyTables. Ext3 and other non-extent-based filesystems sometimes struggle to unlink large files promptly, and similar system performance (low CPU usage, but long wait times) is apparent even during a simple 'rm' of a 1GB file, for instance.
To clarify, in each test case, I made sure to remove the existing file, if any, before starting the test, so as not to incur any ext3 file removal/overwrite penalty.
However, when re-running this test with index=None, performance improves drastically (~50 s vs. ~120 s when indexing). So it would seem that either this process continues to be CPU-bound (my system has relatively old AMD Opteron Istanbul CPUs running at 2.8GHz, though it does also have 8 sockets with 6-core CPUs in each, all but one of which, of course, sit idle during the write), or that there is some conflict between the way PyTables or pandas attempts to manipulate/read/analyze the file when it is already partially or fully on the filesystem, causing pathologically bad I/O behavior while the indexing is occurring.
EDIT 3:
@Jeff's suggested tests on a smaller dataset (1.3 GB on disk), after upgrading PyTables from 2.4 to 3.0.0, have gotten me here:
In [7]: %timeit f(df)
1 loops, best of 3: 3.7 s per loop
In [8]: %timeit f2(df) # where chunksize= 2 000 000
1 loops, best of 3: 13.8 s per loop
In [9]: %timeit f3(df) # where chunksize= 2 000 000
1 loops, best of 3: 43.4 s per loop
In fact, my performance seems to beat his in all scenarios except for when indexing is turned on (the default). However, indexing still seems to be a killer, and if the way I'm interpreting the output from top and ls as I run these tests is correct, there remain periods of time when there is neither significant processing nor any file writing happening (i.e., CPU usage for the Python process is near 0, and the filesize remains constant). I can only assume these are file reads. Why file reads would be causing slowdowns is hard for me to understand, as I can reliably load an entire 3+ GB file from this disk into memory in under 3 seconds. If they're not file reads, then what is the system 'waiting' on? (No one else is logged into the machine, and there is no other filesystem activity.)
At this point, with upgraded versions of the relevant python modules, the performance for my original dataset is down to the following figures. Of special interest are the system time, which I assume is at least an upper-bound on the time spent performing IO, and the Wall time, which seems to perhaps account for these mysterious periods of no write/no CPU activity.
In [28]: %time f(profile.events)
CPU times: user 0 ns, sys: 7.16 s, total: 7.16 s
Wall time: 7.51 s
In [29]: %time f2(profile.events)
CPU times: user 18.7 s, sys: 14 s, total: 32.7 s
Wall time: 47.2 s
In [31]: %time f3(profile.events)
CPU times: user 1min 18s, sys: 14.4 s, total: 1min 32s
Wall time: 2min 5s
Nevertheless, it would appear that indexing causes a significant slowdown for my use case. Perhaps I should attempt to limit the fields indexed instead of simply performing the default case (which may very well be indexing on all of the fields in the DataFrame)? I am not sure how this is likely to affect query times, especially in cases where a query selects based on a non-indexed field.
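(One way to do that, shown here as a sketch rather than code from these tests, is to pass data_columns when appending, so that only the listed fields get their own queryable, indexable columns.)

store = pd.HDFStore('test_table.h5', 'w')
# only these fields become separate, queryable columns; the choice is illustrative
store.append('events', events_dataset, data_columns=['node_id', 'begin', 'end'])
store.close()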
Per Jeff's request, a ptdump of the resulting file.
ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes :=
[(1, ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id', 'DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1']]
/df/table (Table(28880943,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(10,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(4,), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (4369,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'values_block_1',
NROWS := 28880943,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'int64',
values_block_0_kind := ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id'],
values_block_1_dtype := 'float64',
values_block_1_kind := ['DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type']]
and another %prun with the updated modules and the full dataset:
另一个带有更新模块和完整数据集的 %prun:
%prun -l 25 %time f3(profile.events)
CPU times: user 1min 14s, sys: 16.2 s, total: 1min 30s
Wall time: 1min 48s
542678 function calls (542650 primitive calls) in 108.678 seconds
Ordered by: internal time
List reduced from 629 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
640 23.633 0.037 23.633 0.037 {method '_append' of 'tables.hdf5extension.Array' objects}
15 20.852 1.390 20.852 1.390 {method '_append_records' of 'tables.tableextension.Table' objects}
406 19.584 0.048 19.584 0.048 {method '_g_write_slice' of 'tables.hdf5extension.Array' objects}
14244 10.591 0.001 10.591 0.001 {method '_g_read_slice' of 'tables.hdf5extension.Array' objects}
458 9.693 0.021 9.693 0.021 {method 'copy' of 'numpy.ndarray' objects}
15 6.350 0.423 30.989 2.066 pytables.py:3498(write_data_chunk)
80 3.496 0.044 3.496 0.044 {tables.indexesextension.keysort}
41 3.335 0.081 3.376 0.082 {method '_fill_col' of 'tables.tableextension.Row' objects}
20 2.551 0.128 2.551 0.128 {method 'ravel' of 'numpy.ndarray' objects}
101 2.449 0.024 2.449 0.024 {method 'astype' of 'numpy.ndarray' objects}
16 1.789 0.112 1.789 0.112 {method '_g_flush' of 'tables.hdf5extension.Leaf' objects}
2 1.728 0.864 1.728 0.864 common.py:197(_isnull_ndarraylike)
41 0.586 0.014 0.842 0.021 index.py:637(final_idx32)
14490 0.292 0.000 0.616 0.000 array.py:368(_interpret_indexing)
2 0.283 0.142 10.267 5.134 index.py:1158(get_neworder)
274 0.251 0.001 0.251 0.001 {method 'reduce' of 'numpy.ufunc' objects}
39 0.174 0.004 19.373 0.497 index.py:1280(reorder_slice)
57857 0.085 0.000 0.085 0.000 {numpy.core.multiarray.empty}
1 0.083 0.083 35.657 35.657 pytables.py:3424(write_data)
1 0.065 0.065 45.338 45.338 pytables.py:3385(write)
14164 0.065 0.000 7.831 0.001 array.py:615(__getitem__)
28570 0.062 0.000 0.108 0.000 utils.py:47(is_idx)
47 0.055 0.001 0.055 0.001 {numpy.core.multiarray.arange}
28570 0.050 0.000 0.090 0.000 leaf.py:397(_process_range)
87797 0.048 0.000 0.048 0.000 {isinstance}
Accepted answer by Francesc
That's an interesting discussion. I think Peter is getting awesome performance for the Fixed format because the format writes in a single shot and also that he has a really good SSD (it can write at more than 450 MB/s).
Appending to a table is a more complex operation (the dataset has to be enlarged, and new records must be checked so that we can ensure that they follow the schema of the table). This is why appending rows in tables is generally slower (but still, Jeff is getting ~ 70 MB/s, which is pretty good). That Jeff is getting more speed than Peter is probably due to the fact that he has a better processor.
Finally, indexing in PyTables uses a single processor, yes, and that normally is an expensive operation, so you should really disable it if you are not going to query data on-disk.
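(In pandas terms, that advice translates to something like the following sketch: skip index creation during the write, and build the index later only if on-disk queries are actually needed. The parameter values are illustrative, not taken from the discussion.)

store = pd.HDFStore('test_table.h5', 'w')
store.append('events', events_dataset, index=False)   # no PyTables index built during the write
# later, only if on-disk queries are actually needed:
store.create_table_index('events', optlevel=9, kind='full')
store.close()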
Answered by Jeff
Here is a similar comparison I just did. It's about 1/3 of the data (10M rows). The final size is about 1.3GB.
I define 3 timing functions:
Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format
# imports assumed for the snippets below
import numpy as np
import pandas as pd
from pandas import DataFrame, concat

def f(df):
    store = pd.HDFStore('test.h5','w')
    store['df'] = df
    store.close()
Write in the Table format, using PyTables Table format. Do not create an index.
def f2(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df,index=False)
    store.close()
Same as f2, but create an index (which is normally done)
def f3(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df)
    store.close()
Create the frame
In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)
In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)
v0.12.0
In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop
In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop
In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop
master/v0.13.0
In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop
In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop
In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop
Timing Runs with the same file as provided by the OP (link is below)
In [4]: df = pd.read_hdf('test.h5','df')
In [5]: df
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)
Like f1, Fixed format
In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop
Like f2, Table format, no index
In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop
In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop
Like f3, Table format with index
In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop
Like f3, Table format with index, compressed with blosc
In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop
In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop
Show original file (test.h5, and compressed, test.hdf)
In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users 649327780 Nov 20 21:17 test.hdf
Several points to note.
- Not creating an index can make a non-trivial difference in time. I also believe that if you have a string-based index it can substantially worsen write time. That said, you always want to create an index to make retrieval very fast.
- You didn't include what your index is, nor whether it's sorted (though I think this only makes a small difference).
- The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when INCLUDING the index time). Thus your 7s (1/2 of my time), for 3x the number of rows I am writing, is quite suspect. I am using a reasonably fast disk array. If you were using a flash-based disk then this is possible, though.
- master/v0.13.0 (to be released very soon) improves the write times on tables substantially.
- You can try setting the chunksize parameter to a bigger number when you write the data (its default is 100000). The purpose of the 'relatively' low number is to have a constant memory usage (e.g. if it is bigger you will use more memory; in theory it should write faster, though).
- Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability (see the sketch after this list). Reading the entire table doesn't take advantage of either, so if you ONLY want to read the entire table, then the Fixed format is recommended. (In my experience the flexibility of Tables greatly outweighs the write penalty, but YMMV.)
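A sketch of the appendability advantage mentioned above, writing a large frame to the same on-disk table in pieces (the chunk size and file name are arbitrary):

store = pd.HDFStore('test_appendable.h5', 'w')
chunk = 2000000
for start in range(0, len(df), chunk):
    # each call appends more rows to the same Table-format node
    store.append('df', df.iloc[start:start + chunk], index=False)
store.close()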
Bottom line is to repeat the timings (use ipython as it will run multiple tests). If you can reproduce your results, then pls post a %prun and I'll take a look.
Update:
So the recommended way for a table this size is to compress with blosc and use pandas master/0.13.0 along with PyTables 3.0.0.

