HDF5 - concurrency, compression & I/O performance
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/16628329/
Asked by Amelio Vazquez-Reina
I have the following questions about HDF5 performance and concurrency:
- Does HDF5 support concurrent write access?
- Concurrency considerations aside, how is HDF5 in terms of I/O performance (do compression rates affect it)?
- Since I use HDF5 with Python, how does its performance compare to SQLite?
Accepted answer by Jeff
Updated to use pandas 0.13.1
1) No. http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to do this, e.g. have your different threads/processes write out the computation results, then have a single process combine them.
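One way to sketch the "write separately, then combine in a single process" workaround described above (the file names, chunk size, and worker function are my own illustrative assumptions, not from the original answer):

```python
import os
from multiprocessing import Pool

import numpy as np
import pandas as pd

def worker(i):
    # each process computes its chunk and writes its own private HDF5 file;
    # no two processes ever write to the same file
    df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
    path = 'part_%d.h5' % i
    df.to_hdf(path, 'data', mode='w')
    return path

def combine(paths, out='combined.h5'):
    # a single process reads every partial file and writes the final store
    combined = pd.concat([pd.read_hdf(p, 'data') for p in paths],
                         ignore_index=True)
    combined.to_hdf(out, 'data', mode='w')
    return combined

if __name__ == '__main__':
    with Pool(4) as pool:
        paths = pool.map(worker, range(4))
    print(len(combine(paths)))  # 4 workers x 1000 rows each -> 4000
```

The same pattern works with threads or a job queue; the only invariant is that exactly one process ever writes to any given HDF5 file.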
2) Depending on the type of data you store, how you do it, and how you want to retrieve it, HDF5 can offer vastly better performance. Storing in an HDFStore as a single array, float data, compressed (in other words, not storing it in a format that allows for querying), will be stored/read amazingly fast. Even storing in the table format (which slows down write performance) will offer quite good write performance. You can look at this for some detailed comparisons (which is what HDFStore uses under the hood): http://www.pytables.org/, here's a nice picture:

[benchmark chart from pytables.org]

(and since PyTables 2.3 the queries are now indexed), so perf actually is MUCH better than this. So to answer your question, if you want any kind of performance, HDF5 is the way to go.
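To illustrate the indexed-query point: a table-format store can be filtered on data columns at read time, which the fixed format cannot do. The file name, column names, and condition below are illustrative assumptions, not from the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 2), columns=list('AB'))

# format='table' plus data_columns makes column 'A' queryable on disk
df.to_hdf('query_demo.h5', 'df', format='table', data_columns=['A'], mode='w')

# only the matching rows are materialized, not the whole file
subset = pd.read_hdf('query_demo.h5', 'df', where='A > 0')
assert (subset['A'] > 0).all()
```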
Writing:
In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop
In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop
In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop
In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop
Reading:
In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop
In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop
In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop
In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
And here's the code
import sqlite3
import os

import numpy as np
import pandas as pd
from pandas import DataFrame
from numpy.random import randn
from pandas.io import sql
In [3]: df = DataFrame(randn(1000000,2),columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000  non-null values
B    1000000  non-null values
dtypes: float64(2)
def test_sql_write(df):
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()
def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()
def test_hdf_fixed_write(df):
    df.to_hdf('test_fixed.hdf','test',mode='w')
def test_csv_read():
    pd.read_csv('test.csv',index_col=0)
def test_csv_write(df):
    df.to_csv('test.csv',mode='w')    
def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf','test')
def test_hdf_table_write(df):
    df.to_hdf('test_table.hdf','test',format='table',mode='w')
def test_hdf_table_read():
    pd.read_hdf('test_table.hdf','test')
Of course YMMV.
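Since the question also asks whether compression affects performance: with pandas, compression is switched on per call via the complevel/complib parameters of to_hdf (both real pandas parameters; the file names below are illustrative). Note that the random floats used here compress poorly, so the size benefit shows up mainly on repetitive or sorted data, while the CPU cost is paid either way:

```python
import os

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 2), columns=list('AB'))

# uncompressed fixed-format store
df.to_hdf('plain.h5', 'df', mode='w')

# zlib-compressed store: higher complevel trades CPU time for file size
df.to_hdf('packed.h5', 'df', mode='w', complevel=9, complib='zlib')

print(os.path.getsize('plain.h5'), os.path.getsize('packed.h5'))
```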
Answer by tacaswell
Look into pytables, they might have already done a lot of this legwork for you.
That said, I am not fully clear on how to compare hdf and sqlite. hdf is a general-purpose hierarchical data file format + libraries, and sqlite is a relational database.
hdf does support parallel I/O at the C level, but I am not sure how much of that h5py wraps, or if it will play nice with NFS.
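For completeness, a minimal h5py sketch of serial I/O (parallel HDF5 requires an MPI-enabled build opened with driver='mpio', which is not shown here; the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# plain serial h5py usage: create a compressed dataset, then read it back
with h5py.File('demo_h5py.h5', 'w') as f:
    f.create_dataset('results', data=np.arange(10), compression='gzip')

with h5py.File('demo_h5py.h5', 'r') as f:
    data = f['results'][:]

assert data.sum() == 45
```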
If you really want a highly concurrent relational database, why not just use a real SQL server?

