Original question: http://stackoverflow.com/questions/22522551/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
Stack Overflow
Pandas HDF5 as a Database
Asked by prl900
I've been using Python pandas for the last year and I'm really impressed by its performance and functionality; however, pandas is not a database yet. I've been thinking lately about ways to integrate the analysis power of pandas into a flat HDF5 file database. Unfortunately, HDF5 is not designed to deal natively with concurrency.
I've been looking at locking systems, distributed task queues, parallel HDF5, flat-file database managers and multiprocessing for inspiration, but I still don't have a clear idea of where to start.
Ultimately, I would like to have a RESTful API to interact with the HDF5 file to create, retrieve, update and delete data. A possible use case for this could be building a time series store where sensors can write data and analytical services can be implemented on top of it.
Any ideas about possible paths to follow, existing similar projects, or the merits and drawbacks of the whole idea will be very much appreciated.
PS: I know I could use a SQL/NoSQL database to store the data instead, but I want to use HDF5 because I haven't seen anything faster when it comes to retrieving large volumes of data.
Accepted answer by ümit
HDF5 works fine for concurrent read-only access.
For concurrent write access you either have to use parallel HDF5 or have a worker process that takes care of writing to the HDF5 store.
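As a rough sketch of the worker-process option, assuming pandas with PyTables installed (the function name hdf_writer and the store keys are illustrative, not from the answer): a single process owns the store and drains a queue, so all writes are serialized without any locking.

import multiprocessing as mp
import pandas as pd

def hdf_writer(path, queue):
    # The only process allowed to open the store for writing.
    with pd.HDFStore(path, mode="a") as store:
        while True:
            item = queue.get()
            if item is None:          # sentinel: shut down the writer
                break
            key, frame = item
            store.append(key, frame)  # writes arrive one at a time

if __name__ == "__main__":
    q = mp.Queue()
    writer = mp.Process(target=hdf_writer, args=("data.h5", q))
    writer.start()
    # Any number of producer processes can enqueue results:
    q.put(("sensors/temp", pd.DataFrame({"t": [20.5, 21.0]})))
    q.put(None)
    writer.join()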
There are some efforts to combine HDF5 with a RESTful API from the HDF Group itself. See here and here for more details. I am not sure how mature it is.
I recommend using a hybrid approach and exposing it via a RESTful API.
You can store meta-information in a SQL/NoSQL database and keep the raw data (time series data) in one or multiple HDF5 files.
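A minimal sketch of that split, assuming SQLite for the metadata side and pandas for reading the raw HDF5 data; the table layout and the helper names (register_series, load_series) are illustrative assumptions, not part of the answer:

import sqlite3
import pandas as pd

def register_series(db, sensor_id, hdf_path, hdf_key):
    # Keep searchable metadata plus a pointer to where the raw data lives.
    db.execute("CREATE TABLE IF NOT EXISTS series "
               "(sensor_id TEXT, path TEXT, key TEXT)")
    db.execute("INSERT INTO series VALUES (?, ?, ?)",
               (sensor_id, hdf_path, hdf_key))
    db.commit()

def load_series(db, sensor_id):
    # Look up the HDF5 location in the metadata store, then read with pandas.
    path, key = db.execute(
        "SELECT path, key FROM series WHERE sensor_id = ?",
        (sensor_id,)).fetchone()
    return pd.read_hdf(path, key)

# db = sqlite3.connect("meta.db")
# register_series(db, "sensor_1", "data.h5", "sensors/temp")
# frame = load_series(db, "sensor_1")

A REST layer on top would translate HTTP verbs into calls like these, which is what keeps the storage details hidden from the user.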
There is one public REST API to access the data and the user doesn't have to care what happens behind the curtains.
That's also the approach we are taking for storing biological information. 
Answered by Pietro Battiston
I know the following is not a good answer to the question, but it is perfect for my needs, and I didn't find it implemented anywhere else:
from pandas import HDFStore
import os
import time

class SafeHDFStore(HDFStore):
    """An HDFStore guarded by a lock file, so concurrent writers queue up."""

    def __init__(self, *args, **kwargs):
        probe_interval = kwargs.pop("probe_interval", 1)
        self._lock = "%s.lock" % args[0]
        while True:
            try:
                # O_EXCL makes creation fail if the lock file already
                # exists, so only one process holds the lock at a time.
                self._flock = os.open(self._lock, os.O_CREAT |
                                                  os.O_EXCL |
                                                  os.O_WRONLY)
                break
            except FileExistsError:
                # Another process holds the lock: wait and retry.
                time.sleep(probe_interval)
        HDFStore.__init__(self, *args, **kwargs)

    def __exit__(self, *args, **kwargs):
        # Close the store first, then release the lock for the next writer.
        HDFStore.__exit__(self, *args, **kwargs)
        os.close(self._flock)
        os.remove(self._lock)
I use this as
result = do_long_operations()
with SafeHDFStore('example.hdf') as store:
    # Only put inside this block the code which operates on the store
    store['result'] = result
and different processes/threads working on the same store will simply queue.
Notice that if instead you naively operate on the store from multiple processes, the last one to close the store will "win", and what the others "think they have written" will be lost.
(I know I could instead just let one process manage all writes, but this solution avoids the overhead of pickling)
EDIT:"probe_interval" can now be tuned (one second is too much if writes are frequent)
Answered by John Readey
HDF Group has a REST service for HDF5 out now: http://hdfgroup.org/projects/hdfserver/

