What is the best open source solution for storing time series data?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1334813/
Asked by lorg
I am interested in monitoring some objects. I expect to get about 10,000 data points every 15 minutes (maybe not at first, but this is the general ballpark). I would also like to be able to get daily, weekly, monthly and yearly statistics. It is not critical to keep the data at the highest resolution (15 minutes) for more than two months.
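For a sense of scale, a quick back-of-envelope calculation (assuming, purely for illustration, about 16 bytes per raw sample; the real size depends on the encoding) suggests the raw volume is modest:

SAMPLES_PER_BATCH = 10_000   # data points arriving per 15-minute interval
BATCHES_PER_DAY = 24 * 4     # one batch every 15 minutes
BYTES_PER_SAMPLE = 16        # assumed: 8-byte timestamp + 8-byte float

samples_per_day = SAMPLES_PER_BATCH * BATCHES_PER_DAY   # 960,000
samples_two_months = samples_per_day * 61                # ~58.6 million
raw_bytes = samples_two_months * BYTES_PER_SAMPLE        # ~0.94 GB

print(f"{samples_per_day:,} samples/day, "
      f"{samples_two_months:,} samples over two months, "
      f"~{raw_bytes / 1e9:.2f} GB raw")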
I am considering various ways to store this data, and have been looking at a classic relational database, or a schemaless database (such as SimpleDB).
My question is, what is the best way to go about doing this? I would very much prefer an open-source (and free) solution to a proprietary, costly one.
Small note: I am writing this application in Python.
Answer by ThomasH
RRDTool by Tobi Oetiker, definitely! It's open source, and it's been designed for exactly such use cases.
EDIT:
To provide a few highlights: RRDTool stores time-series data in a round-robin database. It keeps raw data for a given period of time, then condenses it in a configurable way, so you can have fine-grained data for, say, a month, data averaged over a week for the last six months, and data averaged over a month for the last two years. As a side effect, your database stays the same size the whole time (so no sweating that your disk may run full). That was the storage side. On the retrieval side, RRDTool offers data queries that are immediately turned into graphs (e.g. PNG) that you can readily include in documents and web pages. It's a rock-solid, proven solution, and a much more generalized form of its predecessor, MRTG (some might have heard of that). And once you get into it, you will find yourself re-using it over and over again.
For a quick overview, and to see who uses RRDTool, see also here. If you want to see which kinds of graphics you can produce, make sure you have a look at the gallery.
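As a rough illustration of how the retention described above maps onto RRDTool's configuration, here is a minimal sketch using the python-rrdtool bindings. The data-source name, heartbeat, and RRA layout are assumptions chosen to match the question's 15-minute step and two-month raw retention; they are not part of the original answer:

import rrdtool

# One data source sampled every 900 s (15 min); a heartbeat of 1800 s
# tolerates one missed update before a slot is stored as UNKNOWN.
rrdtool.create(
    "metric.rrd",
    "--step", "900",
    "DS:value:GAUGE:1800:U:U",
    "RRA:AVERAGE:0.5:1:5952",    # raw 15-min points for ~62 days
    "RRA:AVERAGE:0.5:96:730",    # daily averages for ~2 years
    "RRA:AVERAGE:0.5:672:520",   # weekly averages for ~10 years
    "RRA:MAX:0.5:96:730",        # daily maxima, for peak statistics
)

# Feed one sample (N = "now"), then render a graph of the last week.
rrdtool.update("metric.rrd", "N:42.0")
rrdtool.graph(
    "metric.png",
    "--start", "-1w",
    "DEF:v=metric.rrd:value:AVERAGE",
    "LINE2:v#0000FF:value",
)

Note that an RRD file holds a fixed set of data sources, so monitoring many objects typically means one small file per object (or a batch of data sources per file).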
Answer by SilentGhost
Plain text files? It's not clear what your 10k data points per 15 minutes translate to in terms of bytes, but either way text files are easier to store/archive/transfer/manipulate, and you can inspect them directly, just by looking at them. They are fairly easy to work with from Python, too.
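A minimal sketch of what this could look like. The layout (one CSV file per day, with epoch timestamps) is just one plausible convention, not something the answer prescribes:

import csv
import time

def append_batch(samples):
    """Append one batch of (object_id, value) samples to today's file."""
    now = int(time.time())
    day = time.strftime("%Y-%m-%d")
    with open(f"data-{day}.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for object_id, value in samples:
            writer.writerow([now, object_id, value])

append_batch([("sensor-1", 3.14), ("sensor-2", 2.72)])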
Answer by S.Lott
This is pretty standard data-warehousing stuff.
Lots of "facts", organized by a number of dimensions, one of which is time. Lots of aggregation.
In many cases, simple flat files that you process with simple aggregation algorithms based on defaultdict will work wonders -- fast and simple.
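For instance, a sketch of the kind of defaultdict-based aggregation this hints at, reading the per-day CSV layout assumed in the previous answer (file name and format are illustrative assumptions):

import csv
import time
from collections import defaultdict

# Accumulate sum and count per (object, day) bucket from a flat file of
# (epoch_timestamp, object_id, value) rows, then report daily means.
totals = defaultdict(float)
counts = defaultdict(int)

with open("data-2011-01-01.csv", newline="") as f:
    for ts, object_id, value in csv.reader(f):
        day = time.strftime("%Y-%m-%d", time.gmtime(int(ts)))
        totals[(object_id, day)] += float(value)
        counts[(object_id, day)] += 1

for key in sorted(totals):
    print(key, totals[key] / counts[key])

The same loop works for weekly, monthly, or yearly statistics by changing the bucketing format string.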
Look at "Efficiently storing 7.300.000.000 rows".
Answer by Yurik
There is an open source time-series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true; // enforces index uniqueness
    file.InitializeNewFile(); // create file and write header
    file.AppendData(data);    // append data (stream of ArraySegment<>)
}

// Read the needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}