Warning: this content is taken from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/4814167/
Storing time-series data, relational or non?
Asked by Marcus Whybrow
I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
Which is better with regard to performance when querying data for graphing: a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis)?
Relational
Given a relational database, I would use a data_instances table, which would store every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id, fk_to_device, fk_to_metric, metric_value, timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this single table, filtering out the other devices and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the cumulative number of metrics being recorded for all devices, f is the frequency at which data is polled, and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
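As a rough check of that estimate, the row count can be worked out directly. This is only a sketch: it assumes every one of the 3 devices reports the same 10 metrics, one poll every 5 minutes, and a 365-day year; the variable names are illustrative.

```python
# Rough row-count estimate for the polling scenario described above.
# Assumptions (not stated in the question): every device reports the same
# 10 metrics, and a year is 365 days.
devices = 3
metrics_per_device = 10
polls_per_hour = 60 // 5          # one poll every 5 minutes
hours_per_year = 24 * 365

rows_per_year = devices * metrics_per_device * polls_per_hour * hours_per_year
print(rows_per_year)
```

Under these assumptions the total is roughly 3.15 million rows per year, which is comfortably below the 5 million figure quoted in the question.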
Indexes
Without indexes on fk_to_device and fk_to_metric, scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
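The table, index, and query described in the question could be sketched as follows. This is a minimal illustration, not a recommendation: it uses Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL, a single composite index rather than three separate ones, integer epoch seconds for timestamp, and made-up sample data; the index name and the series helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL/PostgreSQL here (an assumption made
# only so the sketch is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_instances (
        id           INTEGER PRIMARY KEY,
        fk_to_device INTEGER NOT NULL,
        fk_to_metric INTEGER NOT NULL,
        metric_value REAL    NOT NULL,
        timestamp    INTEGER NOT NULL   -- seconds since the epoch (assumed)
    )
    """
)
# One composite index covering both filter columns plus the time range.
conn.execute(
    "CREATE INDEX idx_device_metric_time "
    "ON data_instances (fk_to_device, fk_to_metric, timestamp)"
)

# Hypothetical sample data: 2 devices x 2 metrics x 6 polls, 300 s apart.
rows = [
    (device, metric, float(device * 10 + metric), poll * 300)
    for device in (1, 2)
    for metric in (1, 2)
    for poll in range(6)
]
conn.executemany(
    "INSERT INTO data_instances "
    "(fk_to_device, fk_to_metric, metric_value, timestamp) VALUES (?, ?, ?, ?)",
    rows,
)

def series(device, metric, start, end):
    """Return (timestamp, value) pairs for one device/metric in [start, end]."""
    cursor = conn.execute(
        "SELECT timestamp, metric_value FROM data_instances "
        "WHERE fk_to_device = ? AND fk_to_metric = ? "
        "AND timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (device, metric, start, end),
    )
    return cursor.fetchall()

points = series(1, 2, 300, 900)
print(points)
```

Putting timestamp last in the composite index lets one index serve both the equality filters and the time-range restriction used for graphs over a limited period.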
Non-Relational (NoSQL)
MongoDB has the concept of a collection; unlike tables, these can be created programmatically without setup. With these I could partition the storage of data for each device, or even for each metric recorded for each device.
I have no experience with NoSQL and do not know whether these databases provide query performance features such as indexing. However, the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Accepted answer by PerformanceDBA
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
1) It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
2) Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
- The Id column does nothing, it is totally and completely redundant.
- An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
- The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
- You can get rid of it. Please.
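A table keyed on (Device, Metric, DateTime) with no Id column might look like the sketch below. It is an illustration under stated assumptions, not the answerer's schema: it uses Python's built-in sqlite3 module, the table and column names are hypothetical, the date-time is stored as ISO 8601 text, and WITHOUT ROWID is an SQLite-specific clause that stores rows in the primary-key index itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names are illustrative; the key is (Device, Metric, DateTime).
conn.execute(
    """
    CREATE TABLE reading (
        device    INTEGER NOT NULL,
        metric    INTEGER NOT NULL,
        date_time TEXT    NOT NULL,   -- ISO 8601 text (assumed representation)
        value     REAL    NOT NULL,
        PRIMARY KEY (device, metric, date_time)
    ) WITHOUT ROWID
    """
)

conn.execute("INSERT INTO reading VALUES (1, 2, '2011-01-27T12:00:00', 41.5)")

# A second row with the same key is rejected by the primary key itself,
# so no surrogate Id and no extra index are needed to prevent duplicates.
try:
    conn.execute("INSERT INTO reading VALUES (1, 2, '2011-01-27T12:00:00', 99.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
print(duplicate_rejected, row_count)
```

The same index that enforces uniqueness also serves the device, metric, and time-range lookups, which is the single-index arrangement the answer argues for.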
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the "What is Sixth Normal Form?" heading onwards.
- (I have one index only, not three; on the Non-SQLs you may need three indices).
- I have the exact same table (without the Id "key", of course). I have an additional column, Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (i.e. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
- Monitor Statistics Data Model. (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
- It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
- Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
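One common way to pivot such a table, with metrics down the side and devices across the top, is conditional aggregation. The sketch below is not the answerer's SQL, which is not shown in the thread; the table name, column names, device names, and sample values are all hypothetical, and SQLite (via Python's sqlite3) is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reading (
        server    TEXT NOT NULL,
        device    TEXT NOT NULL,
        metric    TEXT NOT NULL,
        date_time TEXT NOT NULL,
        value     REAL NOT NULL,
        PRIMARY KEY (server, device, metric, date_time)
    )
    """
)
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?, ?, ?)",
    [
        ("srv1", "disk0", "busy_pct", "2011-01-27T12:00", 10.0),
        ("srv1", "disk0", "busy_pct", "2011-01-27T12:05", 30.0),
        ("srv1", "disk1", "busy_pct", "2011-01-27T12:00", 50.0),
        ("srv1", "disk0", "queue_len", "2011-01-27T12:00", 2.0),
        ("srv1", "disk1", "queue_len", "2011-01-27T12:00", 4.0),
    ],
)

# Metrics down the side, devices across the top: one averaged cell per pair.
# AVG skips the NULLs produced when the CASE does not match.
pivot = conn.execute(
    """
    SELECT metric,
           AVG(CASE WHEN device = 'disk0' THEN value END) AS disk0,
           AVG(CASE WHEN device = 'disk1' THEN value END) AS disk1
    FROM reading
    WHERE server = 'srv1'
    GROUP BY metric
    ORDER BY metric
    """
).fetchall()
print(pivot)
```

Swapping the roles of device and metric in the SELECT list and the GROUP BY gives the transposed layout from the same table.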
One More Thing
Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they lack the basics.
Answer by Paolo Bozzola
I found the above answers very interesting. I'll try to add a couple more considerations here.
1) Data aging
Time-series management usually requires aging policies. A typical scenario (e.g. monitoring server CPU) requires storing:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail beyond that (e.g. up to 1 year)
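The aggregation step behind such an aging policy can be sketched in a few lines. This is only an illustration: the downsample function is hypothetical, it averages values per bucket (other aggregates such as min or max may suit a given metric better), and the sample data is synthetic.

```python
from collections import defaultdict

def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a sorted list of (bucket_start, mean_value) pairs.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]

# Synthetic 1-second raw samples covering ten minutes: value = second index.
raw = [(t, float(t)) for t in range(600)]
five_minute = downsample(raw, 300)
print(five_minute)
```

Applying the same function to the 5-minute series with bucket_seconds=3600 would produce the 1-hour tier, after which the older raw samples can be purged.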
Although relational models can certainly manage this appropriately (my company implemented massive centralized databases for some large customers with tens of thousands of data series), the new breed of data stores adds interesting functionality worth exploring, such as:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly, some non-relational data stores are inherently distributed and allow for much more efficient real-time (or near-real-time) data collection, which can be a problem with an RDBMS because of the creation of hotspots (managing indexing while inserting into a single table). In the RDBMS space this problem is typically solved by reverting to batch import procedures (we managed it this way in the past), while NoSQL technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
Answer by Ravindra
Your data is in a single table, so relational vs non-relational is not the question. Basically you need to read a lot of sequential data. If you have enough RAM to store a year's worth of data, then there is nothing like using Redis/MongoDB etc.
Most NoSQL databases will store your data in the same location on disk and in compressed form, to avoid multiple disk accesses.
NoSQL does the same thing as creating an index on device id and metric id, but in its own way. With a relational database, even if you do this, the index and the data may be in different places and there would be a lot of disk I/O.
Tools like Splunk use NoSQL backends to store time series data and then use map-reduce to create aggregates (which might be what you want later). So in my opinion NoSQL is an option, as people have already tried it for similar use cases. But will a million rows bring the database to a crawl? Maybe not, with decent hardware and proper configuration.
Answer by hellomichibye
Create a file and name it 1_2.data. Weird idea? Here is what you get:
- You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
- You save even more space because you don't need any indices.
- Save pairs of (timestamp,metric_value) to the file by appending the data, so you get ordering by timestamp for free (assuming that your sources don't send out-of-order data for a device).
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
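The append-only file with binary search could be sketched like this. It is a minimal illustration under assumptions the answer does not spell out: fixed-size binary records (an 8-byte integer timestamp followed by an 8-byte float), timestamps arriving in increasing order, and hypothetical helper names throughout.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<qd")   # 8-byte integer timestamp, 8-byte float value

def append_point(path, timestamp, value):
    """Append one (timestamp, value) record to the end of the file."""
    with open(path, "ab") as handle:
        handle.write(RECORD.pack(timestamp, value))

def first_index_at_or_after(handle, count, target):
    """Binary search for the first record whose timestamp is >= target."""
    low, high = 0, count
    while low < high:
        middle = (low + high) // 2
        handle.seek(middle * RECORD.size)
        timestamp, _ = RECORD.unpack(handle.read(RECORD.size))
        if timestamp < target:
            low = middle + 1
        else:
            high = middle
    return low

def read_range(path, start, end):
    """Return all (timestamp, value) records with start <= timestamp <= end."""
    count = os.path.getsize(path) // RECORD.size
    points = []
    with open(path, "rb") as handle:
        index = first_index_at_or_after(handle, count, start)
        handle.seek(index * RECORD.size)
        while index < count:
            timestamp, value = RECORD.unpack(handle.read(RECORD.size))
            if timestamp > end:
                break
            points.append((timestamp, value))
            index += 1
    return points

# Hypothetical usage: device 1, metric 2, one sample every 300 seconds.
path = os.path.join(tempfile.mkdtemp(), "1_2.data")
for poll in range(10):
    append_point(path, poll * 300, poll * 1.5)

window = read_range(path, 600, 1200)
print(window)
```

Fixed-size records are what make the binary search possible, since the byte offset of record i is simply i times the record size.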
If you would like it even more optimized, start thinking about splitting your files like this:
- 1_2_january2014.data
- 1_2_february2014.data
- 1_2_march2014.data
Or use kdb+ from http://kx.com because they do all this for you :) Column-oriented storage is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru
Answer by sunil
Answer by Phil Hymanson
This is a problem we've had to solve at ApiAxle. We wrote up a blog post on how we did it using Redis. It hasn't been out there for very long but it's proving to be effective.
I've also used RRDTool for another project, which was excellent.
Answer by Shay
I think that the answer to this kind of question should mainly revolve around the way your database utilizes storage. Some database servers use RAM and disk, some use RAM only (optionally disk for persistence), etc. Most common SQL database solutions use memory+disk storage and write the data in a row-based layout (every inserted row is written in the same physical location). For time-series stores, in most cases the workload is something like this: massive numbers of inserts at relatively short intervals, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric).
I have found that columnar databases (google it, you'll find MonetDB, InfoBright, parAccel, etc.) do a terrific job for time series.
As for your question, which personally I think is somewhat invalid (as are all discussions using the faulty term NoSQL, IMO): you can use a database server that talks SQL on the one hand, making your life very easy, as everyone has known SQL for many years and the language has been perfected over and over again for data queries, but that still utilizes RAM, CPU cache and disk in a column-oriented way, making your solution the best fit for time series.
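The difference between the two layouts can be shown with a toy in-memory example. This is purely illustrative and says nothing about how any particular columnar database is implemented; the metric names, values, and the column_range helper are made up.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    (0, 41.0, 70.5),     # (timestamp, cpu_pct, temperature_c)
    (300, 43.0, 71.0),
    (600, 47.0, 71.5),
    (900, 45.0, 72.0),
]

# Column-oriented: each field is stored as its own contiguous array.
columns = {
    "timestamp": array("q", (r[0] for r in rows)),
    "cpu_pct": array("d", (r[1] for r in rows)),
    "temperature_c": array("d", (r[2] for r in rows)),
}

def column_range(name, start, end):
    """Read one metric over a time range, touching only two columns."""
    times = columns["timestamp"]
    values = columns[name]
    return [values[i] for i, t in enumerate(times) if start <= t <= end]

cpu_window = column_range("cpu_pct", 300, 900)
print(cpu_window)
```

Reading a range of a single metric never touches the temperature_c array, which is the access pattern the answer describes for time-series reads.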
Answer by Juan Asenjo
5 million rows is nothing for today's torrential data. Expect data to be in the TB or PB range in just a few months. At that point an RDBMS does not scale to the task and we need the linear scalability of NoSQL databases. Performance would come from the columnar partitioning used to store the data, using more columns and fewer rows to boost performance. Leverage the OpenTSDB work done on top of HBase or MapR-DB, etc.
Answer by monch1962
I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.
Answer by Adam
You should look into a time series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
A popular example of a time-series database is InfluxDB.