Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3478916/

What should I choose: MongoDB/Cassandra/Redis/CouchDB?

Tags: mongodb, cassandra, redis, couchdb, database

Asked by Juanda

We're developing a really big project, and I was wondering if anyone can give me some advice about which DB backend we should pick.

Our system is composed of 1100 electronic devices that each send a signal to a central server, and the server then stores the signal info (each signal is about 35 bytes long). However, these devices will each be sending about 3 signals per minute, so if we do the numbers, that's 4,752,000 new records/day in the database, and a total of 142,560,000 new records/month.

We need a DB backend that is lightning fast and reliable. Of course, we also need to do some complex data mining on that DB. We're doing some research on MongoDB/Cassandra/Redis/CouchDB, but their documentation websites are still in their early stages.

Any help? Ideas?

Thanks a lot!

Answered by user359996

Don't let the spatial scale (1000+ devices) mislead you as to the computational and/or storage scale. A few dozen 35-byte inserts per second is a trivial workload for any mainstream DBMS, even running on low-end hardware. Likewise, 142 million records per month amounts to only on the order of 1 to 10 gigabytes of storage per month, even without compression and including indices.

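A quick back-of-envelope check of those figures (a sketch assuming the question's 1100 devices, 3 signals per device per minute, and 35 bytes per signal):

```python
# Back-of-envelope check of the workload figures quoted above.
devices = 1100
signals_per_minute = 3   # per device
signal_bytes = 35

writes_per_second = devices * signals_per_minute / 60
records_per_day = devices * signals_per_minute * 60 * 24
records_per_month = records_per_day * 30
raw_bytes_per_month = records_per_month * signal_bytes

print(f"{writes_per_second:.0f} writes/s")                  # 55 writes/s
print(f"{records_per_day:,} records/day")                   # 4,752,000
print(f"{records_per_month:,} records/month")               # 142,560,000
print(f"~{raw_bytes_per_month / 2**30:.1f} GiB/month raw")  # ~4.6 GiB
```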

In your question comment, you said:

"It's all about reliability, scalability and speed. It's very important that the solution scales easily (MongoDB autosharding?) just throwing in more nodes, and the speed is also very important

Reliability? Any mainstream DBMS can guarantee this (assuming you mean it's not going to corrupt your data, and it's not going to crash--see my discussion of the CAP theorem at the bottom of this answer). Speed? Even on a single machine, 10 to 100 times this workload should not be a problem. Scalability? At the current rate, a full year's data, uncompressed and even fully indexed, would easily fit within 100 gigabytes of disk space (likewise, we've already established the insert rate is not an issue).

As such, I don't see any clear need for an exotic solution like NoSQL, or even a distributed database--a plain old relational database such as MySQL would be just fine. If you're worried about failover, just set up a backup server in a master-slave configuration. If we're talking 100s or 1000s of times the current scale, just horizontally partition a few instances based on the ID of the data-gathering device (i.e. {partition index} = {device id} modulo {number of partitions}).

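A minimal sketch of that partitioning rule (the partition count and function name are illustrative, not anything prescribed in the answer):

```python
NUM_PARTITIONS = 4  # illustrative; pick this for the scale you expect

def partition_for(device_id: int) -> int:
    """Route a device's signals to a fixed instance:
    {partition index} = {device id} modulo {number of partitions}."""
    return device_id % NUM_PARTITIONS

print(partition_for(1042))  # 2 -- device 1042 always lands on instance 2
```

Because the mapping is deterministic, each device's entire history stays on a single instance, so per-device queries never have to cross partitions.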

Bear in mind that leaving the safe and comfy confines of the relational database world means abandoning both its representational model and its rich toolset. This will make your "complex data mining" much more difficult--you don't just need to put data into the database, you also need to get it out.

All of that being said, MongoDB and CouchDB are uncommonly simple to deploy and work with. They're also very fun, and will make you more attractive to any number of people (not just programmers--executives, too!).

The common wisdom is that, of the three NoSQL solutions you suggested, Cassandra is the best for high insert volume (of course, relatively speaking, I don't think you have high insert volume--this was designed to be used by Facebook); this is countered by it being more difficult to work with. So unless you have some strange requirements you didn't mention, I would recommend against it for your use case.

If you're positively set on a NoSQL deployment, you might want to consider the CAP theorem. This will help you decide between MongoDB and CouchDB. Here's a good link: http://blog.nahurst.com/visual-guide-to-nosql-systems. It all comes down to what you mean by "reliability": MongoDB trades availability for consistency, whereas CouchDB trades consistency for availability. (Cassandra allows you to finesse this tradeoff, per query, by specifying how many servers must be written/read for a write/read to succeed; UPDATE: Now, so can CouchDB, with BigCouch! Very exciting...)

Best of luck in your project.

Answered by Theo

Much of the answer depends on what you want to do with the data after it's been collected. Storing lots of data is easy: just dump it into log files, no need for a database. On the other hand, if you want to perform complex analysis and data mining on it, then a database is helpful.

The next question is what kind of analysis you're going to do. Will it be performed on a subset of the data that has a particular property, or only on the last hour/day/week/month? Can the data be aggregated or somehow pre-computed? In other words: do you need access to the whole dataset in the form in which it was collected? Can you archive data when it gets too old to be interesting? Can you aggregate the data and perform the analysis on the aggregation?

In my experience from working with advertising analytics (collecting billions of data points about ad exposures), aggregation is key. You collect raw data, sanitize it, and then put it into a database like MongoDB, Cassandra or even MySQL that lets you do updates and queries. Then you periodically aggregate the data and remove it from the database (but archive the raw data; you may need it later).

The aggregation essentially asks all the questions that you want to ask about the data, and saves the answers in a form that makes them easy to retrieve. Say that you want to know which day of the week sees the most X. The naive implementation would be to keep all recorded signals in a huge table and run a query that sums all rows that have X. As the number of collected signals grows, this query will take longer and longer, and no amount of indexing, sharding or optimization will help. Instead, every day/hour/minute (depending on the exact use case and how up to date your reporting needs to be) you look at the new signals you've recorded, and for every X you increment the counter that keeps track of how many X there were on Mondays if it's a Monday, on Tuesdays if it's a Tuesday, and so on. That way you can later retrieve the count for each day of the week and compare them. You do this for all questions you want to be able to answer, and then you remove the signals from the database (but again, keep the raw data).

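A minimal sketch of that counter-based aggregation pass (the signal fields, the is_x predicate, and the in-memory counter are illustrative stand-ins; a real setup would persist the counters):

```python
from collections import Counter
from datetime import datetime

# In-memory stand-in for the aggregate store; in practice the counters
# would live in a table or key-value store that survives restarts.
weekday_counts = Counter()

def aggregate(new_signals):
    """Fold a batch of newly recorded signals into the precomputed answers."""
    for signal in new_signals:
        if signal.get("is_x"):  # hypothetical predicate for "an X happened"
            day = datetime.fromtimestamp(signal["timestamp"]).strftime("%A")
            weekday_counts[day] += 1

# Run periodically; afterwards the raw rows can be archived and deleted.
aggregate([{"is_x": True, "timestamp": 1283932800}])
print(weekday_counts.most_common(1))  # e.g. [('Wednesday', 1)]
```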

The database type you record the aggregates in can be the same as the one you store the incoming signals in, but it doesn't need to be very fancy. It will store keys that represent a particular answer, and values that are usually just numbers.

In old-school data warehousing speak, the database you store the incoming signals in is called an OLTP database (for on-line transaction processing) and the database you store the aggregates in is called OLAP (for on-line analytical processing). OLTP is optimized for insertion and OLAP is optimized for querying. The terms are old, and when people hear them they tend to immediately think SQL and star schemas and all that. Perhaps I shouldn't use them, but they are convenient terms.

Anyway, for OLTP you want something that is quick at inserting data, but also something that supports indexing the data and searching for things. The aggregation is greatly helped by a database that does half the work of summing and finding maximums and minimums. I really like MongoDB because it's so easy to set up and work with. The data I work with tends to be messy, and not all items have the same set of properties, so the forgiving schemalessness of Mongo is a boon. On the other hand, your data sounds much more uniform, so Mongo would perhaps not give you as many benefits. Don't overlook the good old relational databases just yet, though. If you're going to do a lot of summing and so on, then SQL is great; that's what it's built for.

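To make the summing point concrete, here is a toy sketch using SQLite via Python's standard library (the schema and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (device_id INTEGER, value REAL, recorded_at TEXT)")
conn.executemany(
    "INSERT INTO signals VALUES (?, ?, ?)",
    [(1, 0.5, "2010-08-13"), (1, 1.5, "2010-08-13"), (2, 2.0, "2010-08-14")],
)

# Sums, maxima and minima per device in one declarative query: this is
# the half of the aggregation work the database can do for you.
for row in conn.execute(
    "SELECT device_id, SUM(value), MAX(value), MIN(value) FROM signals GROUP BY device_id"
):
    print(row)  # (1, 2.0, 1.5, 0.5) then (2, 2.0, 2.0, 2.0)
```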

For OLAP something much simpler works: a key-value store is all you need. I use Redis because it too is very easy to work with and to set up. It also lets you store more than scalar values, which is convenient. Sometimes your value is actually a list or a hash; in most key-value stores you have to encode such values, but Redis handles them natively. The downside of Redis is that you can't do queries ("as in give me all rows that have this value for Y"); you have to keep indices into your data yourself. On the other hand, you won't need indices very much, since the answers to all your questions have been precomputed; all you need to do is look up the answer by a key that is defined by the question. For the question above, which day of the week has the most X, you look up the number of X for Monday, Tuesday, etc.; perhaps you've stored them under keys like X:monday, X:tuesday, and so on.

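A minimal sketch of that lookup pattern with the redis-py client (the key scheme follows the X:monday example above; the local connection details are assumptions):

```python
import redis  # pip install redis; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379, db=0)

def record_x(weekday: str) -> None:
    """Bump the precomputed answer to 'how many X happened on <weekday>?'."""
    r.incr(f"X:{weekday}")

def busiest_weekday() -> str:
    """Read back each weekday's counter; no query engine needed."""
    days = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
    return max(days, key=lambda d: int(r.get(f"X:{d}") or 0))

record_x("monday")
print(busiest_weekday())  # 'monday', once some counts are in
```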

In conclusion: MongoDB and Redis work great for me. I don't think MongoDB is very good for your use case; instead, I think you might actually benefit more from a traditional SQL database (but it depends: if your data is really simple, you could perhaps use Redis all the way). The most important thing is not to make the mistake of thinking that you need to have the data in one database and keep it forever. Aggregation and throwing away old data is key.

Answered by Ben Damman

CouchDB is very reliable, provides excellent durability, and you'll experience very low CPU load. It's also excellent at replicating between multiple nodes, either on-demand or continuously.

Thanks to its replication abilities and RESTful API (it uses HTTP for its API) you can scale horizontally pretty easily using mature tools. (Nginx or Apache for reverse proxying, HTTP load balancers, etc.)

You write map/reduce functions in JavaScript to precompute queries. The results are built up incrementally on disk, which means they only need to be computed once per signal. In other words, queries can be really fast because the database only has to do calculations on the signal data recorded since the last time you ran the query.

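For flavor, a CouchDB view lives in a design document whose map/reduce functions are JavaScript strings, uploaded over plain HTTP. A minimal sketch (the database name, document fields, and local URL are all assumptions for illustration):

```python
import json
import urllib.request  # CouchDB speaks plain HTTP, so the stdlib suffices

# A design document: the map function (and the built-in "_count" reducer)
# is JavaScript that CouchDB stores and evaluates incrementally.
design_doc = {
    "_id": "_design/signals",
    "views": {
        "per_day": {
            "map": "function(doc) {"
                   "  if (doc.timestamp) emit(doc.timestamp.slice(0, 10), 1);"
                   "}",
            "reduce": "_count",  # signals per day, updated as documents arrive
        }
    },
}

req = urllib.request.Request(
    "http://localhost:5984/signals/_design/signals",  # assumed local server/db
    data=json.dumps(design_doc).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)
# Querying .../signals/_design/signals/_view/per_day?group=true then returns
# one row per day, recomputed only over documents added since the last read.
```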

CouchDB trades disk space for performance, so you can expect to use a lot of disk space. Your queries can be lightning fast and conserve disk space if you implement them properly.

Give CouchDB a try.

Check out "Why Large Hadron Collider Scientists are Using CouchDB" and "CouchDB at the BBC as a fault tolerant, scalable, multi-data center key-value store".

Answered by jbellis

~3000 signals/minute = 50 writes/s, which any of these systems will be able to handle easily.

Cassandra will probably work best as your data set grows larger than memory, though, and the Hadoop integration will help with your data mining.

Answered by Kiran Subbaraman

You are looking for a datastore that allows "lightning fast" writes (data persisted on disk), where the data mining will occur at a later stage (this is the READ cycle). Also, considering the numbers you state, it turns out you will collect all of 159 MB of information per day, or approx 5 GB per month.

In this case, why not look at Redis?

You could always archive the daily Redis data file and refer to it later (if you have concerns about loading 5 GB or more of data into RAM, this archiving could be a workaround).

Redis is rather fast, based on the numbers published on its site. Hope this helps. Kiran

Answered by TTT

So you are storing data in a central DB for data mining? No online transaction processing?

I don't think that MongoDB does a good job when it comes to durability. See http://nosql.mypopescu.com/post/392868405/mongodb-durability-a-tradeoff-to-be-aware-of.

Maybe you can use the analytics DB Infobright; it has a community edition: http://www.infobright.org/

Answered by Evan

If you're liking the look of Cassandra for its designed-from-the-start ability to scale horizontally, tune consistency against availability and such, then you may also want to look at Riak, which has a similar feature set but a different approach.

Answered by cryptic_star

I've used MongoDB from Incanter and have liked it. Although I can't speak to the speed with such large datasets, Clojure (which Incanter is based on) is very reliable in terms of transaction management. Incanter also provides some great analysis tools, so if you're planning on analyzing all of that data, MongoDB + Incanter could be a powerful combination.
