MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Original source: http://stackoverflow.com/questions/3010224/
Asked by Mark Bao
I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
Answered by Skrylar
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate on data sets larger than the amount of RAM you have (it only writes to disk; reading is for restarting the server or in case of a system crash), and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example, we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; a link click may have a user token, link name and URL, while a page view may only have the user token and URL. Your first concern is just recording the fact that it happened, and pushing whatever absolutely necessary data you need.
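The "stage one" push described above can be sketched as follows. This is an in-process simulation using a `deque` to stand in for a Redis server (so it runs without any infrastructure); with a live Redis and the redis-py client, the append would be `r.rpush(list_name, payload)` instead. The list names and event fields are illustrative.

```python
import json
from collections import deque

# Stand-in for a Redis server: one list per event type. With real Redis,
# each deque would be a server-side list written via RPUSH.
lists = {"page_viewed": deque(), "link_clicked": deque()}

def push_event(event_type, **fields):
    # RPUSH semantics: append to the tail of the list -- O(1) in Redis.
    # Events are serialized so every entry in a list has the same structure.
    lists[event_type].append(json.dumps(fields))

push_event("page_viewed", user_token="u123", url="/home")
push_event("link_clicked", user_token="u123", name="signup", url="/signup")

print(len(lists["page_viewed"]), len(lists["link_clicked"]))
```

The point is that the hot path does nothing but a constant-time append; all enrichment happens later, in the workers.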
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
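A worker's drain loop might look like the sketch below. Again this uses in-process stand-ins so it is runnable as-is: with real infrastructure the queue would be a Redis list drained via LPOP/BLPOP, and `stored` would be an insert into Mongo or SQL. The dedup key is an assumption for illustration.

```python
import json
from collections import deque

# Stand-in for the Redis list the collectors push to (two duplicate events).
queue = deque([json.dumps({"user_token": "u123", "url": "/home"}),
               json.dumps({"user_token": "u123", "url": "/home"})])
stored = []        # stand-in for permanent storage (Mongo/SQL insert)
seen = set()

def worker_drain():
    # Pop from the head of the list (LPOP) until empty, dedupe, and file
    # each event into permanent storage.
    while queue:
        event = json.loads(queue.popleft())
        key = (event["user_token"], event["url"])   # illustrative dedup key
        if key in seen:
            continue
        seen.add(key)
        stored.append(event)

worker_drain()
print(len(stored))  # the duplicate event is collapsed
```

Running several such workers in parallel is what keeps Redis' memory load bearable under a sustained insert rate.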
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM, and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
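The analytical queries mentioned above are a good fit for SQL. Here is a minimal sketch using an in-memory SQLite database (standing in for the Postgres/MSSQL setup the answer suggests); the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_token TEXT, url TEXT, day TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", [
    ("u1", "/home",  "2010-06-01"),
    ("u2", "/home",  "2010-06-01"),
    ("u1", "/about", "2010-06-02"),
])

# "Show me everyone who views this page"
viewers = [r[0] for r in conn.execute(
    "SELECT DISTINCT user_token FROM page_views WHERE url = ?", ("/home",))]

# "What day had the most viewers in total?"
top_day = conn.execute(
    "SELECT day, COUNT(DISTINCT user_token) AS n FROM page_views "
    "GROUP BY day ORDER BY n DESC LIMIT 1").fetchone()

print(sorted(viewers), top_day[0])
```

These are exactly the kind of filters and aggregations a mature SQL engine gives you for free, and that Mongo or Redis would force into application code.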
Answered by Gates VP
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
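A minimal sketch of that buffer-and-flush approach: impressions accumulate in local memory and are appended to disk in one sequential batch. The file name, record fields, and flush trigger are illustrative assumptions.

```python
import json
import os
import tempfile
import threading

buffer = []
lock = threading.Lock()

def record(impression):
    # Hot path: just an in-memory append, no disk I/O.
    with lock:
        buffer.append(impression)

def flush(path):
    with lock:
        batch, buffer[:] = buffer[:], []     # swap the buffer out atomically
    with open(path, "a") as f:               # one sequential append per batch
        for imp in batch:
            f.write(json.dumps(imp) + "\n")
    return len(batch)

log_path = os.path.join(tempfile.mkdtemp(), "impressions.log")
record({"ad": "a1", "user": "u1"})
record({"ad": "a2", "user": "u2"})
flushed = flush(log_path)   # in production, call this from a timer, e.g. every minute
print(flushed)
```

The lock-then-swap keeps the collectors from blocking on disk I/O; only the copied batch is written while new impressions keep accumulating.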
Answered by Data Monk
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is: what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? Real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary index on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attributes, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers whose first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
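The client-side filtering described above looks something like this sketch. The `rows` list stands in for the result of a full scan of a Cassandra column family (fetched via a driver or MapReduce job); the column names are illustrative. The point is that the regex runs in the client, after every row has already crossed the wire.

```python
import re

# Stand-in for the rows returned by a full scan of the store.
rows = [
    {"id": 1, "first_name": "Alexandra"},
    {"id": 2, "first_name": "Alex"},
    {"id": 3, "first_name": "Bob"},
]

# The server cannot evaluate LIKE/regex predicates, so filter client-side.
pattern = re.compile(r"^Alex")
matches = [r["id"] for r in rows if pattern.match(r["first_name"])]
print(matches)
```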
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Answered by drdaeman
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becomes incredibly slow, probably because the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performance by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice probably would be to benchmark on your typical dataset and see for yourself.
Answered by Phat H. VU
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
Answered by Ben Hughes
If you have the choice (and need to move away from flat files) I would go with Redis. It's blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
Answered by Paul Harrison
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (e.g. total hits per page) can be obtained from flat files in a scalable manner using merge-sort/map-reduce-style algorithms. It's not too hard to roll your own.
SQLite now supports Write-Ahead Logging, which may also provide adequate performance.
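Enabling WAL in SQLite is a one-line pragma, sketched below with Python's built-in `sqlite3` module. Note that WAL requires a file-backed database (it is not available for `:memory:` connections); the table and file names are illustrative. Batching many inserts into one transaction gives the sequential-append write pattern the answer above is asking for.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.db")
conn = sqlite3.connect(path)

# Switch the journal to write-ahead logging; the pragma returns the mode in use.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # -> wal

conn.execute("CREATE TABLE impressions (ad TEXT, ts REAL)")
# One transaction per batch: the WAL turns these into sequential appends.
with conn:
    conn.executemany("INSERT INTO impressions VALUES (?, ?)",
                     [("ad%d" % i, 0.0) for i in range(100)])

count = conn.execute("SELECT COUNT(*) FROM impressions").fetchone()[0]
print(count)
```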
Answered by EhevuTov
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
Answered by Peter Long
I have hands-on experience with MongoDB, CouchDB and Cassandra. I converted a lot of files to base64 strings and inserted these strings into each NoSQL store.
MongoDB is the fastest, Cassandra is the slowest, and CouchDB is slow too.
I think MySQL would be much faster than all of them, but I haven't tried MySQL with my test case yet.