
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/11320907/

Date: 2020-09-09 12:43:35  Source: igfitidea

MongoDB: Billions of documents in a collection

mongodb

Asked by Elliot Chance

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.


Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?


Would sharding help? Should I try and split the data set over many collections and build that logic into my application?


Answered by Mark Hillick

It's hard to say what the optimal bulk-insert batch size is -- this partly depends on the size of the objects you're inserting and other hard-to-measure factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
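The batch-size experiment above can be sketched with a simple chunking helper. This is a stdlib-only sketch (the `batched` generator and the `ngrams.bigrams` names are my own, not from the answer); the actual pymongo `insert_many` call is shown as a comment since it needs a live server:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage against a running mongod (requires pymongo):
#   from pymongo import MongoClient
#   coll = MongoClient().ngrams.bigrams
#   for batch in batched(bigram_docs, 10_000):
#       coll.insert_many(batch, ordered=False)  # unordered inserts are
#                                               # faster for bulk loads

if __name__ == "__main__":
    docs = ({"bigram": f"w{i} w{i+1}"} for i in range(25))
    print([len(b) for b in batched(docs, 10)])
```

Varying the `size` argument (1,000 vs 10,000 vs 50,000) and timing each run is one way to find the sweet spot for your document sizes.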


Mongo can easily handle billions of documents and can have billions of documents in the one collection, but remember that the maximum document size is 16 MB. There are many folk with billions of documents in MongoDB and there's lots of discussion about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.


Here's a presentation from Craigslist on inserting billions of documents into MongoDB, and the guy's blog post.


It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.


However, some users run multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.


You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.

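The pre-split-and-stop-the-balancer advice above can be sketched in pymongo roughly as follows. This is an untested sketch, not a script from the answer: the `ngrams.bigrams` namespace, the `bigram` shard key, and the split points are all illustrative assumptions, and it requires a running sharded cluster reachable through a mongos.

```python
# Sketch: pre-splitting a sharded collection before a bulk load.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect to the mongos
admin = client.admin

admin.command("enableSharding", "ngrams")
admin.command("shardCollection", "ngrams.bigrams", key={"bigram": 1})

# Pre-split on chosen boundaries so chunks exist on the shards before
# loading, then keep the balancer out of the way during the import.
for point in ["f", "m", "s"]:  # example split points; derive yours from
    admin.command("split", "ngrams.bigrams", middle={"bigram": point})

admin.command("balancerStop")  # MongoDB 3.4+; older versions toggle the
                               # balancer via config.settings instead
```

Choosing split points that match the real distribution of your bigrams is the part that matters: evenly populated chunks are what make the pre-split pay off.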

Here are some good links -


Answered by Eric J.

You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.


For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.
