MongoDB Schema Design - Many small documents or fewer large documents?
Original source: http://stackoverflow.com/questions/3038703/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
MongoDB Schema Design - Many small documents or fewer large documents?
Asked by Andre
Background
I'm prototyping a conversion from our RDBMS database to MongoDB. While denormalizing, it seems I have two choices: one leads to many (millions of) smaller documents, the other to fewer (hundreds of thousands of) large documents.
If I could distill it down to a simple analogy, it would be the difference between a collection with fewer Customer documents like this (in Java):
class Customer {
    private String name;
    private Address address;
    // each CreditCard has hundreds of Payment instances
    private Set<CreditCard> creditCards;
}
or a collection with many, many Payment documents like this:
class Payment {
    private Customer customer;
    private CreditCard creditCard;
    private Date payDate;
    private float payAmount;
}
Question
Is MongoDB designed to prefer many, many small documents or fewer large documents? Does the answer mostly depend on what queries I plan on running? (e.g. "How many credit cards does customer X have?" vs. "What was the average amount all customers paid last month?")
I've looked around a lot but I didn't stumble into any MongoDB schema best practices that would help me answer my question.
Accepted answer by Gates VP
You'll definitely need to optimize for the queries you're doing.
Here's my best guess based on your description.
You'll probably want to know all Credit Cards for each Customer, so keep an array of those within the Customer Object. You'll also probably want to have a Customer reference for each Payment. This will keep the Payment document relatively small.
The Payment object will automatically have its own ID and index. You'll probably want to add an index on the Customer reference as well.
This will allow you to quickly search for Payments by Customer without storing the whole customer object every time.
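As a rough illustration of the layout described above, here is a minimal in-memory sketch in plain Java. It is not MongoDB driver code: the class names (CustomerDoc, PaymentDoc, PaymentStore) are hypothetical, and a HashMap stands in for the index on the Customer reference.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-ins for two collections. This is not MongoDB driver code.
class CustomerDoc {
    final String id;                 // plays the role of _id
    final String name;
    final List<String> creditCards;  // small array embedded in the customer

    CustomerDoc(String id, String name, List<String> creditCards) {
        this.id = id;
        this.name = name;
        this.creditCards = creditCards;
    }
}

class PaymentDoc {
    final String id;          // plays the role of _id
    final String customerId;  // reference to CustomerDoc.id, not a copy of the customer
    final double amount;      // double is used for brevity only

    PaymentDoc(String id, String customerId, double amount) {
        this.id = id;
        this.customerId = customerId;
        this.amount = amount;
    }
}

class PaymentStore {
    private final List<PaymentDoc> payments = new ArrayList<>();
    // Stand-in for a secondary index on the customer reference.
    private final Map<String, List<PaymentDoc>> byCustomer = new HashMap<>();

    void insert(PaymentDoc payment) {
        payments.add(payment);
        byCustomer.computeIfAbsent(payment.customerId, k -> new ArrayList<>()).add(payment);
    }

    // Indexed lookup: only the payments of one customer are touched.
    List<PaymentDoc> findByCustomer(String customerId) {
        return byCustomer.getOrDefault(customerId, Collections.emptyList());
    }

    int size() {
        return payments.size();
    }
}
```

Each payment carries only the customer's id, so the payment stays small while the lookup by customer avoids scanning every payment.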
If you want to answer questions like "What was the average amount all customers paid last month", you're instead going to want a map/reduce job for any sizeable dataset. You're not getting this response in "real time". You'll find that storing a "reference" to Customer is probably good enough for these map/reduce jobs.
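To make the shape of such a job concrete, here is a small sketch that uses Java streams as a stand-in for a map/reduce pass. It is not MongoDB's map/reduce API; the names PaymentRecord and AveragePayment.averageBetween are hypothetical.

```java
import java.time.LocalDate;
import java.util.List;

class PaymentRecord {
    final String customerId;   // a reference is enough, the customer itself is not needed
    final LocalDate payDate;
    final double payAmount;    // double is used for brevity only

    PaymentRecord(String customerId, LocalDate payDate, double payAmount) {
        this.customerId = customerId;
        this.payDate = payDate;
        this.payAmount = payAmount;
    }
}

class AveragePayment {
    // Visits every payment once, which is why this behaves like a batch job
    // rather than an indexed, real-time lookup.
    static double averageBetween(List<PaymentRecord> payments,
                                 LocalDate fromInclusive, LocalDate toExclusive) {
        double[] sumAndCount = payments.stream()
            .filter(p -> !p.payDate.isBefore(fromInclusive) && p.payDate.isBefore(toExclusive))
            .map(p -> new double[] {p.payAmount, 1.0})          // map: emit (amount, 1)
            .reduce(new double[] {0.0, 0.0},                    // reduce: add up sums and counts
                    (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return sumAndCount[1] == 0.0 ? 0.0 : sumAndCount[0] / sumAndCount[1];
    }
}
```

The map step emits a (sum, count) pair per payment so that partial results can be combined in any order, which is the property a distributed reduce relies on.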
So to answer your question directly: Is MongoDB designed to prefer many, many small documents or fewer large documents?
MongoDB is designed to find indexed entries very quickly. MongoDB is very good at finding a few needles in a large haystack. MongoDB is not very good at finding most of the needles in the haystack. So build your data around your most common use cases and write map/reduce jobs for the rarer use cases.
Answer by bmaupin
According to MongoDB's own documentation, it sounds like it's designed for many small documents.
From Performance Best Practices for MongoDB:
The maximum size for documents in MongoDB is 16 MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document.
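To put the 16 MB limit in perspective, the following back-of-the-envelope sketch estimates how many 12-byte ObjectIDs a single BSON array could hold. It assumes the BSON layout of one type byte plus the decimal index as a NUL-terminated key per array element, and it ignores the overhead of the enclosing document; the class and method names are hypothetical.

```java
class ObjectIdArrayEstimate {
    static final long MAX_DOCUMENT_BYTES = 16L * 1024 * 1024; // the 16 MB limit quoted above
    static final int OBJECT_ID_BYTES = 12;                    // size of a BSON ObjectId value

    // Bytes for one array element: type byte + decimal index as a C string + value.
    static long elementBytes(long index) {
        int keyBytes = Long.toString(index).length() + 1; // digits plus the terminating NUL
        return 1 + keyBytes + OBJECT_ID_BYTES;
    }

    // Counts how many ObjectIds fit into one array before the byte budget is exhausted.
    static long maxObjectIds(long budgetBytes) {
        long used = 5; // array framing: int32 length prefix + trailing NUL
        long count = 0;
        while (used + elementBytes(count) <= budgetBytes) {
            used += elementBytes(count);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(maxObjectIds(MAX_DOCUMENT_BYTES));
    }
}
```

Under these assumptions the result is on the order of 800,000 ids, so an array that grows without bound will eventually hit the limit even when it stores nothing but references.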
From 6 Rules of Thumb for MongoDB Schema Design: Part 1:
Modeling One-to-Few
An example of “one-to-few” might be the addresses for a person. This is a good use case for embedding – you'd put the addresses in an array inside of your Person object.
One-to-Many
An example of “one-to-many” might be parts for a product in a replacement parts ordering system. Each product may have up to several hundred replacement parts, but never more than a couple thousand or so. This is a good use case for referencing – you'd put the ObjectIDs of the parts in an array in product document.
One-to-Squillions
An example of “one-to-squillions” might be an event logging system that collects log messages for different machines. Any given host could generate enough messages to overflow the 16 MB document size, even if all you stored in the array was the ObjectID. This is the classic use case for “parent-referencing” – you'd have a document for the host, and then store the ObjectID of the host in the documents for the log messages.
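The three relationship shapes quoted above can be sketched as plain Java classes. This is only an illustration of where the reference lives in each case; all class and field names are hypothetical and ids are simplified to strings.

```java
import java.util.ArrayList;
import java.util.List;

// One-to-few: the children are embedded in the parent.
class Person {
    final String name;
    final List<String> addresses = new ArrayList<>(); // read and written together with the person

    Person(String name) { this.name = name; }
}

// One-to-many: the parent holds an array of child ids; parts live in their own collection.
class Part {
    final String id;
    Part(String id) { this.id = id; }
}

class Product {
    final String id;
    final List<String> partIds = new ArrayList<>(); // bounded: hundreds, maybe a couple thousand

    Product(String id) { this.id = id; }
}

// One-to-squillions: the child stores the parent's id; the parent holds no array at all.
class Host {
    final String id;
    Host(String id) { this.id = id; }
}

class LogMessage {
    final String hostId; // parent reference
    final String text;

    LogMessage(String hostId, String text) {
        this.hostId = hostId;
        this.text = text;
    }
}

class RelationshipShapes {
    // With parent-referencing, a host's messages are found by filtering on hostId.
    static List<LogMessage> messagesForHost(List<LogMessage> allMessages, String hostId) {
        List<LogMessage> result = new ArrayList<>();
        for (LogMessage message : allMessages) {
            if (message.hostId.equals(hostId)) {
                result.add(message);
            }
        }
        return result;
    }
}
```

In a real deployment the filter in messagesForHost would presumably be served by an index on the parent reference rather than a full scan.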
Answer by Terris
Documents that grow substantially over time can be ticking time bombs. Network bandwidth and RAM usage will likely become measurable bottlenecks, forcing you to start over.
First, let's consider two collections: Customer and Payment. Thus, the grain is fairly small: one document per payment.
Next you must decide how to model account information, such as credit cards. Let's consider whether customer documents contain arrays of account information or whether you need a new Account collection.
If account documents are separate from customer documents, loading all of the accounts for one customer into memory requires fetching multiple documents. That might translate into extra memory, I/O, bandwidth, and CPU usage. Does that immediately mean the Account collection is a bad idea?
Your decision affects payment documents. If account information is embedded in a customer document, how would you reference it? Separate account documents have their own _id attribute. With embedded account information, your application would either generate new ids for accounts or use the account's attributes (e.g., account number) for the key.
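The two ways of referencing an account from a payment might look like the sketch below. It is plain Java rather than driver code, and every name in it (EmbeddedAccount, CustomerWithAccounts, PaymentToEmbedded, PaymentToAccountDoc) is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Account information embedded in the customer: it has no _id of its own.
class EmbeddedAccount {
    final String accountNumber; // natural key supplied by the application
    final String type;

    EmbeddedAccount(String accountNumber, String type) {
        this.accountNumber = accountNumber;
        this.type = type;
    }
}

class CustomerWithAccounts {
    final String id; // plays the role of _id
    final List<EmbeddedAccount> accounts = new ArrayList<>();

    CustomerWithAccounts(String id) { this.id = id; }

    // Resolving a reference means loading the customer and scanning its array.
    Optional<EmbeddedAccount> findAccount(String accountNumber) {
        return accounts.stream()
            .filter(a -> a.accountNumber.equals(accountNumber))
            .findFirst();
    }
}

// Option A: the payment needs a compound reference (customer id + account key).
class PaymentToEmbedded {
    final String customerId;
    final String accountNumber;
    final double amount; // double is used for brevity only

    PaymentToEmbedded(String customerId, String accountNumber, double amount) {
        this.customerId = customerId;
        this.accountNumber = accountNumber;
        this.amount = amount;
    }
}

// Option B: with a separate Account collection, a single _id is enough.
class PaymentToAccountDoc {
    final String accountId;
    final double amount;

    PaymentToAccountDoc(String accountId, double amount) {
        this.accountId = accountId;
        this.amount = amount;
    }
}
```

Option A keeps everything about a customer in one document at the cost of a two-part reference; option B costs an extra fetch but keeps every reference a plain _id.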
Could a payment document actually contain all the payments made in a fixed timeframe (e.g., a day)? Such complexity will affect all code that reads and writes payment documents. Premature optimization can be deadly to projects.
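A small sketch of what such bucketing implies for application code, assuming one bucket per customer per day. The names DailyPaymentBucket and BucketedPayments are hypothetical, and a HashMap stands in for the collection.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One document holds all payments of one customer for one day.
class DailyPaymentBucket {
    final String customerId;
    final LocalDate day;
    final List<Double> amounts = new ArrayList<>(); // individual payments have no _id of their own

    DailyPaymentBucket(String customerId, LocalDate day) {
        this.customerId = customerId;
        this.day = day;
    }
}

class BucketedPayments {
    private final Map<String, DailyPaymentBucket> buckets = new HashMap<>();

    // Every write must first locate (or create) the right bucket.
    void record(String customerId, LocalDate day, double amount) {
        String key = customerId + "|" + day;
        buckets.computeIfAbsent(key, k -> new DailyPaymentBucket(customerId, day))
               .amounts.add(amount);
    }

    // Every read must unpack the buckets again.
    double totalFor(String customerId) {
        double total = 0.0;
        for (DailyPaymentBucket bucket : buckets.values()) {
            if (bucket.customerId.equals(customerId)) {
                for (double amount : bucket.amounts) {
                    total += amount;
                }
            }
        }
        return total;
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

Fewer documents are stored, but a single payment can no longer be referenced by its own _id, and both the write path and the read path carry the bucketing logic.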
Like account documents, payments are easily referenced as long as a payment document contains only one payment. A new type of document, credit for example, could reference a payment. But would you create a Credit collection or would you embed credit information inside payment information? What would happen if you later needed to reference a credit?
To summarize, I have been successful with lots of small documents and many collections. I implement references with _id and only with _id. Thus, I don't worry about ever-growing documents destroying my application. The schema is easy to understand and index because each entity has its own collection. Important entities aren't hiding inside other documents.
I'd love to hear about your findings. Good luck!