MongoDB querying performance for over 5 million records

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/19559405/

Tags: mongodb, indexing, sharding

Asked by Yarin Miran

We've recently passed 2 million records in one of our main collections, and we've started to suffer from major performance issues on that collection.

The documents in the collection have about 8 fields which you can filter by using the UI, and the results are supposed to be sorted by a timestamp field recording when the record was processed.

I've added several compound indexes on the filtered fields and the timestamp, e.g.:

db.events.ensureIndex({somefield: 1, timestamp:-1})

I've also added a couple of indexes for using several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to execute.

Using explain() I've made sure that the queries do use the indexes I've created, but performance is still not good enough.

I was wondering if sharding is the way to go now, but we will soon start to add about 1 million new records per day to that collection, so I'm not sure it will scale well.

EDIT: example for a query:

> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['[email protected]']}}).sort({timestamp: -1}).limit(25).explain()
{
        "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 30060,
        "nscanned" : 30060,
        "nscannedObjectsAllPlans" : 120241,
        "nscannedAllPlans" : 120241,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 26495,
        "indexBounds" : {
                "user.userName" : [
                        [
                                "[email protected]",
                                "[email protected]"
                        ]
                ],
                "timestamp" : [
                        [
                                {
                                        "$maxElement" : 1
                                },
                                {
                                        "$minElement" : 1
                                }
                        ]
                ]
        },
        "server" : "yarin:27017"
}

Please note that deviceType has only 2 values in my collection.

Accepted answer by mnemosyn

This is searching for a needle in a haystack. We'd need some explain() output for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:

  1. Ensure it's not because of insufficient RAM and excessive paging
  2. Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for how long a query or command may take; anything slower is logged)
  3. Inspect the slow queries in db.system.profile and run them manually using explain()
  4. Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned, etc.
  5. Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing that filter setting for the end user, or show a warning dialog that the operation might be slow. (A shell sketch of steps 2 and 3 follows this list.)
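
For example, a minimal mongo-shell sketch of steps 2 and 3 (the 100 ms threshold and the events collection are illustrative choices, not from the original post):

// Log every operation on this database that takes longer than 100 ms
db.setProfilingLevel(1, 100)

// Later: list the ten slowest captured operations
db.system.profile.find().sort({ millis: -1 }).limit(10).pretty()

// Re-run a suspect query by hand to check which index it uses
db.events.find({ somefield: "foo" }).sort({ timestamp: -1 }).explain()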

A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.

Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.

Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so across 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents (100 / 0.6, since 60% of the users are active), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use over time is high, you might end up having to scan thousands of documents before finding a hundred matches.

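A quick way to gauge a field's selectivity before indexing it is to count documents per distinct value; a sketch (the users collection and status field mirror the example above):

// Count documents per distinct value; a couple of huge buckets = low selectivity
db.users.aggregate([
    { $group: { _id: "$status", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])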

So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.

Things get worse when the data is too big for the RAM, because then having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk at random, which takes a lot of time.

The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.

If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.

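Such a two-step lookup might look like this sketch (externalSearch and the call it makes are hypothetical placeholders for whatever search engine is used):

// Hypothetical: an external search engine returns the matching _ids
var ids = externalSearch.findMatchingIds({ deviceType: "MOBILE" });

// Fetch just those documents from mongo, keeping the timestamp sort
db.audit.find({ _id: { $in: ids } }).sort({ timestamp: -1 }).limit(25)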

-- EDIT --

The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "[email protected]". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by a high-selectivity index. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'mobile'.

I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if he switched between mobile and web on a daily basis, the query would be fast).

Making this particular query faster could be done using a compound index that contains the device type, e.g. using

a) ensureIndex({'username': 1, 'userAgent.deviceType' : 1, 'timestamp' :-1})

or

b) ensureIndex({'userAgent.deviceType' : 1, 'username' : 1, 'timestamp' :-1})

Unfortunately, that means that queries like find({"username" : "foo"}).sort({"timestamp" : -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.

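That username-only query would then need its own dedicated index; judging by the BtreeCursor name in the explain output above, the poster already has the equivalent one on the audit collection:

// Dedicated index for filtering by username and sorting by timestamp
db.audit.ensureIndex({ 'user.userName' : 1, timestamp : -1 })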

I'm afraid there's no very good solution for this using mongodb at this time.

Answered by Mark Meeus

Mongo only uses 1 index per query. So if you want to filter on 2 fields, mongo will use the index with one of the fields, but still needs to scan the entire subset.

This means that basically you'll need an index for every type of query in order to achieve the best performance.

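In practice that means one compound index per filter-and-sort combination, along these lines (field names are illustrative):

// One index per query shape, each ending with the common sort field
db.events.ensureIndex({ fieldA : 1, timestamp : -1 })
db.events.ensureIndex({ fieldB : 1, timestamp : -1 })
db.events.ensureIndex({ fieldA : 1, fieldB : 1, timestamp : -1 })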

Depending on your data, it might not be a bad idea to run one query per field and combine the results in your app. That way you only need an index per field, but it may be too much data to process.

Answered by Gopal

If you are using $in, mongodb never uses the index. Change your query by removing this $in; it should then use the index, and it would give better performance than what you got earlier.

http://docs.mongodb.org/manual/core/query-optimization/
