MongoDB aggregation $group followed by $limit for pagination
Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32065362/
Asked by Poorna
In the MongoDB aggregation pipeline, does the flow of records from stage to stage happen one record (or one batch) at a time, or does each stage wait until it has processed the whole collection before passing its results to the next stage?
For example, I have a collection classtest with the following sample records:
{name: "Person1", marks: 20}
{name: "Person2", marks: 20}
{name: "Person1", marks: 20}
I have a total of 1000 records for about 100 students, and I have the following aggregate query:
db.classtest.aggregate([
    {$sort: {name: 1}},
    {$group: {_id: '$name',
              total: {$sum: '$marks'}}},
    {$limit: 5}
])
I have the following questions:
- The sort order is lost in the final results. If I place another $sort after $group, then the results are sorted properly. Does that mean $group does not maintain the previous sort order?
- I would like to limit the results to 5. Does the group operation have to be completely done (for all 1000 records) before anything is passed to $limit, or does the group stage pass records to the limit stage as they become available and stop processing once the limit stage's requirement is met?
My actual idea is to do pagination on the results of the aggregation. In the above scenario, if $group maintained the sort order and processed only the required number of records, I would want to apply a $match condition such as {name: {$gt: lastPersonName}} in subsequent page queries.
- I do not want to apply $limit before $group, as I want results for 5 students, not for the first 5 records.
- I may not want to use $skip, as that effectively means traversing that many records.
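For reference, here is a sketch of the kind of follow-up page query being considered; lastPersonName is the placeholder from the question, holding the last student name of the previous page, and whether $group can stop early in such a pipeline is exactly what is being asked:

db.classtest.aggregate([
    // only consider students after the last one returned on the previous page
    {$match: {name: {$gt: lastPersonName}}},
    {$sort: {name: 1}},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$limit: 5}
])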
Accepted answer by Poorna
I have solved the problem without needing to maintain another collection, and even without $group traversing the whole collection, hence posting my own answer.
As others have pointed out:
- $group doesn't retain order, hence early sorting is not of much help.
- $group doesn't do any optimization, even if there is a following $limit, i.e., it runs $group on the entire collection.
My use case has the following unique features, which helped me to solve it:
- There will be a maximum of 10 records per student (and a minimum of 1).
- I am not very particular about page size; the front end is capable of handling varying page sizes.

The following is the aggregation command I have used:

db.classtest.aggregate([
    {$sort: {name: 1}},
    {$limit: 5 * 10},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$sort: {_id: 1}}
])
Explaining the above:
- If $sort immediately precedes $limit, the framework optimizes the amount of data to be sent to the next stage. Refer here.
- To get a minimum of 5 records (page size), I need to pass at least 5 (page size) * 10 (max records per student) = 50 records to the $group stage. With this, the size of the final result may be anywhere between 0 and 50.
- If the result has fewer than 5 records, then no further pagination is required.
- If the result size is greater than 5, there is a chance that the last student's records were not completely processed (i.e., not all of that student's records were grouped), hence I discard the last record from the result.

The name in the last retained record is then used as the $match criterion in the subsequent page request, as shown below:
db.classtest.aggregate([
    {$match: {name: {$gt: lastRecordName}}},
    {$sort: {name: 1}},
    {$limit: 5 * 10},
    {$group: {_id: '$name', total: {$sum: '$marks'}}},
    {$sort: {_id: 1}}
])
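Putting that discard-the-last-group logic together on the client, here is a rough sketch in mongo shell JavaScript; the fetchPage helper and the pageSize / maxRecordsPerStudent variables are illustrative names, not part of the original answer:

// page size and the known upper bound on records per student, from the use case above
var pageSize = 5;
var maxRecordsPerStudent = 10;

function fetchPage(lastRecordName) {
    var pipeline = [];
    if (lastRecordName) {
        // resume after the last fully processed student of the previous page
        pipeline.push({$match: {name: {$gt: lastRecordName}}});
    }
    pipeline.push(
        {$sort: {name: 1}},
        {$limit: pageSize * maxRecordsPerStudent},
        {$group: {_id: '$name', total: {$sum: '$marks'}}},
        {$sort: {_id: 1}}
    );

    var results = db.classtest.aggregate(pipeline).toArray();
    if (results.length > pageSize) {
        // the last group may be incomplete (its student's records may have been
        // cut off by $limit), so drop it; the previous group's _id then serves
        // as the cursor for the next page request
        results.pop();
    }
    return results;
}

var firstPage = fetchPage(null);
var nextPage = fetchPage(firstPage[firstPage.length - 1]._id);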
In the second-page query above, the framework will still optimize $match, $sort and $limit together as a single operation, which I have confirmed through the explain plan.
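A minimal way to inspect that coalescing yourself is to pass the standard explain option to aggregate (shown against the same classtest collection; lastRecordName is again a placeholder for the previous page's cursor):

db.classtest.aggregate(
    [
        {$match: {name: {$gt: lastRecordName}}},
        {$sort: {name: 1}},
        {$limit: 5 * 10},
        {$group: {_id: '$name', total: {$sum: '$marks'}}},
        {$sort: {_id: 1}}
    ],
    {explain: true}   // return the query plan instead of the results
)

In the explain output, the leading $match, $sort and $limit typically appear folded into the initial $cursor stage rather than as separate pipeline stages.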
Answered by Blakes Seven
The first thing to consider here is that the aggregation framework works with a "pipeline" of stages applied in order to produce a result. If you are familiar with processing things on the "command line" or "shell" of your operating system, then you might have some experience with the "pipe" or | operator.
Here is a common unix idiom:
ps -ef | grep mongod | tee "out.txt"
In this case the output of the first command, ps -ef, is "piped" to the next command, grep mongod, which in turn has its output "piped" to tee out.txt, which writes both to the terminal and to the specified file name. This is a "pipeline" where each stage "feeds" the next, in the "order" of the sequence they are written in.
The same is true of the aggregation pipeline. A "pipeline" here is in fact an "array": an ordered set of instructions applied to the data in order to produce a result.
db.classtest.aggregate([
    { "$group": {
        "_id": "$name",
        "total": { "$sum": "$marks" }
    }},
    { "$sort": { "_id": 1 } },   // after $group the student name lives in _id
    { "$limit": 5 }
])
So what happens here is that all of the items in the collection are first processed by $group to get their totals. There is no specified "order" to grouping, so there is not much sense in pre-ordering the data. Neither is there any point in doing so, because you have yet to get to your later stages.
Then you would $sort the results and also $limit as required.
For your next "page" of data you will ideally want to $match on the last unique name found, like so:
db.classtest.aggregate([
    { "$match": { "name": { "$gt": lastNameFound } }},
    { "$group": {
        "_id": "$name",
        "total": { "$sum": "$marks" }
    }},
    { "$sort": { "_id": 1 } },   // again, sort on _id, which now holds the name
    { "$limit": 5 }
])
It's not the best solution, but there really are no alternatives for this type of grouping. It will, however, notably get "faster" with each iteration towards the end. Alternatively, storing all the unique names (or reading them out of another collection) and "paging" through that list with a "range query" on each aggregation statement may be a viable option, if your data permits it.
Something like:
db.classtest.aggregate([
    { "$match": { "name": { "$gte": "Allan", "$lte": "David" } }},
    { "$group": {
        "_id": "$name",
        "total": { "$sum": "$marks" }
    }},
    { "$sort": { "_id": 1 } }
])
Unfortunately there is not a "limit grouping up until x results" option, so unless you can work with another list, you are basically grouping up everything (and possibly a gradually smaller set each time) with each aggregation query you send.
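As a rough sketch of that "work with another list" idea (assuming the set of unique names is small enough to hold in memory; the totalsForPage helper and pageSize variable are illustrative names, not part of the original answer):

// one-off: all unique student names, sorted
var names = db.classtest.distinct("name").sort();
var pageSize = 5;

function totalsForPage(pageNumber) {
    var slice = names.slice(pageNumber * pageSize, (pageNumber + 1) * pageSize);
    if (slice.length === 0) return [];
    return db.classtest.aggregate([
        // range query covering just this page's names
        { "$match": { "name": { "$gte": slice[0], "$lte": slice[slice.length - 1] } }},
        { "$group": { "_id": "$name", "total": { "$sum": "$marks" } }},
        { "$sort": { "_id": 1 } }
    ]).toArray();
}

totalsForPage(0);   // totals for the first five students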
Answered by dgiugg
1. "$group does not order its output documents." See http://docs.mongodb.org/manual/reference/operator/aggregation/group/
2. $limit limits the number of processed elements of an immediately preceding $sort operation, not only the number of elements passed to the next stage. See the note at http://docs.mongodb.org/manual/reference/operator/aggregation/limit/
For the very first question you asked, I am not sure, but it appears (see 1.) that a stage n+1 can influence the behaviour of stage n: the limit will limit the sort operation to its first n elements, and the sort operation will not need to run to completion as it would if the following limit stage did not exist.
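To illustrate the point that note makes, when $limit immediately follows $sort the server only has to track the current top n documents while sorting, rather than fully sorting all 1000 records first (a sketch against the question's collection):

db.classtest.aggregate([
    {$sort: {name: 1}},   // coalesces with the following $limit,
    {$limit: 5}           // so only the top 5 documents are kept during the sort
])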