mongodb 按多个字段对值进行分组

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/22932364/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-09 13:40:01  来源:igfitidea点击:

mongodb group values by multiple fields

mongodbaggregation-framework

提问by fervid

For example, I have these documents:

例如,我有这些文件:

{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book5"
},
{
  "addr": "address3",
  "book": "book9"
},
{
  "addr": "address2",
  "book": "book5"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address15",
  "book": "book1"
},
{
  "addr": "address9",
  "book": "book99"
},
{
  "addr": "address90",
  "book": "book33"
},
{
  "addr": "address4",
  "book": "book3"
},
{
  "addr": "address5",
  "book": "book1"
},
{
  "addr": "address77",
  "book": "book11"
},
{
  "addr": "address1",
  "book": "book1"
}

and so on.


How can I make a request, which will describe the top N addresses and the top M books per address?

Example of expected result:

address1 | book_1: 5
| book_2: 10
| book_3: 50
| total: 65
______________________
address2 | book_1: 10
| book_2: 10
|...
| book_M: 10
| total: M*10
...
______________________
addressN | book_1: 20
| book_2: 20
|...
| book_M: 20
| total: M*20

等等。


我如何提出请求,描述前 N 个地址和每个地址前 M 本书?

预期结果示例:

address1 | book_1: 5
| book_2: 10
| book_3: 50
| 总计:65
____________ 地址2
| book_1: 10
| book_2: 10
|...
| book_M: 10
| 总计:M*10
...
______________________
地址N | book_1: 20
| book_2: 20
|...
| book_M: 20
| 总计:M*20

回答by Neil Lunn

TLDR Summary

TLDR 摘要

In modern MongoDB releases you can brute force this with $slicejust off the basic aggregation result. For "large" results, run parallel queries instead for each grouping ( a demonstration listing is at the end of the answer ), or wait for SERVER-9377to resolve, which would allow a "limit" to the number of items to $pushto an array.

在现代 MongoDB 版本中,您可以$slice仅使用基本聚合结果来强制执行此操作。对于“大”的结果,对每个分组运行并行查询,而不是(示范上市是在回答结束时),或等待SERVER-9377来解决,这将允许“限制”的项目的数量$push至大批。

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])


MongoDB 3.6 Preview

MongoDB 3.6 预览版

Still not resolving SERVER-9377, but in this release $lookupallows a new "non-correlated" option which takes an "pipeline"expression as an argument instead of the "localFields"and "foreignFields"options. This then allows a "self-join" with another pipeline expression, in which we can apply $limitin order to return the "top-n" results.

仍然没有解决SERVER-9377,但在此版本中$lookup允许一个新的“非相关”选项,它将"pipeline"表达式作为参数而不是"localFields"and"foreignFields"选项。然后这允许与另一个管道表达式的“自连接”,我们可以在其中应用$limit以返回“top-n”结果。

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": { 
        "$expr": { "$eq": [ "$addr", "$$addr"] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1  } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

The other addition here is of course the ability to interpolate the variable through $exprusing $matchto select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limiteach result separately.

这里的另一个添加当然是通过$expr使用$match选择“连接”中的匹配项来插入变量的能力,但一般前提是“管道内的管道”,其中可以通过来自父级的匹配来过滤内部内容. 由于它们本身都是“管道”,因此我们可以$limit分别得出各自的结果。

This would be the next best option to running parallel queries, and actually would be better if the $matchwere allowed and able to use an index in the "sub-pipeline" processing. So which is does not use the "limit to $push" as the referenced issue asks, it actually delivers something that should work better.

这将是运行并行查询的下一个最佳选择,如果$match允许并能够在“子管道”处理中使用索引,实际上会更好。因此,$push正如所引用的问题所要求的那样,它没有使用“限制到”,它实际上提供了应该更好地工作的东西。



Original Content

原创内容

You seem have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve though not with the exact limiting that you ask for:

您似乎偶然发现了最重要的“N”问题。在某种程度上,您的问题很容易解决,尽管没有您要求的确切限制:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

Now that will give you a result like this:

现在这会给你一个这样的结果:

{
    "result" : [
            {
                    "_id" : "address1",
                    "books" : [
                            {
                                    "book" : "book4",
                                    "count" : 1
                            },
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 3
                            }
                    ],
                    "count" : 5
            },
            {
                    "_id" : "address2",
                    "books" : [
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 2
                            }
                    ],
                    "count" : 3
            }
    ],
    "ok" : 1
}

So this differs from what you are asking in that, while we do get the top results for the address values the underlying "books" selection is not limited to only a required amount of results.

因此,这与您所要求的不同,虽然我们确实获得了地址值的最高结果,但基础“书籍”选择不仅限于所需的结果数量。

This turns out to be very difficult to do, but it can be done though the complexity just increases with the number of items you need to match. To keep it simple we can keep this at 2 matches at most:

事实证明这很难做到,但可以做到,尽管复杂性会随着您需要匹配的项目数量而增加。为简单起见,我们最多可以将其保留为 2 场比赛:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
      "_id": "$_id",
      "num1": { "$first": "$newBooks" }
    }},
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group":{
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false],0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

So that will actually give you the top 2 "books" from the top two "address" entries.

因此,这实际上会为您提供前两个“地址”条目中的前 2 个“书籍”。

But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.

但是为了我的钱,保持第一种形式,然后简单地“切片”返回的数组元素以获取前“N”个元素。



Demonstration Code

演示代码

The demonstration code is appropriate for usage with current LTS versions of NodeJS from v8.x and v10.x releases. That's mostly for the async/awaitsyntax, but there is nothing really within the general flow that has any such restriction, and adapts with little alteration to plain promises or even back to plain callback implementation.

演示代码适用于 v8.x 和 v10.x 版本的当前 LTS 版本的 NodeJS。这主要是针对async/await语法的,但在一般流程中没有任何真正有任何此类限制的内容,并且几乎不改变普通承诺甚至回到普通回调实现。

index.js

索引.js

const { MongoClient } = require('mongodb');
const fs = require('mz/fs');

const uri = 'mongodb://localhost:27017';

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const client = await MongoClient.connect(uri);

    const db = client.db('bookDemo');
    const books = db.collection('books');

    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Clear and load books
    await books.deleteMany({});

    await books.insertMany(
      (await fs.readFile('books.json'))
        .toString()
        .replace(/\n$/,"")
        .split("\n")
        .map(JSON.parse)
    );

    if ( version >= 3.6 ) {

    // Non-correlated pipeline with limits
      let result = await books.aggregate([
        { "$group": {
          "_id": "$addr",
          "count": { "$sum": 1 }
        }},
        { "$sort": { "count": -1 } },
        { "$limit": 2 },
        { "$lookup": {
          "from": "books",
          "as": "books",
          "let": { "addr": "$_id" },
          "pipeline": [
            { "$match": {
              "$expr": { "$eq": [ "$addr", "$$addr" ] }
            }},
            { "$group": {
              "_id": "$book",
              "count": { "$sum": 1 },
            }},
            { "$sort": { "count": -1 } },
            { "$limit": 2 }
          ]
        }}
      ]).toArray();

      log({ result });
    }

    // Serial result procesing with parallel fetch

    // First get top addr items
    let topaddr = await books.aggregate([
      { "$group": {
        "_id": "$addr",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ]).toArray();

    // Run parallel top books for each addr
    let topbooks = await Promise.all(
      topaddr.map(({ _id: addr }) =>
        books.aggregate([
          { "$match": { addr } },
          { "$group": {
            "_id": "$book",
            "count": { "$sum": 1 }
          }},
          { "$sort": { "count": -1 } },
          { "$limit": 2 }
        ]).toArray()
      )
    );

    // Merge output
    topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
    log({ topaddr });

    client.close();

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

books.json

书籍.json

{ "addr": "address1",  "book": "book1"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book5"  }
{ "addr": "address3",  "book": "book9"  }
{ "addr": "address2",  "book": "book5"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book1"  }
{ "addr": "address15", "book": "book1"  }
{ "addr": "address9",  "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4",  "book": "book3"  }
{ "addr": "address5",  "book": "book1"  }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1",  "book": "book1"  }

回答by Sikorski

Using aggregate function like below :

使用如下聚合函数:

[
{$group: {_id : {book : '$book',address:'$addr'}, total:{$sum :1}}},
{$project : {book : '$_id.book', address : '$_id.address', total : '$total', _id : 0}}
]

it will give you result like following :

它会给你如下结果:

        {
            "total" : 1,
            "book" : "book33",
            "address" : "address90"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address1"
        }, 
        {
            "total" : 1,
            "book" : "book99",
            "address" : "address9"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address5"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address2"
        }, 
        {
            "total" : 1,
            "book" : "book3",
            "address" : "address4"
        }, 
        {
            "total" : 1,
            "book" : "book11",
            "address" : "address77"
        }, 
        {
            "total" : 1,
            "book" : "book9",
            "address" : "address3"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address15"
        }, 
        {
            "total" : 2,
            "book" : "book1",
            "address" : "address2"
        }, 
        {
            "total" : 3,
            "book" : "book1",
            "address" : "address1"
        }

I didn't quite get your expected result format, so feel free to modify this to one you need.

我没有完全得到您预期的结果格式,因此请随时将其修改为您需要的格式。

回答by Jitendra

Below query will provide exactly the same result as given in the desired response:

以下查询将提供与所需响应中给出的结果完全相同的结果:

db.books.aggregate([
    {
        $group: {
            _id: { addresses: "$addr", books: "$book" },
            num: { $sum :1 }
        }
    },
    {
        $group: {
            _id: "$_id.addresses",
            bookCounts: { $push: { bookName: "$_id.books",count: "$num" } }
        }
    },
    {
        $project: {
            _id: 1,
            bookCounts:1,
            "totalBookAtAddress": {
                "$sum": "$bookCounts.count"
            }
        }
    }

]) 

The response will be looking like below:

响应将如下所示:

/* 1 */
{
    "_id" : "address4",
    "bookCounts" : [
        {
            "bookName" : "book3",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 2 */
{
    "_id" : "address90",
    "bookCounts" : [
        {
            "bookName" : "book33",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 3 */
{
    "_id" : "address15",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 4 */
{
    "_id" : "address3",
    "bookCounts" : [
        {
            "bookName" : "book9",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 5 */
{
    "_id" : "address5",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 6 */
{
    "_id" : "address1",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 3
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 4
},

/* 7 */
{
    "_id" : "address2",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 2
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 3
},

/* 8 */
{
    "_id" : "address77",
    "bookCounts" : [
        {
            "bookName" : "book11",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 9 */
{
    "_id" : "address9",
    "bookCounts" : [
        {
            "bookName" : "book99",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
}