MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

Question

Asked by Eran Medan

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query (select count(distinct x) on an indexed column) in MongoDB, where there are millions of documents.

First attempt

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length

Obviously, I got this error because my dataset is huge:

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
        "errmsg" : "exception: distinct too big, 16mb cap",
        "code" : 10044,
        "ok" : 0
}

Second attempt

I decided to try a group:

db.myCollection.group({
    key: {myIndexedNonUniqueField: 1},
    initial: {count: 0},
    reduce: function (obj, prev) { prev.count++; }
});

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys

Third attempt

I haven't tried this yet, but there are several suggestions that involve mapReduce, e.g.:

this one: how to do distinct and group in mongodb? (not accepted; the answer author / OP didn't test it)
this one: MongoDB group by Functionalities (seems similar to the second attempt)
this one: http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
this one: https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
this one: http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

Also

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

But at this point I thought it was worth asking here: what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? Or is there an efficient way?

Update:

This comment on the official MongoDB docs is not encouraging. Is it accurate?

http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808

Update2:

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

Answer 1

Accepted answer by William Z

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values:

pipeline = [ 
    { $group: { _id: "$myIndexedNonUniqueField"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand( 
    {
    "aggregate": "myCollection" , 
    "pipeline": pipeline
    }
);
printjson(R);
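
On later releases the same pipeline can usually be passed straight to the shell helper instead of db.runCommand(). The sketch below only builds the pipeline; buildDistinctCountPipeline is a hypothetical helper name, not something MongoDB provides. It assumes MongoDB 2.6 or later for the commented usage line, where aggregate() returns a cursor and accepts the allowDiskUse option so that a large $group can spill to disk.

```javascript
// Hypothetical helper (not part of MongoDB): builds the two-stage
// pipeline from the answer above for an arbitrary field name.
function buildDistinctCountPipeline(fieldName) {
  return [
    // Stage 1: one output document per distinct value of the field.
    { $group: { _id: "$" + fieldName } },
    // Stage 2: collapse those documents into a single count.
    { $group: { _id: 1, count: { $sum: 1 } } }
  ];
}

var pipeline = buildDistinctCountPipeline("myIndexedNonUniqueField");

// Assumed usage in a shell connected to a 2.6+ server:
//   db.myCollection.aggregate(pipeline, { allowDiskUse: true }).toArray();
```

From MongoDB 3.4 onward, the second stage could also be written as { $count: "count" }, which should produce the same single-document result.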

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection;
var DEST = db.distinct;
DEST.drop();


map = function() {
  emit( this.myIndexedNonUniqueField , {count: 1});
}

reduce = function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];        // count each distinct value for lagniappe
  });

  return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce,
    { out: 'distinct',
      verbose: true
    }
);

print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );

Note that you cannot return the result of the map/reduce inline, because that could overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
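
To see why the size of the output collection equals the distinct count, here is a toy in-memory stand-in for the two phases, runnable as plain JavaScript without a server. It is only a sketch and not how MongoDB implements mapReduce: the emit function, the emitted map, and the sample docs array are all assumptions made for illustration. The map and reduce functions mirror the ones in the answer.

```javascript
// Toy stand-in for the server-side emit(); collects values per key.
var emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

function map() {
  emit(this.myIndexedNonUniqueField, { count: 1 });
}

function reduce(key, values) {
  var count = 0;
  values.forEach(function (v) { count += v.count; });
  return { count: count };
}

// Made-up sample documents: three distinct values among six documents.
var docs = ["a", "b", "a", "c", "b", "a"].map(function (v) {
  return { myIndexedNonUniqueField: v };
});

// Phase 1: run map over every document, then reduce each key's values.
docs.forEach(function (doc) { map.call(doc); });
var output = [];
emitted.forEach(function (values, key) {
  // As in MongoDB, reduce is skipped for keys with a single value.
  var value = values.length === 1 ? values[0] : reduce(key, values);
  output.push({ _id: key, value: value });
});

// Phase 2: the number of output documents is the distinct count.
var distinctCount = output.length; // 3 for the sample data
```

Each output document also carries the per-value occurrence count, which is the extra information the reduce function in the answer computes.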

Answer 2

Answer by Stackee007

db.myCollection.aggregate( 
   {$group : {_id : "$myIndexedNonUniqueField"} }, 
   {$group: {_id:1, count: {$sum : 1 }}});

Straight to the result:

db.myCollection.aggregate( 
   {$group : {_id : "$myIndexedNonUniqueField"} }, 
   {$group: {_id:1, count: {$sum : 1 }}})
   .result[0].count;
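
The .result[0].count access relies on the return shape of the 2.2/2.4 shell, where aggregate() returned a document with a result array. From MongoDB 2.6 the shell helper returns a cursor instead, and an empty collection yields no output document at all. A minimal sketch of a tolerant reader follows; extractDistinctCount is a hypothetical helper name, and the sample inputs are made up to mimic the two shapes.

```javascript
// Hypothetical helper: reads the count from either the legacy
// { result: [...] } document or an array of result documents
// (for example, what cursor.toArray() gives back in newer shells).
function extractDistinctCount(res) {
  var docs = Array.isArray(res) ? res : (res && res.result) || [];
  // No output document means the collection had no matching documents.
  return docs.length > 0 ? docs[0].count : 0;
}

// Made-up sample return values for illustration.
var legacyShape = { result: [{ _id: 1, count: 4 }], ok: 1 };
var arrayShape = [{ _id: 1, count: 4 }];
var emptyShape = { result: [], ok: 1 };

var fromLegacy = extractDistinctCount(legacyShape);
var fromArray = extractDistinctCount(arrayShape);
var fromEmpty = extractDistinctCount(emptyShape);
```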

Answer 3

Answer by Munib mir

The following solution worked for me:

db.test.distinct('user');
[ "alex", "England", "France", "Australia" ]
db.countries.distinct('country').length
4
