MongoDB select count(distinct x) on an indexed column - count unique results for large data sets
Original question: http://stackoverflow.com/questions/11782566/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Eran Medan
I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of documents).
First attempt
(e.g. from this almost-duplicate question: "Mongo equivalent of SQL's SELECT DISTINCT?")
db.myCollection.distinct("myIndexedNonUniqueField").length
Obviously I got this error, as my dataset is huge:
Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
"errmsg" : "exception: distinct too big, 16mb cap",
"code" : 10044,
"ok" : 0
}
Second attempt
I decided to try a group:
db.myCollection.group({key: {myIndexedNonUniqueField: 1},
initial: {count: 0},
reduce: function (obj, prev) { prev.count++;} } );
But I got this error message instead:
exception: group() can't handle more than 20000 unique keys
Third attempt
I haven't tried it yet, but there are several suggestions that involve mapReduce, e.g.:
- this one: "how to do distinct and group in mongodb?" (not accepted; the answer author / OP didn't test it)
- this one: "MongoDB group by Functionalities" (seems similar to the second attempt)
- this one: http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
- this one: https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
- this one: http://cookbook.mongodb.org/patterns/unique_items_map_reduce/
Also
It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34
But at this point I thought it was worth asking here: what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts, or is there an efficient way?
Update:
This comment on the official MongoDB docs is not encouraging. Is it accurate?
http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808
Update2:
It seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above; a development preview is available, not for production)
Accepted answer by William Z
1) The easiest way to do this is via the aggregation framework. This takes two "$group" stages: the first groups by distinct values, and the second counts all of the distinct values:
pipeline = [
{ $group: { _id: "$myIndexedNonUniqueField"} },
{ $group: { _id: 1, count: { $sum: 1 } } }
];
//
// Run the aggregation command
//
R = db.runCommand(
{
"aggregate": "myCollection" ,
"pipeline": pipeline
}
);
printjson(R);
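To make the two stages concrete, here is a rough in-memory sketch in plain JavaScript of what each $group stage computes. It does not talk to MongoDB at all; the sample documents and the helper names (groupByField, countAll) are made up for illustration.

```javascript
// Plain JavaScript emulation of the two $group stages (no MongoDB needed).
// `docs` is made-up sample data; the helpers are hypothetical, not a MongoDB API.
const docs = [
  { myIndexedNonUniqueField: "a" },
  { myIndexedNonUniqueField: "b" },
  { myIndexedNonUniqueField: "a" },
  { myIndexedNonUniqueField: "c" },
  { myIndexedNonUniqueField: "b" },
];

// Stage 1: { $group: { _id: "$myIndexedNonUniqueField" } }
// Produces one output document per distinct value of the field.
function groupByField(input, field) {
  const seen = new Map();
  for (const doc of input) {
    if (!seen.has(doc[field])) {
      seen.set(doc[field], { _id: doc[field] });
    }
  }
  return Array.from(seen.values());
}

// Stage 2: { $group: { _id: 1, count: { $sum: 1 } } }
// Every input document lands in the single group with _id 1,
// and $sum: 1 adds one per document.
function countAll(input) {
  if (input.length === 0) return []; // $group emits nothing for empty input
  return [{ _id: 1, count: input.length }];
}

const stage1 = groupByField(docs, "myIndexedNonUniqueField");
const stage2 = countAll(stage1);
console.log(JSON.stringify(stage2));
```

The point of the constant _id in the second stage is only to force everything into one group; any constant would do.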
2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.
var SOURCE = db.myCollection;
var DEST = db.distinct
DEST.drop();
map = function() {
emit( this.myIndexedNonUniqueField , {count: 1});
}
reduce = function(key, values) {
var count = 0;
values.forEach(function(v) {
count += v['count']; // count each distinct value for lagniappe
});
return {count: count};
};
//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce,
{ out: 'distinct',
verbose: true
}
);
print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );
Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
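For anyone unsure why the size of the output collection equals the distinct count, here is a small plain-JavaScript sketch of the map/reduce flow. It is only an emulation under assumptions: runMapReduce is a hypothetical helper, the sample documents are invented, and emit is passed in as an argument here, whereas in MongoDB it is a global available inside map.

```javascript
// Plain JavaScript emulation of the map/reduce flow (no MongoDB needed).
// `sampleDocs` is made-up data; `runMapReduce` is a hypothetical helper.
const sampleDocs = ["a", "b", "a", "c", "b", "a"].map(function (v) {
  return { myIndexedNonUniqueField: v };
});

function runMapReduce(input, mapFn, reduceFn) {
  const emitted = new Map(); // key -> list of emitted values
  const emit = function (key, value) {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // As in MongoDB, `this` inside map is the current document.
  for (const doc of input) mapFn.call(doc, emit);

  const output = new Map(); // one entry per distinct key
  for (const [key, values] of emitted) {
    // MongoDB does not call reduce for a key that has a single value.
    output.set(key, values.length === 1 ? values[0] : reduceFn(key, values));
  }
  return output;
}

const mapFn = function (emit) {
  emit(this.myIndexedNonUniqueField, { count: 1 });
};

const reduceFn = function (key, values) {
  let count = 0;
  values.forEach(function (v) {
    count += v.count;
  });
  return { count: count };
};

const out = runMapReduce(sampleDocs, mapFn, reduceFn);
console.log("distinct count = " + out.size);
console.log("occurrences of a = " + out.get("a").count);
```

Since the output holds exactly one entry per emitted key, counting the entries gives the number of distinct values, and the per-key count comes along for free.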
Answer by Stackee007
db.myCollection.aggregate(
{$group : {_id : "$myIndexedNonUniqueField"} },
{$group: {_id:1, count: {$sum : 1 }}});
Straight to the result:
db.myCollection.aggregate(
{$group : {_id : "$myIndexedNonUniqueField"} },
{$group: {_id:1, count: {$sum : 1 }}})
.result[0].count;
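One caveat with .result[0].count: on an empty collection the pipeline emits no documents, so .result[0] is undefined. Also, as far as I know, shell versions from 2.6 on return a cursor from aggregate() rather than a document with a result field. A defensive sketch, where extractCount is a hypothetical helper (not part of MongoDB) and the inputs are hand-written stand-ins for real results:

```javascript
// Hypothetical helper: read the distinct count from an aggregation result.
// Accepts either the legacy result document ({ result: [...] }) or a plain
// array of output documents (for example, what a cursor's toArray() returns).
function extractCount(res) {
  const docs = Array.isArray(res) ? res : (res && res.result) || [];
  // No output documents means the collection had nothing to group.
  return docs.length > 0 ? docs[0].count : 0;
}

// Hand-written stand-ins for real results (assumed shapes):
console.log(extractCount({ result: [{ _id: 1, count: 42 }], ok: 1 })); // 42
console.log(extractCount([{ _id: 1, count: 42 }]));                    // 42
console.log(extractCount({ result: [], ok: 1 }));                      // 0
```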
Answer by Munib mir
The following solution worked for me:
db.test.distinct('user');
[ "alex", "England", "France", "Australia" ]
db.countries.distinct('country').length
4