Fast way to find duplicates on indexed column in mongodb
Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/4224773/
Fast way to find duplicates on indexed column in mongodb
Asked by Piotr Czapla
I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know of any fast way to do that using map-reduce, or should I just iterate over all records and check for duplicates manually?
My current approach using map-reduce iterates over the collection almost twice (assuming there is a very small number of duplicates):
// Map-reduce: count how many documents share each md5 value.
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    },
    { out: "md5_counts" }   // write the per-md5 counts to an output collection
);
// Copy every md5 that occurs more than once into db.duplicates.
db[res.result].find({ value: { $gt: 1 } }).forEach(
    function (obj) {
        db.duplicates.insert(obj);
    });
Answered by expert
I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
// Group by extra_info.id, keep groups with 2 or more documents,
// sort by the count in descending order, and return the top 5.
db.places.aggregate([
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort  : { total : -1 } },
    { $limit : 5 }
]);
It searches for documents whose extra_info.id is used two or more times, sorts the results in descending order of that count, and prints the first 5 of them.
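Applied to the original question, a minimal sketch of the same pipeline over the indexed md5 field might look like this (the allowDiskUse option, available since MongoDB 2.6, is an assumption for very large collections, not part of the original answer):

// Group documents by md5 and keep only values that occur more than once.
db.files.aggregate(
    [
        { $group : { _id : "$md5", total : { $sum : 1 } } },
        { $match : { total : { $gt : 1 } } },
        { $sort  : { total : -1 } }
    ],
    { allowDiskUse : true }   // assumption: lets large groupings spill to disk
);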
Answered by Gates VP
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;

// Walk the md5 values in sorted order; duplicates end up back-to-back.
db.files.find( { "md5" : { $exists : true } }, { "md5" : 1 } ).sort( { "md5" : 1 } ).forEach( function(current) {
    if (current.md5 == previous_md5) {
        // Record the duplicate md5 (upsert) and count the extra copies.
        db.duplicates.update( { "_id" : current.md5 }, { "$inc" : { count : 1 } }, true );
    }
    previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the matching documents will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
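As a rough illustration of that clean-up step (a sketch, not part of the original answer, and which copy to keep is up to you): for each md5 recorded in db.duplicates, keep the first matching document in db.files and remove the rest.

// For every recorded duplicate md5, keep one document and delete the others.
db.duplicates.find().forEach(function (dup) {
    var extras = db.files.find({ md5 : dup._id }).toArray().slice(1);
    extras.forEach(function (doc) {
        db.files.remove({ _id : doc._id });
    });
});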
Answered by Scott Hernandez
You can do a group by on that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
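A minimal sketch of that group-then-filter approach in the mongo shell (the group command the link documents was later deprecated and removed in favor of the aggregation pipeline, so treat this as a sketch for old versions):

// Count documents per md5 with the legacy group command,
// then keep only the values that appear more than once.
var counts = db.files.group({
    key     : { md5 : 1 },
    cond    : {},
    reduce  : function (doc, result) { result.count += 1; },
    initial : { count : 0 }
});
var dups = counts.filter(function (r) { return r.count > 1; });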
Although, the fastest thing might be to just do a query which only returns that field and then do the aggregation in the client. Group/map-reduce needs access to the whole document, which is much more costly than just providing the data from the index (covered queries are supported in 1.7.3+).
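A rough sketch of that idea, assuming an index on { md5 : 1 }: project only md5 and exclude _id so the query can be answered from the index alone, then tally the values client-side.

// Covered query: only the indexed md5 field is returned and _id is excluded,
// so MongoDB can serve the data from the index. Counting happens on the client.
var counts = {};
db.files.find({}, { md5 : 1, _id : 0 }).hint({ md5 : 1 }).forEach(function (doc) {
    if (doc.md5 !== undefined) {
        counts[doc.md5] = (counts[doc.md5] || 0) + 1;
    }
});
for (var md5 in counts) {
    if (counts[md5] > 1) {
        print(md5 + " occurs " + counts[md5] + " times");
    }
}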
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.
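A minimal sketch of keeping such a side collection up to date (the md5_counts collection name and the recordMd5 helper are illustrative assumptions, not anything from the answer):

// Hypothetical helper: call this whenever a document with an md5 is inserted,
// so db.md5_counts always holds { _id : <md5>, count : <n> }.
function recordMd5(md5) {
    db.md5_counts.update(
        { _id : md5 },
        { $inc : { count : 1 } },
        { upsert : true }
    );
}

// Culling duplicates is then a simple indexed lookup:
db.md5_counts.find({ count : { $gt : 1 } });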