Fastest way to remove duplicate documents in mongodb

Note: this page is a translation of a popular StackOverflow question. Warning: it is provided under the CC BY-SA 4.0 license; you are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/14184099/

Asked by ewooycom
I have approximately 1.7M documents in mongodb (10M+ in the future). Some of them represent duplicate entries which I do not want. The structure of a document is something like this:
{
    _id: 14124412,
    nodes: [
        12345,
        54321
    ],
    name: "Some beauty"
}
A document is a duplicate if it has at least one node the same as another document with the same name (see the example below). What is the fastest way to remove the duplicates?
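For example, under this definition the following (hypothetical) document would be a duplicate of the one above, because it shares node 12345 and has the same name:

{
    _id: 14124413,
    nodes: [
        12345,
        99999
    ],
    name: "Some beauty"
}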
Answered by Somnath Muluk
The dropDups: true option is not available in 3.0.
I have a solution using the aggregation framework for collecting duplicates and then removing them in one go.

It might be somewhat slower than system-level "index" changes, but it is good when you consider the way you want to remove the duplicate documents.
a. Remove all documents in one go
var duplicates = [];

db.collectionName.aggregate([
    { $match: {
        name: { "$ne": '' }             // discard selection criteria
    }},
    { $group: {
        _id: { name: "$name" },         // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }             // duplicates have a count greater than one
    }}
],
{ allowDiskUse: true }                  // for faster processing if the set is large
)   // you can print the result up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();                   // skip the first element, so one document is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);         // collect all duplicate ids
    });
});

// Optional: print all "_id" values that are about to be deleted
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({ _id: { $in: duplicates } });
b. You can delete the documents one by one.
db.collectionName.aggregate([
    // discard selection criteria; you can remove this $match stage if you want
    { $match: {
        "source_references.key": { "$ne": '' }
    }},
    { $group: {
        _id: { key: "$source_references.key" }, // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }                     // duplicates have a count greater than one
    }}
],
{ allowDiskUse: true }                          // for faster processing if the set is large
)   // you can print the result up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();                                      // skip the first element, so one document is kept
    db.collectionName.remove({ _id: { $in: doc.dups } });  // delete the remaining duplicates
});
Answered by JohnnyHK
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).
Answered by dhythhsba
1. Create a collection dump with mongodump
2. Clear the collection
3. Add a unique index
4. Restore the collection with mongorestore (see the sketch below)
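A minimal sketch of this workflow, assuming a database named mydb (the database name and backup path are placeholders). On restore, mongorestore should report duplicate key errors for documents that violate the unique index and skip them, which is what removes the duplicates; verify this behavior on your version first:

# 1. Dump the collection (names and paths are hypothetical)
mongodump --db mydb --collection collectionName --out /backup

# 2. In the mongo shell: clear the collection, then add the unique index
#    db.collectionName.remove({})
#    db.collectionName.createIndex({ name: 1, nodes: 1 }, { unique: true })

# 3. Restore; documents violating the unique index are skipped
mongorestore --db mydb --collection collectionName /backup/mydb/collectionName.bson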
Answered by Ali Abul Hawa
I found this solution that works with MongoDB 3.4: I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
    {
        // only match documents that have this field;
        // you can omit this stage if fieldX is never missing
        $match: { "fieldX": { $nin: [null] } }
    },
    {
        $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } }
    },
    {
        $replaceRoot: { "newRoot": "$doc" }
    }
],
{ allowDiskUse: true })
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups the documents by fieldX, and only keeps the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group with the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipeline stages, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. ("Couldn't post a link here, my reputation is less than 10 :(")
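If you want deterministic results, you can add an explicit $sort before the $group; a minimal sketch, assuming you sort by _id to keep the oldest document per fieldX (the sort key is an assumption, use whatever defines "first" for you):

db.collection.aggregate([
    { $match: { "fieldX": { $nin: [null] } } },
    { $sort: { "_id": 1 } },    // makes $first deterministic (assumed sort key)
    { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
    { $replaceRoot: { "newRoot": "$doc" } }
],
{ allowDiskUse: true })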
You can save the results to a new collection by adding an $out stage...
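A sketch of that variant, writing the de-duplicated documents to a new collection (the output collection name is a placeholder):

db.collection.aggregate([
    { $match: { "fieldX": { $nin: [null] } } },
    { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
    { $replaceRoot: { "newRoot": "$doc" } },
    { $out: "collectionWithoutDuplicates" }   // hypothetical output collection
],
{ allowDiskUse: true })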
Alternatively, if one is only interested in a few fields, e.g. field1 and field2, rather than the whole document, use the group stage without replaceRoot:
db.collection.aggregate([
    {
        // only match documents that have this field
        $match: { "fieldX": { $nin: [null] } }
    },
    {
        $group: { "_id": "$fieldX", "field1": { "$first": "$$ROOT.field1" }, "field2": { "$first": "$field2" } }
    }
],
{ allowDiskUse: true })
Answered by Fernando
Here is a slightly more 'manual' way of doing it:
Essentially, first, get a list of all the unique keys you are interested in. Then perform a search using each of those keys and delete if that search returns more than one document.
db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({ key: num }).forEach((doc) => {
        // keep the first matching document; remove one duplicate per extra match
        if (i) db.collection.remove({ key: num }, { justOne: true });
        i++;
    });
});
Answered by amateur
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve one random id from the duplicate records in the collection, then delete all the records in the collection other than the random id that we retrieved from findOne.
You can do something like this if you are trying to do it in pymongo.
import logging
from pymongo import MongoClient

_logger = logging.getLogger(__name__)
db = MongoClient().mydb   # assumed connection; "mydb" is a placeholder
collection = db.collection


def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                # find one document to retain among the duplicates
                # (the filter is a placeholder for your duplicate criteria)
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                # delete everything else that matches the same criteria
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error("Error when retaining the record: %s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])
From the shell:
- Replace find_one with findOne
- The same remove command should work.
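A rough mongo shell equivalent of the retain-and-remove step, assuming the same placeholder criteria {field1: 'x', field2: 'y'}:

var retain = db.collection.findOne({ field1: 'x', field2: 'y' }, { _id: 1 });
db.collection.remove({ field1: 'x', field2: 'y', _id: { $ne: retain._id } });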
Answered by sanair96
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist (docs). If you put the name of the collection itself, it will replace the collection with the new data. If the name does not exist, it will create a new collection.

Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
    {
        $unwind: { path: "$nodes" }
    },
    {
        $group: {
            _id: "$name",
            nodes: {
                $addToSet: "$nodes"
            }
        }
    },
    {
        $project: {
            _id: 0,
            name: "$_id",   // the group _id is the name itself
            nodes: 1
        }
    },
    {
        $out: "collectionNameWithoutDuplicates"
    }
])
Answered by Renny
Using pymongo, this should work.
Add the fields that need to be unique for the collection to unique_field:
unique_field = {"field1": "$field1", "field2": "$field2"}
cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
], allowDiskUse=True)
Slice the dups array depending on the duplication count (here I had only one extra duplicate for each key).
items = list(cursor)
removeIds = items[0]['dups']
DB.COL.remove({"uuid": {"$in": removeIds}})  # assuming DB.COL is the same collection handle as above