Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13190370/

Date: 2020-09-09 12:54:26  Source: igfitidea

How to remove duplicates based on a key in Mongodb?

Tags: mongodb, optimization, duplicates, key

Asked by user1518659

I have a collection in MongoDB with around 3 million records. A sample record looks like:

 { "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
   "source_references" : [
                           { "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
                             "name" : "xxx",
                             "key" : 123
                           }
                         ]
 }

I have a lot of duplicate records in this collection that share the same source_references.key. (By duplicate I mean the same source_references.key, not the same _id.)

I want to remove duplicate records based on source_references.key. I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate exists.

Is there a way to remove the duplicates using MongoDB's internal command line?

Answered by Stennie

This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach is required in most cases. For example, you could use aggregation as suggested in: MongoDB duplicate documents even after adding unique key.
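For MongoDB 3.0+, the aggregation-based approach boils down to grouping documents by the key and keeping only one _id per group. A minimal in-memory sketch of that logic in Python (illustrative names; no MongoDB connection required):

```python
from collections import defaultdict

def find_duplicate_ids(docs, key):
    """Group documents by `key` and return the _ids of all but the
    first document in each group (the ones you would delete)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["_id"])
    return [dup_id for ids in groups.values() for dup_id in ids[1:]]

docs = [
    {"_id": 1, "key": 123},
    {"_id": 2, "key": 123},  # duplicate of _id 1
    {"_id": 3, "key": 456},
]
print(find_duplicate_ids(docs, "key"))  # → [2]
```

Against a real collection, the same grouping would be done server-side with a $group stage, and the returned _ids passed to a delete.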

If you are certain that source_references.key identifies duplicate records, you can ensure a unique index with the dropDups: true index creation option in MongoDB 2.6 or older:

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})

This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.

Important note: any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse: true index creation option so the index only applies to documents with a source_references.key field.

Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.

Answered by Kanak Singhal

This is the easiest query I used, on MongoDB 3.2:

db.myCollection.find({}, {myCustomKey: 1}).sort({_id: 1}).forEach(function (doc) {
    // delete every later document (higher _id) with the same key value
    db.myCollection.remove({_id: {$gt: doc._id}, myCustomKey: doc.myCustomKey});
});

Index your customKey before running this to increase speed.
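The effect of that query is to keep, for each myCustomKey value, only the document with the lowest _id. A plain-Python sketch of the same semantics, with the collection simulated as a list of dicts (names are illustrative):

```python
def dedupe_keep_lowest_id(docs, key):
    """Keep, for each key value, only the document with the lowest _id."""
    survivors = []
    seen = set()
    for doc in sorted(docs, key=lambda d: d["_id"]):  # mirrors .sort({_id: 1})
        if doc[key] not in seen:  # first time we see this key value
            seen.add(doc[key])
            survivors.append(doc)
        # later documents with the same key value are dropped, just as
        # remove({_id: {$gt: doc._id}, key: doc.key}) drops them
    return survivors

docs = [{"_id": 3, "k": "a"}, {"_id": 1, "k": "a"}, {"_id": 2, "k": "b"}]
print(dedupe_keep_lowest_id(docs, "k"))  # keeps _id 1 and _id 2
```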

Answered by Aravind Yarram

While @Stennie's answer is valid, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:

  1. Let MongoDB do it for you using Map Reduce
  2. Do it programmatically, which is less efficient.
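As a sketch of the map-reduce option: the map stage emits a count of 1 per key value, the reduce stage sums the counts, and key values with a count above one are the duplicates. A pure-Python emulation of those two stages (illustrative only, not the mongod mapReduce API):

```python
from collections import Counter

def map_stage(docs, key):
    """map: emit (key value, 1) for every document."""
    return [(doc[key], 1) for doc in docs]

def reduce_stage(pairs):
    """reduce: sum the emitted counts per key value."""
    counts = Counter()
    for key_value, count in pairs:
        counts[key_value] += count
    return counts

docs = [{"key": 123}, {"key": 123}, {"key": 456}]
counts = reduce_stage(map_stage(docs, "key"))
duplicates = [k for k, c in counts.items() if c > 1]
print(duplicates)  # → [123]
```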

Answered by Fernando

Here is a slightly more 'manual' way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then perform a search using each of those keys, and delete if that search returns more than one document.

    db.collection.distinct("key").forEach((num) => {
      var i = 0;
      db.collection.find({key: num}).forEach((doc) => {
        // for every occurrence after the first, remove one document with this key
        if (i) db.collection.remove({key: num}, { justOne: true });
        i++;
      });
    });

Answered by octohedron

Expanding on Fernando's answer, I found that it was taking too long, so I modified it.

var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
  var i = 0;
  db.collection.find({ "field": fieldValue }).forEach(doc => {
    if (i) {
      db.collection.remove({ _id: doc._id });
    }
    i++;
    x += 1;
    if (x % 100 === 0) {
      print(x); // Every time we process 100 docs.
    }
  });
});

The improvement is basically removing by document id, which should be faster, and adding a progress report for the operation; you can change the iteration value to your desired amount.

Also, indexing the field before the operation helps.

Answered by octohedron

pip install mongo_remove_duplicate_indexes

  1. Create a script in any language.
  2. Iterate over your collection.
  3. Create a new collection with a unique index on the same field (with the same name) you wish to remove duplicates from in your original collection. For example, if you have a collection gaming with a field genre that contains duplicates you wish to remove, create the new collection with db.createCollection("cname") and the unique index with db.cname.createIndex({'genre': 1}, {unique: true}). Now, when you insert documents, only the first with a given genre will be accepted; the others will be rejected with a duplicate key error.
  4. Insert the documents you read from the original collection into the new collection, handling the exception (e.g. pymongo.errors.DuplicateKeyError).
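The copy-into-a-unique-indexed-collection idea from the steps above can be simulated without a server: a dict plays the role of the unique index, and a locally defined error stands in for pymongo.errors.DuplicateKeyError (all names here are illustrative):

```python
class DuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError."""

class UniqueCollection:
    """A toy collection with a unique index on one field."""
    def __init__(self, key):
        self.key = key
        self.docs = {}  # key value -> document; acts as the unique index

    def insert(self, doc):
        if doc[self.key] in self.docs:
            raise DuplicateKeyError(doc[self.key])
        self.docs[doc[self.key]] = doc

source = [
    {"genre": "rpg", "title": "A"},
    {"genre": "rpg", "title": "B"},  # duplicate genre, will be rejected
    {"genre": "fps", "title": "C"},
]
target = UniqueCollection("genre")
for doc in source:
    try:
        target.insert(doc)
    except DuplicateKeyError:
        pass  # skip duplicates, as step 4 describes
print(sorted(target.docs))  # → ['fps', 'rpg']
```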

Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.

Answered by gilcu2

If you have enough memory, you can do something like this in Scala:

cole.find().toList.groupBy(_.customField)
  .filter(_._2.size > 1)
  .flatMap(_._2.tail)
  .map(_.id)
  .foreach(id => cole.remove(MongoDBObject("_id" -> id)))