Fastest way to remove duplicate documents in mongodb

Note: this page is a translation of a popular StackOverflow question. Warning: it is provided under the CC BY-SA 4.0 license; you are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/14184099/

Asked by ewooycom
I have approximately 1.7M documents in mongodb (10M+ in the future). Some of them represent duplicate entries which I do not want. The structure of a document is something like this:
{
    _id: 14124412,
    nodes: [
        12345,
        54321
    ],
    name: "Some beauty"
}
A document is a duplicate if it has at least one node the same as another document with the same name (see the example below). What is the fastest way to remove the duplicates?
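For example, under this definition the following (hypothetical) document would be a duplicate of the one above, because it shares node 12345 and has the same name:

{
    _id: 14124413,
    nodes: [
        12345,
        99999
    ],
    name: "Some beauty"
}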
Answered by Somnath Muluk
The dropDups: true option is not available in 3.0.
I have a solution using the aggregation framework for collecting duplicates and then removing them in one go.

It might be somewhat slower than system-level "index" changes, but it is good when you consider the way you want to remove the duplicate documents.
a. Remove all documents in one go
var duplicates = [];

db.collectionName.aggregate([
    { $match: {
        name: { "$ne": '' }             // discard selection criteria
    }},
    { $group: {
        _id: { name: "$name" },         // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }             // duplicates have a count greater than one
    }}
],
{ allowDiskUse: true }                  // for faster processing if the set is large
)   // you can print the result up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();                   // skip the first element, so one document is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);         // collect all duplicate ids
    });
});

// Optional: print all "_id" values that are about to be deleted
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({ _id: { $in: duplicates } });
b. You can delete the documents one by one.
db.collectionName.aggregate([
    // discard selection criteria; you can remove this $match stage if you want
    { $match: {
        "source_references.key": { "$ne": '' }
    }},
    { $group: {
        _id: { key: "$source_references.key" }, // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }                     // duplicates have a count greater than one
    }}
],
{ allowDiskUse: true }                          // for faster processing if the set is large
)   // you can print the result up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();                                      // skip the first element, so one document is kept
    db.collectionName.remove({ _id: { $in: doc.dups } });  // delete the remaining duplicates
});
Answered by JohnnyHK
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).
Answered by dhythhsba
1. Create a collection dump with mongodump
2. Clear the collection
3. Add a unique index
4. Restore the collection with mongorestore (see the sketch below)
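A minimal sketch of this workflow, assuming a database named mydb (the database name and backup path are placeholders). On restore, mongorestore should report duplicate key errors for documents that violate the unique index and skip them, which is what removes the duplicates; verify this behavior on your version first:

# 1. Dump the collection (names and paths are hypothetical)
mongodump --db mydb --collection collectionName --out /backup

# 2. In the mongo shell: clear the collection, then add the unique index
#    db.collectionName.remove({})
#    db.collectionName.createIndex({ name: 1, nodes: 1 }, { unique: true })

# 3. Restore; documents violating the unique index are skipped
mongorestore --db mydb --collection collectionName /backup/mydb/collectionName.bson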
Answered by Ali Abul Hawa
I found this solution that works with MongoDB 3.4: I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
    {
        // only match documents that have this field;
        // you can omit this stage if fieldX is never missing
        $match: { "fieldX": { $nin: [null] } }
    },
    {
        $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } }
    },
    {
        $replaceRoot: { "newRoot": "$doc" }
    }
],
{ allowDiskUse: true })
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups the documents by fieldX, and only keeps the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group with the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipeline stages, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. ("Couldn't post a link here, my reputation is less than 10 :(")
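If you want deterministic results, you can add an explicit $sort before the $group; a minimal sketch, assuming you sort by _id to keep the oldest document per fieldX (the sort key is an assumption, use whatever defines "first" for you):

db.collection.aggregate([
    { $match: { "fieldX": { $nin: [null] } } },
    { $sort: { "_id": 1 } },    // makes $first deterministic (assumed sort key)
    { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
    { $replaceRoot: { "newRoot": "$doc" } }
],
{ allowDiskUse: true })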
You can save the results to a new collection by adding an $out stage...
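A sketch of that variant, writing the de-duplicated documents to a new collection (the output collection name is a placeholder):

db.collection.aggregate([
    { $match: { "fieldX": { $nin: [null] } } },
    { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
    { $replaceRoot: { "newRoot": "$doc" } },
    { $out: "collectionWithoutDuplicates" }   // hypothetical output collection
],
{ allowDiskUse: true })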
Alternatively, if one is only interested in a few fields, e.g. field1 and field2, rather than the whole document, use the group stage without replaceRoot:
db.collection.aggregate([
    {
        // only match documents that have this field
        $match: { "fieldX": { $nin: [null] } }
    },
    {
        $group: { "_id": "$fieldX", "field1": { "$first": "$$ROOT.field1" }, "field2": { "$first": "$field2" } }
    }
],
{ allowDiskUse: true })
Answered by Fernando
Here is a slightly more 'manual' way of doing it:
Essentially, first, get a list of all the unique keys you are interested in. Then perform a search using each of those keys and delete if that search returns more than one document.
db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({ key: num }).forEach((doc) => {
        // keep the first matching document; remove one duplicate per extra match
        if (i) db.collection.remove({ key: num }, { justOne: true });
        i++;
    });
});
Answered by amateur
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve one random id from the duplicate records in the collection, then delete all the records in the collection other than the random id that we retrieved from findOne.
You can do something like this if you are trying to do it in pymongo.
import logging
from pymongo import MongoClient

_logger = logging.getLogger(__name__)
db = MongoClient().mydb   # assumed connection; "mydb" is a placeholder
collection = db.collection


def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                # find one document to retain among the duplicates
                # (the filter is a placeholder for your duplicate criteria)
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                # delete everything else that matches the same criteria
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error("Error when retaining the record: %s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])
From the shell:
- Replace find_one with findOne
- The same remove command should work.
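A rough mongo shell equivalent of the retain-and-remove step, assuming the same placeholder criteria {field1: 'x', field2: 'y'}:

var retain = db.collection.findOne({ field1: 'x', field2: 'y' }, { _id: 1 });
db.collection.remove({ field1: 'x', field2: 'y', _id: { $ne: retain._id } });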
Answered by sanair96
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist (docs). If you put the name of the collection itself, it will replace the collection with the new data. If the name does not exist, it will create a new collection.

Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
    {
        $unwind: { path: "$nodes" }
    },
    {
        $group: {
            _id: "$name",
            nodes: {
                $addToSet: "$nodes"
            }
        }
    },
    {
        $project: {
            _id: 0,
            name: "$_id",   // the group _id is the name itself
            nodes: 1
        }
    },
    {
        $out: "collectionNameWithoutDuplicates"
    }
])
Answered by Renny
Using pymongo, this should work.
Add the fields that need to be unique for the collection to unique_field:
unique_field = {"field1": "$field1", "field2": "$field2"}
cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
], allowDiskUse=True)
Slice the dups array depending on the duplication count (here I had only one extra duplicate for each key).
items = list(cursor)
removeIds = items[0]['dups']
DB.COL.remove({"uuid": {"$in": removeIds}})  # assuming DB.COL is the same collection handle as above