通过关键字段查找 MongoDB 集合中的所有重复文档
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9491920/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Find all duplicate documents in a MongoDB collection by a key field
提问by frazman
Suppose I have a collection with some set of documents. something like this.
假设我有一个包含一些文档集的集合。像这样的东西。
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}
I want to find all the duplicated entries in this collection by the "name" field. E.g. "foo" appears twice and "bar" appears 3 times.
我想通过“名称”字段查找此集合中所有重复的条目。例如,“foo”出现两次,“bar”出现 3 次。
采纳答案by driangle
Note: this solution is the easiest to understand, but not the best.
注意:这个解决方案最容易理解,但不是最好的。
You can use mapReduce
to find out how many times a document contains a certain field:
您可以使用mapReduce
来找出文档包含某个字段的次数:
var map = function(){
if(this.name) {
emit(this.name, 1);
}
}
var reduce = function(key, values){
return Array.sum(values);
}
var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});
回答by expert
The accepted answer is terribly slow on large collections, and doesn't return the _id
s of the duplicate records.
接受的答案在大型集合上非常慢,并且不会返回_id
重复记录的s。
Aggregation is much faster and can return the _id
s:
聚合要快得多,并且可以返回_id
s:
db.collection.aggregate([
{ $group: {
_id: { name: "$name" }, // replace `name` here twice
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
]);
In the first stage of the aggregation pipeline, the $groupoperator aggregates documents by the name
field and stores in uniqueIds
each _id
value of the grouped records.
The $sumoperator adds up the values of the fields passed to it, in this case the constant 1
- thereby counting the number of grouped records into the count
field.
在聚合管道的第一阶段,$group运算符按name
字段聚合文档并存储在分组记录的uniqueIds
每个_id
值中。该$和运营商加起来传递给它,在这种情况下,不断的字段的值1
-借此计算的分组的记录数量到count
现场。
In the second stage of the pipeline, we use $matchto filter documents with a count
of at least 2, i.e. duplicates.
在管道的第二阶段,我们使用$match过滤 acount
至少为 2 的文档,即重复项。
Then, we sort the most frequent duplicates first, and limit the results to the top 10.
然后,我们首先对最频繁的重复项进行排序,并将结果限制在前 10 个。
This query will output up to $limit
records with duplicate names, along with their _id
s. For example:
此查询将输出最多$limit
具有重复名称的记录及其_id
s。例如:
{
"_id" : {
"name" : "Toothpick"
},
"uniqueIds" : [
"xzuzJd2qatfJCSvkN",
"9bpewBsKbrGBQexv4",
"fi3Gscg9M64BQdArv",
],
"count" : 3
},
{
"_id" : {
"name" : "Broom"
},
"uniqueIds" : [
"3vwny3YEj2qBsmmhA",
"gJeWGcuX6Wk69oFYD"
],
"count" : 2
}
回答by Prasanth Bendra
For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group
. Note that aggregation is faster and more powerful in that it can return the _id
s of the duplicate records.
有关通用 Mongo 解决方案,请参阅使用group
. 请注意,聚合更快、更强大,因为它可以返回_id
重复记录的s。
For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the groupmethod:
对于pymongo,公认的答案(使用 mapReduce)并不是那么有效。相反,我们可以使用group方法:
$connection = 'mongodb://localhost:27017';
$con = new Mongo($connection); // mongo db connection
$db = $con->test; // database
$collection = $db->prb; // table
$keys = array("name" => 1); Select name field, group by it
// set intial values
$initial = array("count" => 0);
// JavaScript function to perform
$reduce = "function (obj, prev) { prev.count++; }";
$g = $collection->group($keys, $initial, $reduce);
echo "<pre>";
print_r($g);
Output will be this :
输出将是这样的:
Array
(
[retval] => Array
(
[0] => Array
(
[name] =>
[count] => 1
)
[1] => Array
(
[name] => MongoDB
[count] => 2
)
)
[count] => 3
[keys] => 2
[ok] => 1
)
The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name
. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group
for the canonical solution using group
.
等效SQL查询是:SELECT name, COUNT(name) FROM prb GROUP BY name
。请注意,我们仍然需要从数组中过滤掉计数为 0 的元素。同样,请参阅MongoDB 食谱以group
使用group
.
回答by Krunal Shah
aggregation pipeline frameworkcan be used to easily identify documents with duplicate key values:
聚合管道框架可用于轻松识别具有重复键值的文档:
// Desired unique index:
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})
db.collection.aggregate([
{ $group: {
_id: { firstField: "$firstField", secondField: "$secondField" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}},
{ $match: {
count: { $gt: 1 }
}}
])
~ Ref: useful information on an official mongo lab blog:
~ Ref:官方 mongo 实验室博客上的有用信息:
https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework
https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework
回答by Donato
The highest accepted answer here has this:
这里接受的最高答案是:
uniqueIds: { $addToSet: "$_id" },
That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:
这也将返回给您一个名为 uniqueIds 的新字段,其中包含一个 id 列表。但是,如果您只想要字段及其计数怎么办?那么它会是这样的:
db.collection.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.
为了解释这一点,如果您来自像 MySQL 和 PostgreSQL 这样的 SQL 数据库,您习惯于与 GROUP BY 语句一起使用的聚合函数(例如 COUNT()、SUM()、MIN()、MAX())允许您,对于例如,查找列值出现在表中的总数。
SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type |
+----------+-----------------+
| 3 | Contact |
| 1 | Practice |
| 1 | Prospect |
| 1 | Task |
+----------+-----------------+
As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.
如您所见,我们的输出显示了每个 my_type 值出现的次数。要在 MongoDB 中查找重复项,我们将以类似的方式解决该问题。MongoDB 拥有聚合操作,它将来自多个文档的值组合在一起,并且可以对分组的数据执行各种操作以返回单个结果。它与 SQL 中的聚合函数的概念类似。
Assuming a collection called contacts, the initial setup looks as follows:
假设有一个名为 contacts 的集合,初始设置如下所示:
db.contacts.aggregate([ ... ]);
This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.
这个聚合函数采用聚合运算符数组,在我们的例子中,我们需要 $group 运算符,因为我们的目标是按字段的计数对数据进行分组,即字段值的出现次数。
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.
这种方法有点特殊性。_id 字段是使用 group by 运算符所必需的。在这种情况下,我们对 $name 字段进行分组。_id 中的键名可以是任何名称。但是我们使用 name 是因为它在这里很直观。
By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):
通过仅使用 $group 运算符运行聚合,我们将获得所有名称字段的列表(无论它们在集合中出现一次还是多次):
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }
Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.
注意上面的聚合是如何工作的。它获取带有名称字段的文档并返回提取的名称字段的新集合。
But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).
但是我们想知道的是字段值重新出现了多少次。$group 运算符采用计数字段,该字段使用 $sum 运算符将表达式 1 添加到组中每个文档的总数中。所以 $group 和 $sum 一起返回给定字段(例如名称)产生的所有数值的总和。
db.contacts.aggregate([
{$group: {
_id: {name: "$name"},
count: {$sum: 1}
}
}
]);
{ "_id" : { "name" : "John" }, "count" : 1 }
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
{ "_id" : { "name" : "Amanda" }, "count" : 1 }
Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.
由于目标是消除重复项,因此需要一个额外的步骤。要仅获取计数超过 1 的组,我们可以使用 $match 运算符来过滤我们的结果。在 $match 运算符中,我们将告诉它查看计数字段,并使用表示“大于”和数字 1 的 $gt 运算符告诉它查找大于 1 的计数。
db.contacts.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:
附带说明一下,如果您通过像 Mongoid for Ruby 这样的 ORM 使用 MongoDB,您可能会收到以下错误:
The 'cursor' option is required, except for aggregate with the explain argument
This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:
这很可能意味着您的 ORM 已过时并且正在执行 MongoDB 不再支持的操作。因此,要么更新您的 ORM,要么找到修复程序。对于 Mongoid,这是对我的修复:
module Moped
class Collection
# Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries. This overrides
# `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
#
# Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
#
# Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
# the command includes the `explain` option. Unless you include the `explain` option, you must specify the
# `cursor` option.
#
# To indicate a cursor with the default batch size, specify `cursor: {}`.
#
# To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
#
def aggregate(*pipeline)
# Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
end
private
def extract_result(response)
response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
end
end
end