Random record from MongoDB

Question

Asked by Will M

I am looking to get a random record from a huge MongoDB collection (100 million records).

What is the fastest and most efficient way to do so? The data is already there, and there is no field in which I can generate a random number and obtain a random row.

Any suggestions?

Answer 1

Answered by JohnnyHK

Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:

// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])

If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:

// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
    { $match: { a: 10 } },
    { $sample: { size: 1 } }
])

As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
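For readers calling this from a driver rather than the shell, the pipeline is just a list of stage documents. The following is a minimal sketch in Python, not part of the original answer: the helper names `sample_pipeline` and `dedupe_by_id` are hypothetical, and the resulting list is assumed to be passed to something like pymongo's `collection.aggregate(...)`. The second helper shows one way to drop the duplicates mentioned above on the client side.

```python
def sample_pipeline(size, match=None):
    """Build the stages shown above: an optional $match, then $sample."""
    pipeline = []
    if match:
        pipeline.append({"$match": match})
    pipeline.append({"$sample": {"size": size}})
    return pipeline


def dedupe_by_id(docs):
    """Keep the first occurrence of each _id, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["_id"] not in seen:
            seen.add(doc["_id"])
            unique.append(doc)
    return unique


# The pipeline is plain data, so it can be inspected without a server.
print(sample_pipeline(1, {"a": 10}))
```

Note that after de-duplication the result may hold fewer than `size` documents.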

Answer 2

Answered by ceejayoz

Do a count of all records, generate a random integer from 0 up to (but not including) the count, and then do:

db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
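The offset has to stay strictly below the count, otherwise the skip runs past the last document. Here is a small sketch of that step in Python, using an in-memory list in place of a real collection; the function name `random_skip_offset` is made up for illustration.

```python
import random


def random_skip_offset(count, rng=random):
    """Return an integer in [0, count), usable as the argument to skip()."""
    if count <= 0:
        raise ValueError("cannot pick from an empty collection")
    return rng.randrange(count)


# Stand-in for a collection; slicing mimics skip(offset).limit(1).
docs = [{"_id": i} for i in range(5)]
offset = random_skip_offset(len(docs))
picked = docs[offset:offset + 1][0]
print(offset, picked)
```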

Answer 3

Answered by Michael

Update for MongoDB 3.2

3.2 introduced $sample to the aggregation pipeline.

There's also a good blog post on putting it into practice.

For older versions (previous answer)

This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."

The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/

To paraphrase the recipe, you assign random numbers to your documents:

db.docs.save( { key : 1, ..., random : Math.random() } )

Then select a random document:

rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
  result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}

Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
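To make the two-step lookup concrete, here is a sketch in Python that mimics it over a sorted in-memory list of stored `random` values. The function name `pick_by_random_field` is hypothetical. One assumption to note: the fallback in this sketch returns the closest value below `rand`, whereas a real findOne without an explicit sort may return a different matching document.

```python
import bisect


def pick_by_random_field(sorted_values, rand):
    """Mimic findOne({random: {$gte: rand}}) with a fallback to $lte.

    sorted_values: ascending list of the stored 'random' field values.
    """
    if not sorted_values:
        return None
    # First stored value that is >= rand, if any.
    i = bisect.bisect_left(sorted_values, rand)
    if i < len(sorted_values):
        return sorted_values[i]
    # rand is above every stored value: fall back to the $lte side.
    return sorted_values[-1]


values = [0.2, 0.5, 0.9]
print(pick_by_random_field(values, 0.3))
print(pick_by_random_field(values, 0.95))
```

The second call shows why the fallback exists: when `rand` is larger than every stored value, the $gte query alone finds nothing.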

And of course you'll want to index on the random field:

db.docs.ensureIndex( { key : 1, random :1 } )

If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.

Answer 4

Answered by Nico de Poel

You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.

First, enable geospatial indexing on a collection:

db.docs.ensureIndex( { random_point: '2d' } )

To create a bunch of documents with random points on the X-axis:

for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}

Then you can get a random document from the collection like this:

db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )

Or you can retrieve several documents nearest to a random point:

db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )

This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.

Answer 5

Answered by spam_eggs

The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.

function draw(collection, query) {
    // query: mongodb query object (optional)
    query = query || { };
    query['random'] = { $lte: Math.random() };
    // Sort on the same "random" field that the filter uses.
    var cur = collection.find(query).sort({ random: -1 });
    if (! cur.hasNext()) {
        // Nothing at or below the threshold: fall back to the unfiltered set.
        delete query.random;
        cur = collection.find(query).sort({ random: -1 });
    }
    var doc = cur.next();
    // Re-roll the picked document's value so it is not favored next time.
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}

It also requires you to add a "random" field to your documents, so don't forget to add it when you create them. You may need to initialize your collection as shown by Geoffrey:

function addRandom(collection) { 
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    }); 
} 
db.eval(addRandom, db.things);

Benchmark results

This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:

For a collection with 1,000,000 elements:

This method takes less than a millisecond on my machine.
The skip() method takes 180 ms on average.

The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.

This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
The randomness is not 100% perfect, but it is very good (and it can be improved if necessary).

This recipe is not perfect; the perfect solution would be a built-in feature, as others have noted.
However, it should be a good compromise for many purposes.
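The re-roll step is what keeps any one document from being favored forever. As a rough illustration only, the same selection logic can be played out in Python over an in-memory list of dicts instead of a collection; the `rng` parameter is an assumption added here so the behavior can be checked with fixed numbers.

```python
import random


def draw(docs, rng=random):
    """Pick the doc with the largest 'random' value at or below a threshold.

    docs: non-empty list of dicts, each with a 'random' field in [0, 1).
    rng:  any object exposing random(), e.g. the random module itself.
    """
    threshold = rng.random()
    candidates = [d for d in docs if d["random"] <= threshold]
    if not candidates:
        # Nothing at or below the threshold: consider every document.
        candidates = docs
    doc = max(candidates, key=lambda d: d["random"])
    # Re-roll so the same document is not favored on the next draw.
    doc["random"] = rng.random()
    return doc


docs = [{"_id": i, "random": random.random()} for i in range(10)]
print(draw(docs)["_id"])
```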

Answer 6

Answered by Blakes Seven

Here is a way using the default ObjectId values for _id and a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000");

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

That's the general logic in shell representation, and it is easily adaptable.

In short:

Find the min and max primary key values in the collection.
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.

This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but follows the same basic idea as the points above.
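The padding step can be shown on its own, without a database. This is a sketch in Python that only builds the 24-character hex string; the function names are invented for illustration, and turning the string into an actual ObjectId (for example with the `bson` package) is left out.

```python
import random


def objectid_hex_for_timestamp(seconds):
    """Return 8 hex chars for the 4-byte timestamp, padded with 16 zeros."""
    return format(seconds, "08x") + "0" * 16


def random_boundary_id(min_seconds, max_seconds, rng=random):
    """Pick a whole-second timestamp in [min, max] and pad it to 24 hex chars."""
    seconds = rng.randint(min_seconds, max_seconds)
    return objectid_hex_for_timestamp(seconds)


# Example bounds in seconds since the epoch (arbitrary values).
print(random_boundary_id(1600000000, 1600003600))
```

Because the trailing 16 characters are all zeros, the resulting value sorts at or before every real ObjectId created in that same second, which is why the query uses $gte with an ascending sort.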

Answer 7

Answered by dbam

Now you can use the aggregation framework. Example:

db.users.aggregate(
   [ { $sample: { size: 3 } } ]
)

See the doc.

Answer 8

Answered by Jabba

In Python using pymongo:

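import random

def get_random_doc():
    # "collection" is assumed to be an existing pymongo Collection.
    # count_documents({}) replaces collection.count(), which PyMongo 4 removed.
    count = collection.count_documents({})
    return collection.find()[random.randrange(count)]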

Answer 9

Answered by Daniel

Using Python (pymongo), the aggregate function also works.

collection.aggregate([{'$sample': {'size': sample_size }}])

This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int])). This is especially the case for large collections.

Answer 10

Answered by dm.

It is tough if there is no data there to key off of. What are the _id fields? Are they MongoDB object IDs? If so, you could get the highest and lowest values:

lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;

Then, if you assume the IDs are uniformly distributed (they aren't, but at least it's a start):

unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)

V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();

randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
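The pseudocode above can be turned into a small runnable sketch. This Python version works on the 24-character hex form of the two ObjectIds and returns a new hex string; the function name `interpolate_object_id` is hypothetical, and it assumes `lowest_hex` and `highest_hex` were obtained with the two sort queries shown earlier. It draws the 8-byte prefix strictly below the highest prefix so that a following $gte query has something to match.

```python
import random


def interpolate_object_id(lowest_hex, highest_hex, rng=random):
    """Pick a random 24-char hex id between two ObjectId hex strings."""
    low = int(lowest_hex[:16], 16)     # first 8 bytes of the lowest _id
    high = int(highest_hex[:16], 16)   # first 8 bytes of the highest _id
    # Integer draw in [low, high); fall back to low when the prefixes match.
    n = low if high <= low else rng.randrange(low, high)
    tail = rng.getrandbits(32)         # the remaining 4 bytes
    return format(n, "016x") + format(tail, "08x")


# Made-up ids purely for illustration.
lowest = "5f0000000000000000000001"
highest = "5f0000010000000000000002"
print(interpolate_object_id(lowest, highest))
```

The returned string would then be wrapped in an ObjectId and used in the `$gte` query, ideally with an ascending sort on _id.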
