Random record from MongoDB

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2824157/

Time: 2020-09-09 11:44:04  Source: igfitidea

mongodb, mongodb-query

Asked by Will M

I am looking to get a random record from a huge (100 million record) MongoDB collection.

What is the fastest and most efficient way to do so? The data is already there, and there is no field in which I can generate a random number and obtain a random row.

Any suggestions?

Answer by JohnnyHK

Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:

// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])

If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:

// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
    { $match: { a: 10 } },
    { $sample: { size: 1 } }
])

As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
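
If duplicates matter, one option is to filter them out on the client after the pipeline returns. The following is a minimal sketch, assuming the sampled documents have already been collected into an array (for example with .toArray()) and that each _id has a stable string form. The helper name dedupeById is hypothetical and not part of MongoDB.

```javascript
// Hypothetical helper: keep only the first occurrence of each _id.
function dedupeById(docs) {
  const seen = new Set();
  const unique = [];
  for (const doc of docs) {
    const key = String(doc._id); // assumes _id has a stable string form
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(doc);
    }
  }
  return unique;
}
```

Note that the result may contain fewer documents than the requested size, so a caller that needs an exact number would have to sample again or request a few extra.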

Answer by ceejayoz

Do a count of all records, generate a random integer from 0 (inclusive) up to the count (exclusive), and then do:

db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
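
The offset has to be an integer from 0 up to, but not including, the count; otherwise skip() can run past the last document. Here is a small sketch of that step. The helper name randomSkipOffset is hypothetical, and rng is an injectable source of numbers in [0, 1) that defaults to Math.random.

```javascript
// Hypothetical helper: pick a valid skip offset for a collection of `count` documents.
function randomSkipOffset(count, rng = Math.random) {
  if (!Number.isInteger(count) || count <= 0) {
    throw new RangeError('count must be a positive integer');
  }
  // Math.floor keeps the result in 0 .. count - 1.
  return Math.floor(rng() * count);
}
```

The returned value would then be used as yourRandomNumber in the query above.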

Answer by Michael

Update for MongoDB 3.2

3.2 introduced $sample to the aggregation pipeline.

There's also a good blog post on putting it into practice.

For older versions (previous answer)

This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."

The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/

To paraphrase the recipe, you assign random numbers to your documents:

db.docs.save( { key : 1, ..., random : Math.random() } )

Then select a random document:

rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
  result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}

Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
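
To see why the fallback query exists, here is an in-memory sketch of the same two-step lookup. It is only an illustration: pickByRandomField is a hypothetical name, the documents live in a plain array rather than a collection, and it assumes each step returns the closest match (the smallest value at or above rand, otherwise the largest value at or below it), which findOne only does if the index order happens to provide it.

```javascript
// Hypothetical in-memory version of the cookbook lookup.
function pickByRandomField(docs, rand) {
  // Step 1: mirror { random: { $gte: rand } }, taking the smallest match.
  const above = docs
    .filter((d) => d.random >= rand)
    .sort((a, b) => a.random - b.random);
  if (above.length > 0) return above[0];
  // Step 2: mirror the fallback { random: { $lte: rand } }, taking the largest match.
  const below = docs
    .filter((d) => d.random <= rand)
    .sort((a, b) => b.random - a.random);
  return below.length > 0 ? below[0] : null;
}
```

With stored values 0.2 and 0.7, a rand of 0.9 matches nothing in the first step, so only the fallback returns a document.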

And of course you'll want to index on the random field:

db.docs.ensureIndex( { key : 1, random :1 } )

If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.

Answer by Nico de Poel

You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.

First, enable geospatial indexing on a collection:

db.docs.ensureIndex( { random_point: '2d' } )

To create a bunch of documents with random points on the X-axis:

for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}

Then you can get a random document from the collection like this:

db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )

Or you can retrieve several documents nearest to a random point:

db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )

This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
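
Conceptually, $near orders documents by distance from the query point, so with all points on the X-axis it returns the documents whose random_point is closest to the random value. A rough in-memory sketch of that ordering follows, assuming plain Euclidean distance; the helper name nearestDocs is hypothetical, and this is not how the 2d index is implemented internally.

```javascript
// Hypothetical helper: return the `limit` documents closest to `point`.
function nearestDocs(docs, point, limit = 1) {
  const dist = (p) => Math.hypot(p[0] - point[0], p[1] - point[1]);
  return docs
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => dist(a.random_point) - dist(b.random_point))
    .slice(0, limit);
}
```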

Answer by spam_eggs

The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.

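function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || { };
    // Only consider documents whose "random" value is at or below a fresh threshold
    query['random'] = { $lte: Math.random() };
    // Sort on the "random" field, descending, so the closest value below the threshold comes first
    var cur = collection.find(query).sort({ random: -1 });
    if (! cur.hasNext()) {
        // Nothing at or below the threshold: fall back to the whole result set
        delete query.random;
        cur = collection.find(query).sort({ random: -1 });
    }
    var doc = cur.next();
    // Re-roll the picked document's value so it is not favored on later draws
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}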

It also requires you to add a "random" field holding a random number to your documents, so don't forget to add it when you create them. You may need to initialize your collection as shown by Geoffrey:

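function addRandom(collection) {
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    });
}
// db.eval() has been removed from recent MongoDB versions;
// calling the function directly in the shell has the same effect.
addRandom(db.things);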
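
To make the selection logic easier to follow without a database, here is an in-memory sketch of the same steps. It is an illustration under stated assumptions, not the recipe itself: drawInMemory is a hypothetical name, docs is a plain array of objects with a numeric random field, and rng is an injectable source of numbers in [0, 1).

```javascript
// Hypothetical in-memory stand-in for draw(): docs is an array of { _id, random }.
function drawInMemory(docs, rng = Math.random) {
  if (docs.length === 0) return null;
  const threshold = rng();
  // Candidates with random <= threshold, mirroring { random: { $lte: ... } }.
  let candidates = docs.filter((d) => d.random <= threshold);
  // Fallback: nothing at or below the threshold, so consider every document.
  if (candidates.length === 0) candidates = docs;
  // Equivalent of sorting by random descending and taking the first result.
  const doc = candidates.reduce((best, d) => (d.random > best.random ? d : best));
  // Re-roll the picked document's value so it does not dominate later draws.
  doc.random = rng();
  return doc;
}
```

The re-roll after each pick is what distinguishes this recipe from the cookbook one: a document that was just returned gets a new position, so the gaps between values keep changing.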

Benchmark results

This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:

For a collection with 1,000,000 elements:

  • This method takes less than a millisecond on my machine

  • the skip() method takes 180 ms on average

The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.
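
One way to reason about this bias: if the cookbook query returns the document with the smallest random value at or above rand, each document is selected with probability equal to the gap between its value and the next smaller one, so a document with a close neighbour just below it is rarely chosen. The sketch below computes those probabilities for a list of stored values. The helper name cookbookPickProbabilities is hypothetical, and assigning the leftover range above the largest value to the largest document is an assumption about how the $lte fallback behaves.

```javascript
// Hypothetical helper: selection probability per stored random value.
function cookbookPickProbabilities(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const probs = sorted.map((v, i) => ({
    value: v,
    // Width of the interval of rand values that lands on this document.
    probability: v - (i === 0 ? 0 : sorted[i - 1]),
  }));
  if (probs.length > 0) {
    // rand above the largest value: assume the fallback returns the largest one.
    probs[probs.length - 1].probability += 1 - sorted[sorted.length - 1];
  }
  return probs;
}
```

For stored values 0.1, 0.5 and 0.6 this gives probabilities of roughly 0.1, 0.4 and 0.5, which is far from the uniform one third each.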

  • This method will pick all elements evenly over time.

  • In my benchmark it was only 30% slower than the cookbook method.

  • the randomness is not 100% perfect but it is very good (and it can be improved if necessary)

This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.

Answer by Blakes Seven

Here is a way using the default ObjectId values for _id and a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

That's the general logic in shell representation, and it is easily adaptable.

So in points:

  • Find the min and max primary key values in the collection

  • Generate a random number that falls between the timestamps of those documents.

  • Add the random number to the minimum value and find the first document that is greater than or equal to that value.

This pads the timestamp value in hex to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but follows the same basic idea as the points above.
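
The padding step can be isolated into small pure functions, which makes it easier to check outside the shell. This is a sketch under stated assumptions: the helper names objectIdHexFromSeconds and randomSecondsBetween are hypothetical, ids are handled as 24-character hex strings, and rng is an injectable source of numbers in [0, 1).

```javascript
// Build a 24-character ObjectId hex string from a timestamp in seconds:
// 8 hex characters of timestamp followed by 16 zero characters.
function objectIdHexFromSeconds(seconds) {
  return Math.floor(seconds).toString(16).padStart(8, '0') + '0'.repeat(16);
}

// Pick a random whole second between the timestamps embedded in two ObjectId hex strings.
function randomSecondsBetween(minIdHex, maxIdHex, rng = Math.random) {
  const min = parseInt(minIdHex.substring(0, 8), 16);
  const max = parseInt(maxIdHex.substring(0, 8), 16);
  return min + Math.floor(rng() * (max - min + 1));
}
```

In the shell, the resulting string could be passed to new ObjectId(...) and used in the $gte query shown earlier.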

Answer by dbam

Now you can use the aggregate. Example:

db.users.aggregate(
   [ { $sample: { size: 3 } } ]
)

See the doc.

Answer by Jabba

In Python using pymongo:

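import random

def get_random_doc():
    # `collection` is assumed to be an existing pymongo Collection object.
    # count_documents({}) replaces collection.count(), which newer pymongo versions no longer provide.
    count = collection.count_documents({})
    return collection.find()[random.randrange(count)]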

Answer by Daniel

Using Python (pymongo), the aggregate function also works.

collection.aggregate([{'$sample': {'size': sample_size }}])

This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int])). This is especially the case for large collections.

Answer by dm.

It is tough if there is no data there to key off of. What are the _id fields? Are they MongoDB ObjectIds? If so, you could get the highest and lowest values:

lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;

Then, if you assume the ids are uniformly distributed (they aren't, but at least it's a start):

unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)

V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();

randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
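
The pseudocode above can be sketched in JavaScript using BigInt, since 8 bytes do not fit in a regular JavaScript number without losing precision. This is only a sketch under stated assumptions: randomOidHexBetween is a hypothetical name, ids are handled as 24-character hex strings, and rng is an injectable source of numbers in [0, 1).

```javascript
// Hypothetical helper: build a random 24-character id hex string between two ids.
function randomOidHexBetween(lowestHex, highestHex, rng = Math.random) {
  // First 8 bytes = first 16 hex characters of the 24-character id.
  const L = BigInt('0x' + lowestHex.substring(0, 16));
  const H = BigInt('0x' + highestHex.substring(0, 16));
  // Scale (H - L) by a random fraction with 32 bits of resolution,
  // because a BigInt cannot be multiplied by a floating-point number.
  const fraction = BigInt(Math.floor(rng() * 2 ** 32));
  const N = L + ((H - L) * fraction) / (2n ** 32n);
  // Last 4 bytes: 8 random hex characters.
  const tail = Math.floor(rng() * 2 ** 32).toString(16).padStart(8, '0');
  return N.toString(16).padStart(16, '0') + tail;
}
```

The returned string plays the role of oid in the final query; in the shell it would need to be wrapped in an ObjectId before being compared with _id.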