Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/4146452/

Published: 2020-08-23 11:05:51 | Source: igfitidea
MongoDB - what is the fastest way to update all records in a collection?

Tags: javascript, performance, mongodb

Asked by mattjvincent

I have a collection with 9 million records. I am currently using the following script to update the entire collection:


simple_update.js


db.mydata.find().forEach(function(data) {
  db.mydata.update({_id:data._id},{$set:{pid:(2571 - data.Y + (data.X * 2572))}});
});

This is run from the command line as follows:


mongo my_test simple_update.js

So all I am doing is adding a new field pid based upon a simple calculation.


Is there a faster way? This takes a significant amount of time.


Accepted answer by Gates VP

There are two things that you can do.


  1. Send an update with the 'multi' flag set to true.
  2. Store the function server-side and try using server-side code execution.

That link also contains the following advice:


This is a good technique for performing batch administrative work. Run mongo on the server, connecting via the localhost interface. The connection is then very fast and low latency. This is friendlier than db.eval() as db.eval() blocks other operations.


This is probably the fastest you'll get. You have to realize that issuing 9M updates on a single server is going to be a heavy operation. Let's say that you could get 3k updates/second; you're still talking about running for nearly an hour.

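The back-of-the-envelope arithmetic behind that estimate, as a quick check (the 3k updates/second figure is the answer's assumption, not a measurement):

```javascript
// 9 million updates at an assumed 3,000 updates/second
const totalDocs = 9_000_000;
const updatesPerSecond = 3_000;
const seconds = totalDocs / updatesPerSecond; // 3000 seconds
const minutes = seconds / 60;                 // 50 minutes
console.log(minutes); // 50 — i.e. nearly an hour
```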

And that's not really a "mongo problem", that's going to be a hardware limitation.


Answered by Telmo Dias

I am using the db.collection.update method:


// db.collection.update( criteria, objNew, upsert, multi ) // --> for reference
// objNew and upsert are placeholders for your update document and upsert flag
db.collection.update( { "_id" : { $exists : true } }, objNew, upsert, true);

Answered by shijin

I won't recommend using {multi: true} for a larger data set, because it is less configurable.


A better way is to use bulk operations.


Bulk operations are really helpful for scheduled tasks. Say you have to delete data older than 6 months daily: use a bulk operation. It's fast and won't slow down the server. CPU and memory usage are not noticeable when you insert, delete or update over a billion documents. I found {multi: true} slowing down the server when dealing with a million+ documents (this needs more research).


See the sample below. It's a JS shell script; you can also run it on the server as a Node program (use the npm module shelljs or similar to achieve this).


Update mongo to 3.2+.


The normal way of updating multiple unique documents is:


let counter = 0;
db.myCol.find({}).sort({$natural:1}).limit(1000000).forEach(function(document){
    counter++;
    document.test_value = "just testing" + counter
    db.myCol.save(document)
});

It took 310-315 seconds when I tried. That's more than 5 minutes for updating a million documents.


My collection includes 100 million+ documents, so speed may differ for others.


The same using bulk insert is


let counter = 0;
// magic no.- depends on your hardware and document size. - my document size is around 1.5kb-2kb
// performance reduces when this limit is not in 1500-2500 range.
// try different range and find fastest bulk limit for your document size or take an average.
let limitNo = 2222; 
let bulk = db.myCol.initializeUnorderedBulkOp();
let noOfDocsToProcess = 1000000;
db.myCol.find({}).sort({$natural:1}).limit(noOfDocsToProcess).forEach(function(document){
    counter++;
    noOfDocsToProcess --;
    limitNo--;
    bulk.find({_id:document._id}).update({$set:{test_value : "just testing .. " + counter}});
    if(limitNo === 0 || noOfDocsToProcess === 0){
        bulk.execute();
        bulk = db.myCol.initializeUnorderedBulkOp();
        limitNo = 2222;
    }
});
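The flush logic above (execute a batch whenever the batch limit or the remaining document count reaches zero) can be sanity-checked outside the mongo shell. Below is a minimal sketch using a stubbed bulk object and deliberately tiny numbers; the stub and its names are illustrative only, not part of the MongoDB API:

```javascript
// Stub standing in for db.myCol.initializeUnorderedBulkOp();
// it only counts how many times execute() is called.
function makeBulkStub(state) {
  return {
    find() { return { update() {} }; },
    execute() { state.executions++; }
  };
}

const state = { executions: 0 };
let bulk = makeBulkStub(state);
let limitNo = 3;             // batch size (2222 in the real script)
let noOfDocsToProcess = 10;  // documents to process (1,000,000 in the real script)

for (let i = 0; i < 10; i++) {
  noOfDocsToProcess--;
  limitNo--;
  bulk.find({}).update({});
  if (limitNo === 0 || noOfDocsToProcess === 0) {
    bulk.execute();
    bulk = makeBulkStub(state);
    limitNo = 3;
  }
}

// 10 docs in batches of 3 → flushes after docs 3, 6, 9 and the final doc
console.log(state.executions); // 4
```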

The best time was 8972 millis. So on average it took only 10 seconds to update a million documents, 30 times faster than the old way.


Put the code in a .js file and execute it as a mongo shell script.


If someone finds a better way, please update. Let's use mongo in a faster way.


Answered by Xavier Guihot

Starting in Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update/creation of a field based on another field, and thus allowing us to apply this kind of query fully server-side:


// { Y: 456,  X: 3 }
// { Y: 3452, X: 2 }
db.collection.update(
  {},
  [{ $set: { pid: {
    $sum: [ 2571, { $multiply: [ -1, "$Y" ] }, { $multiply: [ 2572, "$X" ] } ]
  }}}],
  { multi: true }
)
// { Y: 456,  X: 3, pid: 9831 }
// { Y: 3452, X: 2, pid: 4263 }
  • The first part, {}, is the match query, filtering which documents to update (all documents in this case).

  • The second part, [{ $set: { pid: ... } }], is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator and an alias of $addFields. Note how pid is created directly based on the values of X ($X) and Y ($Y) from the same document.

  • Don't forget { multi: true }, otherwise only the first matching document will be updated.

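The $sum/$multiply expression above encodes pid = 2571 - Y + 2572 * X, the same calculation as the original script. A quick plain-JavaScript check against the two sample documents (the helper name is hypothetical):

```javascript
// Same arithmetic the aggregation pipeline performs server-side.
const pid = (X, Y) => 2571 - Y + 2572 * X;

console.log(pid(3, 456));  // 9831 — matches { Y: 456,  X: 3 }
console.log(pid(2, 3452)); // 4263 — matches { Y: 3452, X: 2 }
```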

Answered by Gandalf

Not sure if it will be any faster, but you could do a multi-update. Just say update where _id > 0 (this will be true for every object), then set the 'multi' flag to true, and it should do the same without having to iterate through the entire collection.


Check this out: MongoDB - Server Side Code Execution
