Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4555938/

Auto compact the deleted space in mongodb?

Tags: mongodb, diskspace, repair

Asked by Zealot Ke

The MongoDB documentation says:

To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).

at http://www.mongodb.org/display/DOCS/Excessive+Disk+Space

I wonder how to make MongoDB free the deleted disk space automatically?

P.S. We store many download tasks in MongoDB, up to 20GB, and finish them within half an hour.

Answered by Justin Jenkins

In general, if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation, and the more space MongoDB can allocate in datafiles, the less fragmentation you will have.

So, you should try to provide as much disk-space as possible for the database.

However, if you must shrink the database, you should keep two things in mind.

  1. MongoDB grows its data files by doubling, so the datafiles may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and each new file is allocated at 2GB).

  2. As with almost any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So you'll need to create a job using your OS's scheduler to run the shrink. You could use a bash script, or have a job run a PHP script, etc.

Server-side JavaScript

You can use server-side JavaScript to do the shrink, and run that JS via the mongo shell on a regular basis via a job (like cron or the Windows scheduling service) ...

Assuming a collection called foo, you would save the JavaScript below into a file called bar.js and run ...

$ mongo foo bar.js

The JavaScript file would look something like ...

// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();

print('Storage Size: ' + tojson(storage));

print('TotalSize: ' + tojson(total));

print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');

// Run repair
db.repairDatabase();

// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();

print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));

This will run and return something like ...

MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153

Run this on a schedule (during non-peak hours) and you are good to go.

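For example, you could schedule it with a crontab entry along these lines (a sketch only; the 3 a.m. Sunday slot, the mongo path and the log file are placeholder assumptions):

# m  h  dom mon dow   command
0    3  *   *   0     /usr/bin/mongo foo /path/to/bar.js >> /var/log/mongo-repair.log 2>&1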

Capped Collections

However, there is one other option: capped collections.

Capped collections are fixed sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.

In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.

Basically, you can limit the size of (or the number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.

This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.

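As a sketch of how that could look for the workload in the question (the collection name "tasks" and the 20GB figure are only example assumptions), you would create the capped collection up front from the mongo shell:

// Create a capped collection limited to roughly 20GB on disk (size is in bytes);
// once it is full, the oldest documents are aged out automatically in insertion order.
db.createCollection("tasks", {capped: true, size: 20 * 1024 * 1024 * 1024});

// Optionally also cap the number of documents with "max":
// db.createCollection("tasks", {capped: true, size: 20 * 1024 * 1024 * 1024, max: 5000000});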

Answered by Mojo

I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.

You must be using a replica set.

My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.

The process is time-consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().

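If the member you want to wipe is currently the primary, that hand-off is the only step that interrupts writes; a minimal sketch from the mongo shell (the 60-second argument is just an example):

// Connected to the current primary.
rs.stepDown(60);   // step down and refuse re-election for 60 seconds

// Confirm another member has taken over before shutting this one down.
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr);
});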

Also, this cannot be automated. Well, it could, but I don't think I'm willing to try.

Answered by Robert Jobson

Running db.repairDatabase() will require that you have free space on the file system equal to the current size of the database. This can be bothersome when you know that the collections you will keep, or the data you need to retain in the database, would currently use much less space than what is allocated, and you do not have enough space to make the repair.

As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.

use cleanup_database
db.dropDatabase();

use oversize_database

db.collection.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("cleanup_database");
    db.collection_subset.insert(doc);
});

use oversize_database
db.dropDatabase();

use cleanup_database

db.collection_subset.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("oversize_database");
    db.collection.insert(doc);
});

use oversize_database

<add indexes>
db.collection.ensureIndex({field:1});

use cleanup_database
db.dropDatabase();

An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.

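A rough sketch of what that export/drop/import could look like with mongodump and mongorestore (untested, as noted above; the dump directory is just an example):

# dump the data you want to keep, drop the bloated database, then restore into fresh datafiles
mongodump --db oversize_database --out /tmp/dump
mongo oversize_database --eval "db.dropDatabase()"
mongorestore --db oversize_database /tmp/dump/oversize_database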

Also, as a policy, you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes next run. Just make sure your jobs include creating any necessary indexes at an appropriate time.

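A small sketch of that policy from the mongo shell (the database, collection and field names here are only examples):

// At the start of each job: recreate whatever indexes the processing needs.
var work = db.getSiblingDB("processing_database");
work.tasks.ensureIndex({status: 1});

// ... insert and process the transient task documents ...

// Once the job completes: drop the whole database, reclaiming its disk space in one step.
work.dropDatabase();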

Answered by Adam Comerford

If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.

To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), which effectively does the same thing as a repair, and the disk space is reclaimed.

By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.

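One quick way to sanity-check the oplog window before wiping a member is db.printReplicationInfo() from the mongo shell on the primary (a sketch; the span it reports must comfortably exceed the time a full resync takes):

// Reports the configured oplog size and the time span ("log length start to end") it currently covers.
db.printReplicationInfo()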

To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:

NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES

#!/bin/bash 

# First arg is host MongoDB is running on, second arg is the MongoDB port

MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath

# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"

# Now shut down that server 
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop

# Wipe the data files for that server

ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH

# Start up server again
# similar to shutdown something like 
ssh -t user@$MONGOHOST sudo service mongodb start