Find largest document size in MongoDB

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16953282/

Date: 2020-09-09 13:15:55 | Source: igfitidea

mongodb

Asked by sashkello

Is it possible to find the largest document size in MongoDB?

db.collection.stats() shows the average size, which is not really representative because in my case sizes can differ considerably.

Answered by Abhishek Kumar

You can use a small shell script to get this value.

Note: this will perform a full collection scan, which will be slow on large collections.

let max = 0, id = null;
db.test.find().forEach(doc => {
    const size = Object.bsonsize(doc); 
    if(size > max) {
        max = size;
        id = doc._id;
    } 
});
print(id, max);

Answered by Mike Graf

Note: this will attempt to store the whole result set in memory (from .toArray). Be careful on big data sets. Do not use in production! Abhishek's answer has the advantage of working over a cursor instead of an in-memory array.

If you also want the _id, try this. Given a collection called "requests":

// Builds a list of { size, _id } pairs for every document, sorts it
// ascending by size, then pops the last (largest) element
db.requests.find().toArray()
    .map(function(request) { return { size: Object.bsonsize(request), _id: request._id }; })
    .sort(function(a, b) { return a.size - b.size; })
    .pop();

// { "size" : 3333, "_id" : "someUniqueIdHere" }

Answered by Dan Dascalescu

Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers if you use the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, getting all documents to the client).

You need to know which field(s) in your document might be the largest ones, which you almost always will know. There are only two practical¹ MongoDB types that can have variable sizes:

  • arrays
  • strings

The aggregation framework can calculate the length of each. Note that you won't get the size in bytes for arrays, but the length in elements. However, what typically matters more is which documents are the outliers, not exactly how many bytes they take.

Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):

db.users.aggregate([
    { $match: {
        'friends.ids': { $exists: true }
    }},
    { $project: { 
        sizeLargestField: { $size: '$friends.ids' } 
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])

The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:

db.users.aggregate([
    { $match: {
        bio: { $exists: true }
    }},
    { $project: { 
        sizeLargestField: { $strLenBytes: '$bio' } 
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])

You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.


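As a sketch of that combination, reusing the friends.ids and bio fields from the examples above. The $ifNull defaults are an assumption added to guard against documents missing either field, since $size and $strLenBytes both raise an error on missing input:

```javascript
// Pipeline ranking documents by the combined element count of an array
// field plus the byte length of a string field. Field names are the
// illustrative ones from the examples above.
const pipeline = [
  { $project: {
      approxSize: { $sum: [
        { $size:        { $ifNull: ['$friends.ids', []] } },  // elements
        { $strLenBytes: { $ifNull: ['$bio', ''] } }           // bytes
      ] }
  } },
  { $sort:  { approxSize: -1 } },  // largest first
  { $limit: 10 }                   // top 10 outliers
];
// In the shell: db.users.aggregate(pipeline)
```

Note the units are mixed (elements plus bytes), which is fine here because the goal is to surface outlier documents, not to compute exact sizes.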


¹ Technically, the rarely used binData type can also have variable size.

Answered by Elad Nava

If you're working with a huge collection, loading it all into memory at once will not work, since you'd need more RAM than the size of the entire collection.

Instead, you can process the entire collection in batches using the following package I created: https://www.npmjs.com/package/mongodb-largest-documents

All you have to do is provide the MongoDB connection string and collection name. The script will output the top X largest documents when it finishes traversing the entire collection in batches.
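
The core idea (stream the collection and keep only a bounded top-N in memory) can be sketched independently of the package. The helper below is hypothetical, not the package's actual API:

```javascript
// Maintain the N largest entries seen so far using O(N) memory, so an
// arbitrarily large stream of { _id, size } records can be scanned.
// `records` would be fed batch by batch from a driver cursor in practice.
function topNBySize(records, n) {
  const top = [];
  for (const rec of records) {
    top.push(rec);
    top.sort((a, b) => b.size - a.size);  // largest first
    if (top.length > n) top.pop();        // drop the current smallest
  }
  return top;
}
```

A production version would sort less often (as in the last answer on this page) rather than after every record.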


Answered by ymz

Well... this is an old question, but I thought I'd share my two cents about it.

My approach: use Mongo's mapReduce function.

First, let's get the size of each document:

db.myCollection.mapReduce
(
   function() { emit(this._id, Object.bsonsize(this)) }, // map the result to be an id / size pair for each document
   function(key, val) { return val }, // val = document size value (single value for each document)
   { 
       query: {}, // query all documents
       out: { inline: 1 } // just return result (don't create a new collection for it)
   } 
)

This will return all document sizes, although it is worth mentioning that saving the output as a collection is a better approach (the result is an array of results inside the result field).

Second, let's get the max document size by adapting this query:

db.myCollection.mapReduce
(
    function() { emit(0, Object.bsonsize(this))}, // mapping a fake id (0) and use the document size as value
    function(key, vals) { return Math.max.apply(Math, vals) }, // use Math.max function to get max value from vals (each val = document size)
    { query: {}, out: { inline: 1 } } // same as first example
)

This will give you a single result whose value equals the max document size.

In short:

you may want to use the first example, save its output as a collection (change the out option to the name of the collection you want), and apply further aggregations to it (max size, min size, etc.)

-OR-

you may want to use a single query (the second option) for getting a single stat (min, max, avg, etc.)


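As a concrete sketch of the first variant, where the collection name myCollection and the output collection name documentSizes are placeholders:

```javascript
// Options object for the first mapReduce example, writing results to a
// collection named "documentSizes" instead of returning them inline.
const mapReduceOptions = {
  query: {},           // scan every document
  out: 'documentSizes' // persist { _id, value: <bson size> } pairs
};
// In the shell:
//   db.myCollection.mapReduce(
//     function() { emit(this._id, Object.bsonsize(this)) },
//     function(key, val) { return val },
//     mapReduceOptions
//   )
//   db.documentSizes.find().sort({ value: -1 }).limit(5)  // five largest
```
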
Answered by Xavier Guihot

Starting in Mongo 4.4, the new aggregation operator $bsonSize returns the size in bytes of a given document when encoded as BSON.

Thus, in order to find the BSON size of the biggest document:

// { "_id" : ObjectId("5e6abb2893c609b43d95a985"), "a" : 1, "b" : "hello" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a986"), "c" : 1000, "a" : "world" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a987"), "d" : 2 }
db.collection.aggregate([
  { $group: {
    _id: null,
    max: { $max: { $bsonSize: "$$ROOT" } }
  }}
])
// { "_id" : null, "max" : 46 }

This:

  • $groups all documents together
  • takes the $max of each document's $bsonSize
  • $$ROOT represents the current document whose BSON size we measure
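
If you also want to know which document is the largest, not just how large it is, the same operator can feed a $sort/$limit instead of a $group (a small variation, not part of the original answer):

```javascript
// Pipeline returning the _id and BSON size of the single largest document.
const pipeline = [
  { $project: { size: { $bsonSize: '$$ROOT' } } },  // _id is kept by default
  { $sort:  { size: -1 } },
  { $limit: 1 }
];
// In the shell: db.collection.aggregate(pipeline)
```
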

Answered by u890106

Inspired by Elad Nava's package, but usable in a MongoDB console:

function biggest(collection, limit=100, sort_delta=100) {
  var documents = [];
  var cursor = collection.find().readPref("nearest");
  while (cursor.hasNext()) {
    var doc = cursor.next();
    var size = Object.bsonsize(doc);
    // Keep the document only if it can still make the top `limit`
    if (documents.length < limit || size > documents[limit - 1].size) {
      documents.push({ id: doc._id.toString(), size: size });
    }
    // Periodically sort descending by size and trim back to `limit`
    if (documents.length > (limit + sort_delta) || !cursor.hasNext()) {
      documents.sort(function (first, second) {
        return second.size - first.size;
      });
      documents = documents.slice(0, limit);
    }
  }
  return documents;
}

biggest(db.collection)
  • Uses a cursor
  • Gives a list of the limit biggest documents, not just the biggest one
  • Sorts and cuts the output list back to limit every sort_delta documents
  • Uses nearest as the read preference (you might also want to use rs.slaveOk() on the connection to be able to list collections if you're on a slave node)