MongoDB as file storage

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/15030532/


MongoDB as file storage

Tags: mongodb, storage, gridfs, bigdata

Asked by cmd

I'm trying to find the best solution for creating scalable storage for big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes.


I have found some information about Hadoop and its HDFS, but it looks a little complicated, because I don't need any Map/Reduce jobs or many of the other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.


And now the questions:


  1. What will happen with GridFS when I try to write a few files concurrently? Will there be any lock on read/write operations? (I will use it only as file storage.)
  2. Will files from GridFS be cached in RAM, and how will that affect read/write performance?
  3. Maybe there are some other solutions that can solve my problem more efficiently?

Thanks.


Accepted answer by Sammaye

I can only answer for MongoDB here; I won't pretend to know much about HDFS and other such technologies.


The GridFS implementation is entirely client-side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/).

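To make that concrete, here is a minimal sketch assuming the Python driver (pymongo); the connection string and database name are just illustrative. Storing a file through GridFS simply produces ordinary documents in the fs.files and fs.chunks collections, which the server treats like any other collections:

```python
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
db = client["storage_demo"]                        # hypothetical database name
fs = gridfs.GridFS(db)                             # default bucket name is "fs"

# The driver splits the payload into chunk documents; the server just stores them.
file_id = fs.put(b"some binary payload" * 1000, filename="example.bin")

# Both collections are plain collections -- nothing file-specific on the server side.
print(db.fs.files.find_one({"_id": file_id}))               # the metadata document
print(db.fs.chunks.count_documents({"files_id": file_id}))  # how many chunk documents
```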

This means that querying for any part of the files or chunks collections results in the same process as any other query: MongoDB loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ), which represents the set of data (or all loaded data at that time) that MongoDB requires within a given time frame to maintain optimal performance. It does this by paging the data into RAM (well, technically the OS does).


Another point to take into consideration is that this is implemented by the driver. This means the behaviour could vary between drivers; however, I don't think it does. All drivers allow you to query for a set of documents from the files collection, which houses only the files' metadata, allowing you to later serve the file itself from the chunks collection with a single query.

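As a hedged sketch of that pattern (again assuming pymongo; the file name carries over from the sketch above), you can list files from their metadata alone and only touch the chunk data when you actually serve the content:

```python
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["storage_demo"]  # hypothetical database
fs = gridfs.GridFS(db)

# Metadata-only query: touches fs.files, not the (much larger) fs.chunks collection.
for meta in db.fs.files.find({}, {"filename": 1, "length": 1, "uploadDate": 1}):
    print(meta["filename"], meta["length"], "bytes")

# Serving the content is what pulls the chunk documents into the working set.
grid_out = fs.get_last_version(filename="example.bin")
with open("/tmp/example.bin", "wb") as target:
    for chunk in grid_out:   # a GridOut object can be read chunk by chunk
        target.write(chunk)
```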

However, that is not the important thing: you want to serve the file itself, including its data; this means you will be loading the files collection and its corresponding chunks collection into your working set.


With that in mind we have already hit the first snag:


Will files from GridFS be cached in RAM, and how will that affect read/write performance?


The read performance of small files could be awesome, directly from RAM; the writes would be just as good.


For larger files, not so. Most computers will not have 600 GB of RAM, and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem: that file, in order to be served, needs to fit into your working set, yet it is far bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better.


The only way around this is to start spreading a single file across many shards :\


Note: one more thing to consider is that the default size of a chunks "chunk" document is 256 KB, so that's a lot of documents for a 600 GB file. This setting is adjustable in most drivers.

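As one example of tweaking it, pymongo's GridFSBucket API lets you set the chunk size per bucket or per upload (the sizes below are purely illustrative, not recommendations):

```python
from gridfs import GridFSBucket
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["storage_demo"]  # hypothetical database

# Bucket-wide setting: every upload through this bucket uses 1 MB chunks.
bucket = GridFSBucket(db, chunk_size_bytes=1024 * 1024)

# Per-upload override for a single file.
with open("/tmp/example.bin", "rb") as source:  # assumes the file from the earlier sketch
    file_id = bucket.upload_from_stream(
        "example.bin",
        source,
        chunk_size_bytes=4 * 1024 * 1024,  # 4 MB chunks for this file only
    )
```

Fewer, larger chunks mean fewer documents in the chunks collection for a huge file, at the cost of larger individual reads.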

What will happen with GridFS when I try to write a few files concurrently? Will there be any lock on read/write operations? (I will use it only as file storage.)


GridFS, being only a specification, uses the same locks as any other collection: both read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?


That being said, the possibility of contention exists depending on your scenario specifics: traffic, the number of concurrent writes/reads, and many other things we have no idea about.


Maybe there are some other solutions that can solve my problem more efficiently?


I personally have found that S3 (as @mluggy said) with reduced-redundancy storage works best, storing only a small portion of metadata about the file within MongoDB, much like using GridFS but without the chunks collection; let S3 handle all of that distribution, backup and other stuff for you.

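A minimal sketch of that split, assuming pymongo and boto3 as the clients (the bucket, database and field names are all made up for illustration): the bytes go to S3 with the reduced-redundancy storage class, and only a small metadata document goes into MongoDB.

```python
from datetime import datetime, timezone

import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
files = MongoClient("mongodb://localhost:27017")["storage_demo"]["files"]  # hypothetical collection

def store(local_path: str, bucket: str, key: str) -> None:
    # Upload the actual bytes to S3; boto3 transparently uses multipart for large files.
    s3.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": "REDUCED_REDUNDANCY"},
    )
    # Keep only lightweight metadata in MongoDB -- no chunks collection needed.
    files.insert_one({
        "bucket": bucket,
        "key": key,
        "uploaded_at": datetime.now(timezone.utc),
    })

store("/tmp/example.bin", "my-file-bucket", "example.bin")  # assumed bucket name
```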

Hopefully I have been clear, hope it helps.


Edit: Unlike what I accidentally said, MongoDB does not have a collection-level lock; it is a database-level lock.


Answer by mluggy

Have you considered saving the metadata in MongoDB and writing the actual files to Amazon S3? Both have excellent drivers, and the latter is highly redundant, cloud/CDN-ready file storage. I would give it a shot.


Answer by Christopher WJ Rueber

I'll start by answering the first two:


  1. There is a write lock when writing into GridFS, yes. No lock for reads.
  2. The files won't be cached in memory when you query them, but their metadata will be.

GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100 MB. Despite that, it is generally a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files into DBs, from network timeouts to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

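To illustrate that kind of client-side chunking (a generic sketch, not tied to Riak or any particular store; the 64 MB size and the upload_part callback are placeholders), a huge file can be streamed in fixed-size pieces so that no single request has to carry hundreds of gigabytes, and a failed piece can be retried on its own after a network timeout:

```python
from typing import BinaryIO, Callable, Iterator

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per piece -- an arbitrary illustrative value

def iter_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size pieces of a file without loading it all into memory."""
    while True:
        piece = source.read(chunk_size)
        if not piece:
            break
        yield piece

def chunked_upload(path: str, upload_part: Callable[[int, bytes], None]) -> int:
    """Stream a huge file through upload_part(index, data); returns the part count."""
    count = 0
    with open(path, "rb") as source:
        for index, piece in enumerate(iter_chunks(source)):
            upload_part(index, piece)  # e.g. one S3 multipart part or one Riak object per piece
            count += 1
    return count
```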