Linux mmap and memory usage
Original question: http://stackoverflow.com/questions/10303534/
This page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
mmap and memory usage
Asked by Elektito
I am writing a program that receives huge amounts of data (in pieces of different sizes) from the network, processes them and writes them to memory. Since some pieces of data can be very large, my current approach is limiting the buffer size used. If a piece is larger than the maximum buffer size, I write the data to a temporary file and later read the file in chunks for processing and permanent storage.
I'm wondering if this can be improved. I've been reading about mmap for a while but I'm not one hundred percent sure if it can help me. My idea is to use mmap for reading the temporary file. Does this help in any way? The main thing I'm concerned about is that an occasional large piece of data should not fill up my main memory causing everything else to be swapped out.
Also, do you think the approach with temporary files is useful? Should I even be doing that, or should I trust the Linux memory manager to do the job for me? Or should I do something else altogether?
Accepted answer by Carlos Lint
Mmap can help you in some ways. I'll explain with some hypothetical examples:
First: suppose you are running out of memory and your application holds a 100MB chunk of malloc'ed memory, 50% of which gets swapped out. The OS had to write 50MB to the swapfile, and if you need that data again it must be read back. You have written, occupied, and then read back 50MB of swap.
If the memory was mmap'ed from a file instead, the operating system does not write it to the swapfile, because it knows the data is identical to the file itself. It simply discards the 50MB (again, assuming you have not written to the mapping) and that's that. If you need that memory again, the OS fetches the contents from the file you mapped rather than from the swapfile, so the 50MB of swap stays available to any other program that needs it. There is also no swapfile overhead at all.
Second: suppose you read a 100MB chunk of data and, according to the initial 1MB of header data, the information you want is located at offset 75MB. You don't need anything between 1MB and 74.9MB, yet you read it anyway just to keep your code simple. With mmap, only the data you actually access is read (rounded to the OS page size, which is usually 4KB), so only the first and the 75th MB would be read. I think it's very hard to find a simpler and more effective way to avoid disk reads than mmaping files. And if for some reason you need the data at offset 37MB, you can just use it. You don't have to mmap it again, as the whole file is accessible in memory (limited, of course, by your process's address space).
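As a rough illustration of the header-then-jump access pattern described above, here is a minimal Python sketch using the standard `mmap` module, which wraps the same system call. The header layout (a single 64-bit payload offset) and the names `read_payload` and `make_sample_file` are assumptions made up for this example, not part of the original answer.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header: one little-endian 64-bit integer holding the payload offset.
HEADER_FORMAT = "<Q"


def read_payload(path, payload_len):
    """Read the header, then only the payload it points to."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded only when first accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (offset,) = struct.unpack_from(HEADER_FORMAT, mm, 0)
            # Slicing touches only the pages that cover this byte range.
            return mm[offset:offset + payload_len]


def make_sample_file(size, offset, payload):
    """Demo helper: create a temporary file with a header and a payload."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.truncate(size)                            # reserve the full size
        f.write(struct.pack(HEADER_FORMAT, offset)) # header at offset 0
        f.seek(offset)
        f.write(payload)
    return path
```

Calling `read_payload(make_sample_file(8 * 2**20, 6 * 2**20, b"payload"), 7)` should return the payload bytes without the program ever reading the region in between explicitly. The kernel may still read ahead a little around each accessed page.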
All mmap'ed files are backed by themselves, not by the swapfile. The swapfile exists to back data that has no file behind it. That is usually malloc'ed data, or pages of a private (MAP_PRIVATE) file mapping that were modified and therefore must not be written back to the file. Dirty pages of a MAP_SHARED mapping are written back to the file itself, and an msync call lets the program force that write-back at a specific point.
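The difference between a private and a shared mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on the Unix-only `flags` and `prot` parameters of `mmap.mmap`, and `modify_first_byte` is a hypothetical helper name. The file must be non-empty.

```python
import mmap


def modify_first_byte(path, flags):
    """Overwrite the first byte of `path` through a mapping created with `flags`."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, flags=flags,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as mm:
            mm[0:1] = b"X"
            if flags == mmap.MAP_SHARED:
                # flush() calls msync, forcing the write-back to the file now.
                mm.flush()
```

With `mmap.MAP_PRIVATE` the change stays in the process's private copy of the page and the file is left untouched. With `mmap.MAP_SHARED` the change reaches the file.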
Note that you don't need to map the whole file into memory. You can map any amount (2nd argument, "size_t length") starting from any page-aligned position (6th argument, "off_t offset"). Unless your file is likely to be enormous, though, you can safely map 1GB of data for reading, even if the system only has 64MB of physical memory. If you plan on writing, be more conservative and map only the parts you need.
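Mapping only a window of the file could look roughly like the following Python sketch. The helper name `read_window` is an assumption for this example. The offset passed to mmap has to be a multiple of the allocation granularity (the page size on Linux), so the sketch rounds it down and skips the extra leading bytes.

```python
import mmap


def read_window(path, offset, length):
    """Return `length` bytes at `offset`, mapping only the pages that cover them."""
    granularity = mmap.ALLOCATIONGRANULARITY       # equals the page size on Linux
    aligned = (offset // granularity) * granularity  # round down to a valid offset
    delta = offset - aligned                         # bytes to skip inside the window
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as mm:
            return mm[delta:delta + length]
```

The caller is assumed to keep `offset + length` within the file size. Python raises `ValueError` if the requested window extends past the end of the file.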
Mapping files will help make your code simpler (the file contents are already in memory, ready to use, with much less memory overhead since it is not anonymous memory) and faster (only the data your program accesses is read).
Answer by Jon Ander Ortiz Durántez
The main advantage of mmap with big files is sharing the same memory mapping between two or more processes: if you mmap with MAP_SHARED, the file is loaded into memory only once for all the processes that use the data, which saves memory.
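A small Python sketch can show that MAP_SHARED mappings of one file refer to the same underlying pages. For simplicity both mappings live in a single process here; this is an assumption of the example, and the expectation is that mappings in separate processes behave the same way. The function name is hypothetical, and the `flags` and `prot` parameters are Unix-only.

```python
import mmap


def shared_mappings_see_same_data(path):
    """Write through one MAP_SHARED mapping and read through another."""
    with open(path, "r+b") as f1, open(path, "rb") as f2:
        with mmap.mmap(f1.fileno(), 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as writer:
            with mmap.mmap(f2.fileno(), 0, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ) as reader:
                writer[0:4] = b"ping"
                # The second mapping sees the change without any file I/O call.
                return reader[0:4]
```

The file passed in must be at least 4 bytes long, since a mapping cannot grow the file.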
But AFAIK, mmap maps the entire file into memory (the original answer links to examples of how mmap fails with files bigger than physical memory plus swap space), so if you access the file from a single process, it will not help you with physical memory consumption.
Answer by user1277476
I believe mmap doesn't require all data to be in memory at the same moment - it uses the page cache to keep recently used pages in memory, and the rest on disk.
If you are reading one chunk at a time, using a temporary file probably won't help you, but if you are reading multiple chunks concurrently using multiple threads, processes, or using select/poll, then it might.