C++: mmap() vs. reading blocks

Note: this page is a mirror/translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site): StackOverflow, original: http://stackoverflow.com/questions/45972/

mmap() vs. reading blocks

Tags: c++, file-io, fstream, mmap

Asked by jbl

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

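A minimal sketch of that block-read-and-carry pattern might look like the following (parse_record is a hypothetical stand-in for the record parser: it returns the bytes consumed for one complete record, or 0 if the buffer ends mid-record; records are assumed to fit in the buffer):

#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

std::size_t parse_record(const char* p, std::size_t n);   // hypothetical: bytes consumed, or 0 if incomplete

void scan(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);                        // 1 MiB block
    std::size_t leftover = 0;                              // tail of a partial record carried between reads

    while (in)
    {
        in.read(buf.data() + leftover, buf.size() - leftover);
        std::size_t avail = leftover + static_cast<std::size_t>(in.gcount());
        std::size_t pos = 0;
        while (pos < avail)
        {
            std::size_t used = parse_record(buf.data() + pos, avail - pos);
            if (used == 0) break;                          // incomplete record: read more first
            pos += used;
        }
        leftover = avail - pos;
        std::memmove(buf.data(), buf.data() + pos, leftover);   // move the partial record to the front
    }
}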

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

Answered by Dietrich Epp

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

  • A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.
  • The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.

However,

  • Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
  • Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache, and this kind of foolery rarely helps system performance.)
  • Reading a file directly is very simple and fast.

The discussion of mmap/read reminds me of two other performance discussions:

  • Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.

  • Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real-world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.

(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)

Answered by Tim Cooper

The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.

I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.

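If you want to run that kind of comparison yourself, a minimal timing harness along these lines is usually enough (assuming a test file named file.bin; the byte-touch loop stands in for real per-block work):

#include <chrono>
#include <fstream>
#include <iostream>

volatile char sink;   // keeps the compiler from optimizing the touch loop away

int main()
{
    char data[0x1000];
    auto t0 = std::chrono::steady_clock::now();

    std::ifstream in("file.bin", std::ios::binary);
    while (in)
    {
        in.read(data, sizeof data);
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            sink = data[i];                    // "do something" with every byte read
    }

    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}

Run each variant on both a cold and a warm page cache; as the answers here note, the cache state tends to dominate the result.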

I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache...

In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:

  1. There is random access (not sequential) within the file, AND
  2. the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
  3. OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.

(btw - I love mmap()/MapViewOfFile()).

Answered by Ben Collins

mmap is way faster. You might write a simple benchmark to prove it to yourself:

#include <fstream>

char data[0x1000];
std::ifstream in("file.bin", std::ios::binary);

while (in)
{
  in.read(data, 0x1000);
  // do something with the in.gcount() bytes just read
}

versus:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

const off_t file_size = something;   // the real size, obtained elsewhere (e.g. via fstat())
const size_t page_size = 0x1000;
off_t off = 0;
void *data;

int fd = open("filename.bin", O_RDONLY);

while (off < file_size)
{
  data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
  // do stuff with data
  munmap(data, page_size);
  off += page_size;
}

Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.

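One of those omitted details, the tail of a file whose size isn't a multiple of page_size, can be handled by asking for the size up front with fstat() and clamping the last chunk; a sketch along those lines (error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("filename.bin", O_RDONLY);

struct stat st;
fstat(fd, &st);                       // st.st_size is the actual file size
const off_t file_size = st.st_size;
const size_t page_size = 0x1000;

for (off_t off = 0; off < file_size; off += page_size)
{
    size_t len = (file_size - off < (off_t)page_size) ? (size_t)(file_size - off) : page_size;
    void *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    // only the first 'len' bytes of the mapping hold file data
    munmap(data, len);
}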

If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(

Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.

Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.

Edit to clean up answer list: @jbl:

the sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.

Answered by BeeOnRope

There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.

mmap seems like magic

Taking the case where the file is already fully cached [1] as the baseline [2], mmap might seem pretty much like magic:

  1. mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
  2. mmap doesn't require a copy of the file data from kernel to user-space.
  3. mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.

In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.

Well, it can.

mmap is not actually magic because...

mmap still does per-page work

A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page in user-space, even though it might be hidden by the page-fault mechanism.

For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the hundreds of nanoseconds in the best case.

mmap relies heavily on TLB performance

Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now [3]. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows) [4].

Finally, even in user-space, accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmap-ing a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.

Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching hierarchy performs (b) how well hardware prefetch deals with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!

read() avoids these pitfalls

The read() syscall, which is what generally underlies the "block read" type calls offered, e.g., in C, C++ and other languages, has one primary disadvantage that everyone is well aware of:

  • Every read() call of N bytes must copy N bytes from kernel to user space.

On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use it repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.

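A sketch of that pattern with the raw syscall - one heap buffer, reused for every read() (the buffer size and the per-chunk processing are placeholders):

#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

const size_t BUF_SIZE = 64 * 1024;               // one reusable buffer
char *buf = (char *)malloc(BUF_SIZE);

int fd = open("filename.bin", O_RDONLY);
ssize_t n;
while ((n = read(fd, buf, BUF_SIZE)) > 0)
{
    // process the n bytes in buf; the same buffer (and its TLB entries) is reused every iteration
}
close(fd);
free(buf);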

So basically you have the following comparison to determine which is faster for a single read of a large file:

Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?

On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.

In particular, the mmap approach becomes relatively faster when:

  • The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
  • The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
  • The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.

... while the read() approach becomes relatively faster when:

  • The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
  • The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
  • The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.

The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).

The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

  • Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
  • Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really helps the read() case.

Update after Spectre and Meltdown

The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost, due to the need to reload TLB entries.

All of this is a relative disadvantage for read()-based methods as compared to mmap-based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, since using large buffers usually performs worse: you exceed the L1 size and hence are constantly suffering cache misses.

read()基于方法相比,所有这些都是基于方法的相对劣势mmap,因为read()方法必须为每个“缓冲区大小”的数据进行一次系统调用。您不能任意增加缓冲区大小来分摊此成本,因为使用大缓冲区通常性能更差,因为您超过了 L1 大小并因此不断遭受缓存未命中。

On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.

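A sketch of that single-call approach on Linux (MAP_POPULATE is a Linux-specific flag, and this assumes the file fits comfortably in RAM):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("filename.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);

// One syscall maps the whole file and pre-faults the page tables.
char *p = (char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

// ... scan p[0] .. p[st.st_size - 1] with no further syscalls ...

munmap(p, st.st_size);
close(fd);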



[1] This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though, because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in [2].

[2] ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application-level changes you can make to improve access patterns).

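For reference, the hint calls being referred to look like this (a sketch; fd, addr and length would come from your own open()/mmap() calls, and the right advice value depends entirely on your access pattern):

#include <fcntl.h>
#include <sys/mman.h>

// read()-style access: tell the kernel the whole file will be read sequentially
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

// mmap()-style access: the same kind of hint for a mapped range
madvise(addr, length, MADV_SEQUENTIAL);          // or MADV_RANDOM / MADV_WILLNEED, etc.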

[3] You could get around that, for example, by sequentially mmap-ing windows of a smaller size, say 100 MB.

[4] In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.

Answered by mlbrock

I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.

If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can unmap() it, and move on to the next.

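A sketch of that rounding arithmetic (fd, record_offset and record_length are assumed to come from your own index of records; mmap itself rounds the mapping length up to whole pages):

#include <sys/mman.h>
#include <unistd.h>

const long page = sysconf(_SC_PAGESIZE);

off_t  map_off = (record_offset / page) * page;        // round down to a page boundary
size_t shift   = (size_t)(record_offset - map_off);    // where the record starts inside the mapping
size_t map_len = shift + record_length;

char *base   = (char *)mmap(NULL, map_len, PROT_READ, MAP_PRIVATE, fd, map_off);
char *record = base + shift;
// ... process the record ...
munmap(base, map_len);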

This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

Answered by Leon Timmermans

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.

Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it out in one pass and then be done with it, that could potentially be much faster.

Answered by Douglas Leeder

Perhaps you should pre-process the files, so that each record is in a separate file (or at least so that each file is an mmap-able size).

Also, could you do all of the processing steps for each record before moving on to the next one? Maybe that would avoid some of the IO overhead?

Answered by paxos1977

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counter-example be somewhat optimized?

Ben Collins wrote:

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
    in.read(data, 0x1000);
    // do something with data 
}

I would suggest also trying:

char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream  in( ifile.rdbuf() );

while( in )
{
    in.read( data, 0x1000);
    // do something with data
}

And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.

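On POSIX systems the page size can be queried rather than hard-coded; a sketch of the same loop with a page-sized buffer (sysconf is POSIX, not standard C++):

#include <fstream>
#include <unistd.h>
#include <vector>

const long page = sysconf(_SC_PAGESIZE);         // often 4096, but not guaranteed
std::vector<char> data(page);

std::ifstream ifile("file.bin", std::ios::binary);
std::istream  in(ifile.rdbuf());

while (in)
{
    in.read(data.data(), data.size());
    // do something with the in.gcount() bytes read
}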

Answered by mike

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

Answered by mike

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers. So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructor calls. For such kinds of tasks, mmap is unbeatable compared to de-serialization. Of course one should look into Boost's relocatable pointers for this.

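He is presumably referring to something like Boost.Interprocess's offset_ptr, which stores a self-relative offset instead of an absolute address, so links between nodes stay valid no matter where the file happens to be mapped; a minimal sketch of such a node:

#include <boost/interprocess/offset_ptr.hpp>

// Nodes laid out inside the mapped region: the "pointers" are stored as offsets,
// so the tree is usable at whatever address the file gets mapped.
struct Node
{
    int value;
    boost::interprocess::offset_ptr<Node> left;
    boost::interprocess::offset_ptr<Node> right;
};

For this to work, the nodes themselves must also be allocated inside the mapped region (Boost.Interprocess provides allocators for that).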