C++: Why is memmove faster than memcpy?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/28623895/


Why is memmove faster than memcpy?

Tags: c++, c, performance, memory

Asked by cruppstahl

I am investigating performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make space for the inserted value.

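The insertion pattern in question looks roughly like the sketch below (hypothetical names, not the application's actual code): a binary search finds the insertion point, memmove shifts the tail one slot to the right, and the new value is written into the gap.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Insert `value` into the sorted range arr[0..count), assuming the buffer
    // has room for at least count + 1 elements. Returns the new element count.
    size_t sorted_insert(uint32_t *arr, size_t count, uint32_t value) {
        uint32_t *pos = std::lower_bound(arr, arr + count, value);
        // Shift the tail one slot to the right. Source and destination overlap,
        // which is why memmove (not memcpy) is used here.
        memmove(pos + 1, pos, (arr + count - pos) * sizeof(uint32_t));
        *pos = value;
        return count + 1;
    }

With millions of inserts, almost all of the work ends up in that memmove call, which is consistent with the 50% figure from the profiler.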

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it's moving overlapping regions, which must be implemented in a tight loop, instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there was a performance difference between memcpy and memmove, expecting memcpy to win hands down.


I ran my benchmark on two machines (core i5, core i7) and saw that memmove is actually faster than memcpy, on the older core i7 even nearly twice as fast! Now I am looking for explanations.


Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination overlap. Various "distances" between source and destination are tried. Each test is run 10 times and the average time is printed.


https://gist.github.com/cruppstahl/78a57cdf937bca3d062c

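The gist above contains the full program; in outline it does something like the following simplified sketch (buffer size, run count and the timer are chosen here for illustration and may differ from the gist):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    static const size_t BUFFERSIZE = 100 * 1024 * 1024;  // roughly 100 MB
    static const int RUNS = 10;

    int main() {
        std::vector<char> src(BUFFERSIZE + 128), dst(BUFFERSIZE);

        // memcpy: source and destination are two distinct buffers.
        double total = 0;
        for (int i = 0; i < RUNS; i++) {
            auto t0 = std::chrono::steady_clock::now();
            memcpy(dst.data(), src.data(), BUFFERSIZE);
            auto t1 = std::chrono::steady_clock::now();
            total += std::chrono::duration<double>(t1 - t0).count();
        }
        printf("memcpy        %g\n", total / RUNS);

        // memmove: shift one buffer onto itself by a small "distance",
        // so source and destination overlap.
        const size_t distances[] = {2, 4, 8, 16, 32, 64, 128};
        for (size_t dist : distances) {
            total = 0;
            for (int i = 0; i < RUNS; i++) {
                auto t0 = std::chrono::steady_clock::now();
                memmove(src.data() + dist, src.data(), BUFFERSIZE);
                auto t1 = std::chrono::steady_clock::now();
                total += std::chrono::duration<double>(t1 - t0).count();
            }
            printf("memmove (%03zu) %g\n", dist, total / RUNS);
        }
        return 0;
    }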

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux; gcc 4.6.3, Ubuntu/Linaro 4.6.3-1ubuntu5). The number in brackets is the distance (gap size) between source and destination:


memcpy        0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633

Memmove is implemented as SSE-optimized assembly code, copying from back to front. It uses hardware prefetch to load the data into the cache, copies 128 bytes at a time into XMM registers, and then stores them at the destination.


(memcpy-ssse3-back.S, lines 1650 ff)


L(gobble_ll_loop):
    prefetchnta -0x1c0(%rsi)
    prefetchnta -0x280(%rsi)
    prefetchnta -0x1c0(%rdi)
    prefetchnta -0x280(%rdi)
    sub $0x80, %rdx
    movdqu -0x10(%rsi), %xmm1
    movdqu -0x20(%rsi), %xmm2
    movdqu -0x30(%rsi), %xmm3
    movdqu -0x40(%rsi), %xmm4
    movdqu -0x50(%rsi), %xmm5
    movdqu -0x60(%rsi), %xmm6
    movdqu -0x70(%rsi), %xmm7
    movdqu -0x80(%rsi), %xmm8
    movdqa %xmm1, -0x10(%rdi)
    movdqa %xmm2, -0x20(%rdi)
    movdqa %xmm3, -0x30(%rdi)
    movdqa %xmm4, -0x40(%rdi)
    movdqa %xmm5, -0x50(%rdi)
    movdqa %xmm6, -0x60(%rdi)
    movdqa %xmm7, -0x70(%rdi)
    movdqa %xmm8, -0x80(%rdi)
    lea -0x80(%rsi), %rsi
    lea -0x80(%rdi), %rdi
    jae L(gobble_ll_loop)

Why is memmove faster than memcpy? I would expect memcpy to copy memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.


PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.


UPDATE 1


I ran some variations of the tests, based on the various answers.


  1. When memcpy is run twice, the second run is faster than the first.
  2. When the destination buffer of memcpy is "touched" first (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster.
  3. memcpy is still a little bit slower than memmove.

Here are the results:


memcpy        0.0118526
memcpy        0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this, then please add an answer!). In addition, as @Mats Petersson said, memmove is more cache-friendly than memcpy.

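One way to look for that first-touch effect (a hedged sketch, not a proof; the buffer size and allocator behaviour are assumptions) is to time memcpy into a freshly allocated destination, then into the same destination again, and then into a destination that was pre-faulted with memset. If the commit-on-first-write explanation is right, only the first copy should pay the page-fault cost:

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    static const size_t N = 100 * 1024 * 1024;

    static double time_copy(void *dst, const void *src, size_t n) {
        auto t0 = std::chrono::steady_clock::now();
        memcpy(dst, src, n);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        char *src = (char *)malloc(N);
        memset(src, 1, N);                  // make sure the source pages are faulted in

        char *cold = (char *)malloc(N);     // destination pages not yet committed
        printf("first copy (cold dst)   %g\n", time_copy(cold, src, N));
        printf("second copy (same dst)  %g\n", time_copy(cold, src, N));

        char *warm = (char *)malloc(N);
        memset(warm, 0, N);                 // pre-fault the destination, as in update point 2
        printf("copy to pre-touched dst %g\n", time_copy(warm, src, N));

        free(src);
        free(cold);
        free(warm);
        return 0;
    }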

Thanks for all the great answers and comments!


Accepted answer by Tony Delroy

Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see memcpy ends up possibly a smidge faster, e.g. on ideone.com:


memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919

Hardly anything in it though - no evidence that writing back to an already-faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that there's nothing making memcpy unnecessarily slower when it's compared apples-for-apples.


Answered by Mats Petersson

When you are using memcpy, the writes need to go into the cache. When you use memmove where you are copying a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes away (> 4 * cache size), and I suspect (but can't be bothered to test) that you'll get similar results.

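The variation suggested here might look like the sketch below (the 64 MiB offset is an arbitrary choice, just something well above typical last-level cache sizes): with a tiny offset every destination cache line was read as source only a few bytes earlier and is still hot, while with a huge offset the destination lines have long been evicted before they are written.

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    static double time_memmove(char *dst, const char *src, size_t n) {
        auto t0 = std::chrono::steady_clock::now();
        memmove(dst, src, n);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const size_t n   = 100 * 1024 * 1024;   // ~100 MB payload
        const size_t far = 64 * 1024 * 1024;    // destination far outside any cache
        std::vector<char> buf(n + far);
        memset(buf.data(), 1, buf.size());      // touch every page up front so page
                                                // faults don't dominate the timings

        printf("memmove, dst 16 bytes ahead: %g\n",
               time_memmove(buf.data() + 16, buf.data(), n));
        printf("memmove, dst 64 MiB ahead:   %g\n",
               time_memmove(buf.data() + far, buf.data(), n));
        return 0;
    }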

I guarantee that it is ALL about cache maintenance when you do large memory operations.


Answered by user3710044

Historically, memmove and memcpy were the same function: they worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping areas in any particular way.


The end result is that memmove was defined to handle overlapping regions in a particular way, even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.

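A textbook-style illustration of that contract (a naive sketch; real implementations copy in wide, aligned blocks, as the SSE loop quoted in the question shows): memmove has to pick a copy direction so that overlapping bytes are never overwritten before they are read, while memcpy is free to assume the ranges are disjoint.

    #include <cstddef>

    // Naive memmove for illustration only.
    void *my_memmove(void *dst, const void *src, size_t n) {
        unsigned char *d = (unsigned char *)dst;
        const unsigned char *s = (const unsigned char *)src;
        if (d < s) {
            for (size_t i = 0; i < n; i++)          // copy front to back
                d[i] = s[i];
        } else if (d > s) {
            for (size_t i = n; i > 0; i--)          // copy back to front
                d[i - 1] = s[i - 1];
        }
        return dst;
    }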

The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance, something as simple as a different 'stride' in the memory layout can cause vastly different cache performance.


You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.


Edit: Oh, and one last thing; shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay)


Edit2: To summarise my expansion in the comments: the microbenchmark is the issue here; it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy, the end result will not depend on which memory-shifting function you use UNLESS the regions overlap.


Answered by Ehsan

"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.


In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are overlapping.

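A small example of the distinction (a hedged sketch with made-up names): removing the first element of an array shifts the remaining elements over their own old positions, so the ranges overlap and memmove is required; copying into a separate buffer has no overlap, so memcpy is fine.

    #include <cstdint>
    #include <cstring>

    void backup_and_pop_front(uint32_t *arr, size_t count, uint32_t *backup) {
        // Distinct buffers, no overlap: memcpy is allowed (and the usual choice).
        memcpy(backup, arr, count * sizeof(uint32_t));

        // Shift arr[1..count) down one slot. Source and destination overlap,
        // so only memmove has defined behaviour here.
        memmove(arr, arr + 1, (count - 1) * sizeof(uint32_t));
    }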

Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g (Dr. Jerry Cain, Stanford Intro Systems Lecture 7, at 36:00).
