memcpy performance differences between 32 and 64 bit processes
Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; you are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/269408/
Asked by timday
We have Core2 machines (Dell T5400) with XP64.
We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or 2.4GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.
My question is, what's this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?
Thanks for any insight.
Also raised on Intel forums.
Accepted answer by Die in Sente
Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.
My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; rather, the faster library routine was written using SSE non-temporal stores.
If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.
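To make the idea concrete, here is a minimal sketch of such a non-temporal copy loop using SSE2 intrinsics (my own illustration, not the CRT's actual code; it assumes 16-byte-aligned pointers and a byte count that is a multiple of 16):

#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>

/* Copy n bytes using non-temporal ("streaming") stores. */
static void copy_streaming(void *dst, const void *src, size_t n)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_load_si128(s + i); /* normal cached load (movdqa)      */
        _mm_stream_si128(d + i, v);        /* cache-bypassing store (movntdq)  */
    }
    _mm_sfence(); /* make the streaming stores globally visible */
}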
I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...
Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.
Answered by Mecki
I think the following can explain it:
To copy data from memory to a register and back to memory, you do
mov eax, [address]
mov [address2], eax
This moves 32 bits (4 bytes) from address to address2. The same goes for 64 bits in 64-bit mode:
mov rax, [address]
mov [address2], rax
This moves 64 bits (8 bytes) from address to address2. "mov" itself, regardless of whether it is 64 bit or 32 bit, has a latency of 0.5 and a throughput of 0.5 according to Intel's specs. Latency is how many clock cycles the instruction takes to travel through the pipeline, and throughput is how long the CPU has to wait before accepting the same instruction again. As you can see, it can do two mov's per clock cycle; however, it has to wait half a clock cycle between two mov's, thus it can effectively only do one mov per clock cycle (or am I wrong here and misinterpreting the terms? See the PDF here for details).
Of course a mov reg, mem can take longer than 0.5 cycles, depending on whether the data is in first- or second-level cache, or not in cache at all and needs to be fetched from memory. However, the latency figure above ignores this fact (as the PDF I linked states); it assumes all the data necessary for the mov is already present (otherwise the latency increases by however long it takes to fetch the data from wherever it is right now; this might be several clock cycles and is completely independent of the instruction being executed, says the PDF on page 482/C-30).
What is interesting is that whether the mov is 32 or 64 bit plays no role. That means unless the memory bandwidth becomes the limiting factor, 64-bit mov's are equally fast as 32-bit mov's, and since it takes only half as many mov's to move the same amount of data from A to B when using 64 bit, the throughput can (in theory) be twice as high (the fact that it's not is probably because memory is not unlimitedly fast).
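In C terms, the same point looks roughly like this (a sketch of my own, assuming aligned pointers and a byte count that is a multiple of 8); the 64-bit loop issues half as many moves for the same data:

#include <stdint.h>
#include <stddef.h>

/* 32-bit-style copy: one 4-byte load/store pair per iteration. */
void copy32(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / 4; ++i)
        dst[i] = src[i];
}

/* 64-bit-style copy: half as many iterations for the same amount of data. */
void copy64(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / 8; ++i)
        dst[i] = src[i];
}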
Okay, now you think when using the larger SSE registers, you should get faster throughput, right? AFAIK the xmm registers are not 256 but 128 bit wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? The data you want to move is either 128-bit aligned or it isn't. Depending on that, you move it using either
movdqa xmm1, [address]
movdqa [address2], xmm1
or if not aligned
movdqu xmm1, [address]
movdqu [address2], xmm1
Well, movdqa/movdqu has a latency of 1 and a throughput of 1. So the instructions take twice as long to be executed, and the waiting time after the instructions is twice as long as for a normal mov.
And something else we have not even taken into account is the fact that the CPU actually splits instructions into micro-ops and it can execute these in parallel. Now it starts getting really complicated... even too complicated for me.
Anyway, I know from experience loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.
Answered by timday
I finally got to the bottom of this (Die in Sente's answer was on the right lines, thanks).
In the below, dst and src are 512 MByte std::vectors. I'm using the Intel 10.1.029 compiler and CRT.
On 64bit both
memcpy(&dst[0],&src[0],dst.size())
and
memcpy(&dst[0],&src[0],N)
where N is previously declared const size_t N=512*(1<<20);
call
__intel_fast_memcpy
the bulk of which consists of:
000000014004ED80 lea rcx,[rcx+40h]
000000014004ED84 lea rdx,[rdx+40h]
000000014004ED88 lea r8,[r8-40h]
000000014004ED8C prefetchnta [rdx+180h]
000000014004ED93 movdqu xmm0,xmmword ptr [rdx-40h]
000000014004ED98 movdqu xmm1,xmmword ptr [rdx-30h]
000000014004ED9D cmp r8,40h
000000014004EDA1 movntdq xmmword ptr [rcx-40h],xmm0
000000014004EDA6 movntdq xmmword ptr [rcx-30h],xmm1
000000014004EDAB movdqu xmm2,xmmword ptr [rdx-20h]
000000014004EDB0 movdqu xmm3,xmmword ptr [rdx-10h]
000000014004EDB5 movntdq xmmword ptr [rcx-20h],xmm2
000000014004EDBA movntdq xmmword ptr [rcx-10h],xmm3
000000014004EDBF jge 000000014004ED80
and runs at ~2200 MByte/s.
But on 32bit
memcpy(&dst[0],&src[0],dst.size())
calls
__intel_fast_memcpy
the bulk of which consists of
004447A0 sub ecx,80h
004447A6 movdqa xmm0,xmmword ptr [esi]
004447AA movdqa xmm1,xmmword ptr [esi+10h]
004447AF movdqa xmmword ptr [edx],xmm0
004447B3 movdqa xmmword ptr [edx+10h],xmm1
004447B8 movdqa xmm2,xmmword ptr [esi+20h]
004447BD movdqa xmm3,xmmword ptr [esi+30h]
004447C2 movdqa xmmword ptr [edx+20h],xmm2
004447C7 movdqa xmmword ptr [edx+30h],xmm3
004447CC movdqa xmm4,xmmword ptr [esi+40h]
004447D1 movdqa xmm5,xmmword ptr [esi+50h]
004447D6 movdqa xmmword ptr [edx+40h],xmm4
004447DB movdqa xmmword ptr [edx+50h],xmm5
004447E0 movdqa xmm6,xmmword ptr [esi+60h]
004447E5 movdqa xmm7,xmmword ptr [esi+70h]
004447EA add esi,80h
004447F0 movdqa xmmword ptr [edx+60h],xmm6
004447F5 movdqa xmmword ptr [edx+70h],xmm7
004447FA add edx,80h
00444800 cmp ecx,80h
00444806 jge 004447A0
and runs at ~1350 MByte/s only.
HOWEVER
memcpy(&dst[0],&src[0],N)
where N is previously declared const size_t N=512*(1<<20);
compiles (on 32bit) to a direct call to a
__intel_VEC_memcpy
the bulk of which consists of
0043FF40 movdqa xmm0,xmmword ptr [esi]
0043FF44 movdqa xmm1,xmmword ptr [esi+10h]
0043FF49 movdqa xmm2,xmmword ptr [esi+20h]
0043FF4E movdqa xmm3,xmmword ptr [esi+30h]
0043FF53 movntdq xmmword ptr [edi],xmm0
0043FF57 movntdq xmmword ptr [edi+10h],xmm1
0043FF5C movntdq xmmword ptr [edi+20h],xmm2
0043FF61 movntdq xmmword ptr [edi+30h],xmm3
0043FF66 movdqa xmm4,xmmword ptr [esi+40h]
0043FF6B movdqa xmm5,xmmword ptr [esi+50h]
0043FF70 movdqa xmm6,xmmword ptr [esi+60h]
0043FF75 movdqa xmm7,xmmword ptr [esi+70h]
0043FF7A movntdq xmmword ptr [edi+40h],xmm4
0043FF7F movntdq xmmword ptr [edi+50h],xmm5
0043FF84 movntdq xmmword ptr [edi+60h],xmm6
0043FF89 movntdq xmmword ptr [edi+70h],xmm7
0043FF8E lea esi,[esi+80h]
0043FF94 lea edi,[edi+80h]
0043FF9A dec ecx
0043FF9B jne ___intel_VEC_memcpy+244h (43FF40h)
and runs at ~2100 MByte/s (proving 32bit isn't somehow bandwidth limited).
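A quick sanity check on those numbers (my own back-of-envelope reasoning, not from the original post): with conventional stores the destination line makes a round trip through the cache (read in, then written back), so each copied byte costs roughly three memory transfers (read source, read destination, write destination) versus two with streaming stores. That predicts about 2100 × 2/3 ≈ 1400 MByte/s for the movdqa version, close to the observed ~1350 MByte/s.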
I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32bit builds; I now don't have any problems getting >2GByte/s on 32 or 64bit; the trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).
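For reference, a minimal sketch of that trick with the intrinsic named above, similar in spirit to the integer version sketched in the accepted answer (my own illustration; assumes 16-byte-aligned float buffers and a count that is a multiple of 4):

#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence */
#include <stddef.h>

/* Copy count floats using streaming (non-temporal) stores. */
void stream_copy(float *dst, const float *src, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(src + i); /* cached load               */
        _mm_stream_ps(dst + i, v);       /* store that bypasses cache */
    }
    _mm_sfence(); /* order the streaming stores before any later reads */
}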
It seems a bit strange that the 32bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version (if you step into memcpy there is the most incredible amount of CPUID checking and heuristic logic, e.g. comparing the number of bytes to be copied with the cache size, etc. before it goes anywhere near your actual data), but at least I understand the observed behaviour now (and it's not SysWow64 or H/W related).
Answered by Die in Sente
Thanks for the positive feedback! I think I can partly explain what's going on here.
Using the non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.
On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.
So if you're designing a runtime library and you can assume that the application that called memcpy is going to use the destination buffer immediately after the memcpy call, then you'll want to provide the movdqa version. This effectively optimizes away the trip from memory back into the CPU that would follow the movntdq version, and all of the instructions following the call will run faster.
But on the other hand, if the destination buffer is large compared to the processor's cache, that optimization doesn't work and the movntdq version would give you faster application benchmarks.
So ideally memcpy would have multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; when it is large compared to the processor's cache, use movntdq. It sounds like this is what's happening in the 32-bit library.
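A minimal sketch of that dispatch idea (entirely hypothetical: the threshold value and the helper name are mine, not the real library's):

#include <string.h>
#include <stddef.h>

#define CACHE_THRESHOLD (2u * 1024 * 1024) /* assume a ~2 MB cache */

/* Hypothetical streaming copy, e.g. a movntdq loop like the one above. */
void *memcpy_streaming(void *dst, const void *src, size_t n);

void *smart_memcpy(void *dst, const void *src, size_t n)
{
    if (n < CACHE_THRESHOLD)
        return memcpy(dst, src, n);        /* cached (movdqa-style) path */
    return memcpy_streaming(dst, src, n);  /* cache-bypassing path */
}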
Of course, none of this has anything to do with the differences between 32-bit and 64-bit.
My conjecture is that the 64-bit library just isn't as mature. The developers just haven't gotten around to providing both routines in that version of library yet.
Answered by Harper Shelby
My off-the-cuff guess is that the 64 bit processes are using the processor's native 64-bit memory size, which optimizes the use of the memory bus.
Answered by Brian Knoblauch
I don't have a reference in front of me, so I'm not absolutely positive on the timings/instructions, but I can still give the theory. If you're doing a memory move under 32-bit mode, you'll do something like a "rep movsd" which moves a single 32-bit value every clock cycle. Under 64-bit mode, you can do a "rep movsq" which does a single 64-bit move every clock cycle. That instruction is not available to 32-bit code, so you'd be doing 2 x rep movsd (at 1 cycle a piece) for half the execution speed.
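On MSVC-family compilers there are intrinsics that emit exactly those string-move instructions; a sketch under the assumption that the byte count is a multiple of the word size (the function name is mine):

#include <intrin.h>   /* MSVC: __movsd / __movsq */
#include <stddef.h>

void copy_rep(void *dst, const void *src, size_t bytes)
{
#ifdef _M_X64
    /* rep movsq: one 8-byte move per iteration (x64 only) */
    __movsq((unsigned long long *)dst,
            (const unsigned long long *)src, bytes / 8);
#else
    /* rep movsd: one 4-byte move per iteration */
    __movsd((unsigned long *)dst,
            (const unsigned long *)src, bytes / 4);
#endif
}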
VERY much simplified, ignoring all the memory bandwidth/alignment issues, etc, but this is where it all begins...
Answered by GodLikeMOuse
Here's an example of a memcpy routine geared specifically for 64 bit architecture.
#include <stddef.h>
#include <stdint.h>

void uint8copy(void *dest, void *src, size_t n){
    uint64_t *ss = (uint64_t *)src;
    uint64_t *dd = (uint64_t *)dest;
    /* convert the byte count to a count of 64-bit words; assumes n is a
       multiple of 8 and both pointers are 8-byte aligned */
    n = n * sizeof(uint8_t)/sizeof(uint64_t);
    while(n--)
        *dd++ = *ss++;
}//end uint8copy()
The full article is here: http://www.godlikemouse.com/2008/03/04/optimizing-memcpy-routines/