memcpy performance differences between 32-bit and 64-bit processes on Windows

Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/269408/

memcpy performance differences between 32 and 64 bit processes

Tags: windows, memory, 64-bit, cpu, 32-bit

Asked by timday

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or 2.4GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.

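For concreteness, a minimal sketch of how such a bandwidth measurement might look on Windows (this is not the exact harness used; the buffer size and the use of QueryPerformanceCounter here are assumptions for illustration):

#include <windows.h>
#include <string.h>
#include <stdio.h>
#include <vector>

int main()
{
    const size_t N = 512 * (1 << 20);   // 512 MByte buffers, as in the answers below
    std::vector<char> src(N), dst(N);   // note: two buffers this size may not fit a 32-bit address space

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    memcpy(&dst[0], &src[0], N);        // the call being timed
    QueryPerformanceCounter(&t1);

    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    printf("%.0f MByte/s\n", (N / (1024.0 * 1024.0)) / seconds);
    return 0;
}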

My question is, what's this difference actually due to ? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM ? Is it something to do with TLBs or prefetchers or... what ?

Thanks for any insight.

Also raised on Intel forums.

Accepted answer by Die in Sente

Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.

My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.

If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.

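To make the idea concrete, here is a hedged sketch (not the CRT's actual code) of a copy loop that uses non-temporal stores via SSE2 intrinsics; it assumes the pointers are 16-byte aligned and the size is a multiple of 16 bytes:

#include <emmintrin.h>   // SSE2 intrinsics: _mm_load_si128, _mm_stream_si128, _mm_sfence
#include <stddef.h>

// Assumes dst and src are 16-byte aligned and n is a multiple of 16.
void copy_streaming(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));  // normal (cached) load
        _mm_stream_si128((__m128i *)(dst + i), v);                // non-temporal store (movntdq)
    }
    _mm_sfence();  // order the streaming stores before returning
}

The destination lines are written straight to memory without first being read into the cache, which is exactly the one-way trip described above.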

I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...

Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.

Answer by Mecki

I think the following can explain it:

To copy data from memory to a register and back to memory, you do

mov eax, [address]
mov [address2], eax

This moves 32 bits (4 bytes) from address to address2. The same goes for 64 bits in 64-bit mode:

mov rax, [address]
mov [address2], rax

This moves 64 bits (8 bytes) from address to address2. "mov" itself, regardless of whether it is 64-bit or 32-bit, has a latency of 0.5 and a throughput of 0.5 according to Intel's specs. Latency is how many clock cycles the instruction takes to travel through the pipeline, and throughput is how long the CPU has to wait before accepting the same instruction again. As you can see, it can do two mov's per clock cycle; however, it has to wait half a clock cycle between two mov's, so it can effectively only do one mov per clock cycle (or am I wrong here and misinterpreting the terms? See the PDF here for details).

Of course a mov reg, mem can take longer than 0.5 cycles, depending on whether the data is in 1st or 2nd level cache, or not in cache at all and needs to be fetched from memory. However, the latency figure above ignores this fact (as the PDF I linked above states); it assumes all data necessary for the mov is already present (otherwise the latency increases by however long it takes to fetch the data from wherever it currently is - this might be several clock cycles and is completely independent of the instruction being executed, says the PDF on page 482/C-30).

What is interesting is that whether the mov is 32- or 64-bit plays no role. That means that unless memory bandwidth becomes the limiting factor, 64-bit mov's are just as fast as 32-bit mov's, and since it takes only half as many mov's to move the same amount of data from A to B when using 64 bits, the throughput can (in theory) be twice as high (the fact that it isn't is probably because memory is not unlimitedly fast).

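As a hedged illustration of "half as many mov's" (the function names are made up for this sketch; n is the byte count and is assumed to be a multiple of 8):

#include <stdint.h>
#include <stddef.h>

// 32-bit words: n/4 loads and n/4 stores for n bytes.
void copy32(uint32_t *dst, const uint32_t *src, size_t n) {
    for (size_t i = 0; i < n / 4; ++i) dst[i] = src[i];
}

// 64-bit words (64-bit build): half as many loads and stores for the same n bytes.
void copy64(uint64_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n / 8; ++i) dst[i] = src[i];
}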

Okay, now you might think that when using the larger SSE registers you should get faster throughput, right? AFAIK the xmm registers are not 256 but 128 bits wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? The data you want to move is either 128-bit aligned or it isn't. Depending on that, you move it using either

movdqa xmm1, [address]
movdqa [address2], xmm1

or if not aligned

movdqu xmm1, [address]
movdqu [address2], xmm1

Well, movdqa/movdqu has a latency of 1 and a throughput of 1. So the instructions take twice as long to execute, and the waiting time after each instruction is twice as long as for a normal mov.

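For reference, a hedged intrinsics rendering of the two variants above (the wrapper names are made up); the aligned form is only legal when both pointers really are 16-byte aligned:

#include <emmintrin.h>

// Aligned: compiles to movdqa loads/stores; faults if p or q is not 16-byte aligned.
void move16_aligned(void *q, const void *p) {
    _mm_store_si128((__m128i *)q, _mm_load_si128((const __m128i *)p));
}

// Unaligned: compiles to movdqu loads/stores; works for any alignment.
void move16_unaligned(void *q, const void *p) {
    _mm_storeu_si128((__m128i *)q, _mm_loadu_si128((const __m128i *)p));
}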

And something else we have not even taken into account is the fact that the CPU actually splits instructions into micro-ops and it can execute these in parallel. Now it starts getting really complicated... even too complicated for me.

Anyway, I know from experience loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.

Answer by timday

I finally got to the bottom of this (and Die in Sente's answer was on the right lines, thanks)

In the below, dst and src are 512 MByte std::vectors. I'm using the Intel 10.1.029 compiler and CRT.

On 64bit both

memcpy(&dst[0],&src[0],dst.size())

and

memcpy(&dst[0],&src[0],N)

where N is previously declared as const size_t N=512*(1<<20); call

__intel_fast_memcpy

the bulk of which consists of:

  000000014004ED80  lea         rcx,[rcx+40h] 
  000000014004ED84  lea         rdx,[rdx+40h] 
  000000014004ED88  lea         r8,[r8-40h] 
  000000014004ED8C  prefetchnta [rdx+180h] 
  000000014004ED93  movdqu      xmm0,xmmword ptr [rdx-40h] 
  000000014004ED98  movdqu      xmm1,xmmword ptr [rdx-30h] 
  000000014004ED9D  cmp         r8,40h 
  000000014004EDA1  movntdq     xmmword ptr [rcx-40h],xmm0 
  000000014004EDA6  movntdq     xmmword ptr [rcx-30h],xmm1 
  000000014004EDAB  movdqu      xmm2,xmmword ptr [rdx-20h] 
  000000014004EDB0  movdqu      xmm3,xmmword ptr [rdx-10h] 
  000000014004EDB5  movntdq     xmmword ptr [rcx-20h],xmm2 
  000000014004EDBA  movntdq     xmmword ptr [rcx-10h],xmm3 
  000000014004EDBF  jge         000000014004ED80 

and runs at ~2200 MByte/s.

But on 32bit

memcpy(&dst[0],&src[0],dst.size())

calls

__intel_fast_memcpy

the bulk of which consists of

  004447A0  sub         ecx,80h 
  004447A6  movdqa      xmm0,xmmword ptr [esi] 
  004447AA  movdqa      xmm1,xmmword ptr [esi+10h] 
  004447AF  movdqa      xmmword ptr [edx],xmm0 
  004447B3  movdqa      xmmword ptr [edx+10h],xmm1 
  004447B8  movdqa      xmm2,xmmword ptr [esi+20h] 
  004447BD  movdqa      xmm3,xmmword ptr [esi+30h] 
  004447C2  movdqa      xmmword ptr [edx+20h],xmm2 
  004447C7  movdqa      xmmword ptr [edx+30h],xmm3 
  004447CC  movdqa      xmm4,xmmword ptr [esi+40h] 
  004447D1  movdqa      xmm5,xmmword ptr [esi+50h] 
  004447D6  movdqa      xmmword ptr [edx+40h],xmm4 
  004447DB  movdqa      xmmword ptr [edx+50h],xmm5 
  004447E0  movdqa      xmm6,xmmword ptr [esi+60h] 
  004447E5  movdqa      xmm7,xmmword ptr [esi+70h] 
  004447EA  add         esi,80h 
  004447F0  movdqa      xmmword ptr [edx+60h],xmm6 
  004447F5  movdqa      xmmword ptr [edx+70h],xmm7 
  004447FA  add         edx,80h 
  00444800  cmp         ecx,80h 
  00444806  jge         004447A0

and runs at ~1350 MByte/s only.

HOWEVER

memcpy(&dst[0],&src[0],N)

where N is previously declared as const size_t N=512*(1<<20); compiles (on 32bit) to a direct call to a

__intel_VEC_memcpy

the bulk of which consists of

  0043FF40  movdqa      xmm0,xmmword ptr [esi] 
  0043FF44  movdqa      xmm1,xmmword ptr [esi+10h] 
  0043FF49  movdqa      xmm2,xmmword ptr [esi+20h] 
  0043FF4E  movdqa      xmm3,xmmword ptr [esi+30h] 
  0043FF53  movntdq     xmmword ptr [edi],xmm0 
  0043FF57  movntdq     xmmword ptr [edi+10h],xmm1 
  0043FF5C  movntdq     xmmword ptr [edi+20h],xmm2 
  0043FF61  movntdq     xmmword ptr [edi+30h],xmm3 
  0043FF66  movdqa      xmm4,xmmword ptr [esi+40h] 
  0043FF6B  movdqa      xmm5,xmmword ptr [esi+50h] 
  0043FF70  movdqa      xmm6,xmmword ptr [esi+60h] 
  0043FF75  movdqa      xmm7,xmmword ptr [esi+70h] 
  0043FF7A  movntdq     xmmword ptr [edi+40h],xmm4 
  0043FF7F  movntdq     xmmword ptr [edi+50h],xmm5 
  0043FF84  movntdq     xmmword ptr [edi+60h],xmm6 
  0043FF89  movntdq     xmmword ptr [edi+70h],xmm7 
  0043FF8E  lea         esi,[esi+80h] 
  0043FF94  lea         edi,[edi+80h] 
  0043FF9A  dec         ecx  
  0043FF9B  jne         ___intel_VEC_memcpy+244h (43FF40h) 

and runs at ~2100 MByte/s (proving 32bit isn't somehow bandwidth limited).

I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32bit builds; I now don't have any problems getting >2GByte/s on 32 or 64bit; the trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).

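For completeness, a hedged sketch of that trick using the _mm_stream_ps intrinsic (the float flavour of the non-temporal stores visible in the disassembly above); it assumes 16-byte aligned buffers and a count that is a multiple of 4 floats:

#include <xmmintrin.h>   // SSE intrinsics: _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <stddef.h>

// Assumes dst and src are 16-byte aligned and nfloats is a multiple of 4.
void stream_copy(float *dst, const float *src, size_t nfloats)
{
    for (size_t i = 0; i < nfloats; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));  // non-temporal store (movntps)
    _mm_sfence();  // order the streaming stores before the data is used elsewhere
}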

It seems a bit strange that the 32bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version (if you step into memcpy there is the most incredible amount of CPUID checking and heuristic logic, e.g. comparing the number of bytes to be copied with cache size, etc. before it goes anywhere near your actual data), but at least I understand the observed behaviour now (and it's not SysWow64 or H/W related).

Answer by Die in Sente

Thanks for the positive feedback! I think I can partly explain what's going on here.

Using non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.

On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.

So if you're designing a runtime library and if you can assume that the application that called memcpy is going to use the destination buffer immediately after the memcpy call, then you'll want to provide the movdqa version. This effectively optimizes out the trip from memory back into the cpu that would follow the movntdq version, and all of the instructions following the call will run faster.

But on the other hand, if the destination buffer is large compared to the processor's cache, that optimization doesn't work and the movntdq version would give you faster application benchmarks.

So the ideal memcpy would have multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; otherwise, when the destination buffer is large compared to the processor's cache, use movntdq. It sounds like this is what's happening in the 32-bit library.

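A hedged sketch of that idea follows; the threshold value is invented for illustration (a real CRT derives it from CPUID), and 16-byte aligned buffers with a size that is a multiple of 16 are assumed:

#include <emmintrin.h>
#include <stddef.h>

// Illustrative only: pick cached vs. non-temporal stores by copy size.
void memcpy_dispatch(void *dst, const void *src, size_t n)
{
    const size_t kCacheThreshold = 2 * 1024 * 1024;  // hypothetical ~L2 size
    char *d = (char *)dst;
    const char *s = (const char *)src;

    if (n < kCacheThreshold) {
        // Small copy: cached (movdqa) stores leave the destination hot in cache.
        for (size_t i = 0; i < n; i += 16)
            _mm_store_si128((__m128i *)(d + i), _mm_load_si128((const __m128i *)(s + i)));
    } else {
        // Large copy: non-temporal (movntdq) stores avoid polluting the cache.
        for (size_t i = 0; i < n; i += 16)
            _mm_stream_si128((__m128i *)(d + i), _mm_load_si128((const __m128i *)(s + i)));
        _mm_sfence();
    }
}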

Of course, none of this has anything to do with the differences between 32-bit and 64-bit.

My conjecture is that the 64-bit library just isn't as mature. The developers just haven't gotten around to providing both routines in that version of the library yet.

Answer by Harper Shelby

My off-the-cuff guess is that the 64 bit processes are using the processor's native 64-bit memory size, which optimizes the use of the memory bus.

Answer by Brian Knoblauch

I don't have a reference in front of me, so I'm not absolutely positive on the timings/instructions, but I can still give the theory. If you're doing a memory move under 32-bit mode, you'll do something like a "rep movsd" which moves a single 32-bit value every clock cycle. Under 64-bit mode, you can do a "rep movsq" which does a single 64-bit move every clock cycle. That instruction is not available to 32-bit code, so you'd be doing 2 x rep movsd (at 1 cycle a piece) for half the execution speed.

VERY much simplified, ignoring all the memory bandwidth/alignment issues, etc, but this is where it all begins...

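On MSVC the rep movs forms are exposed as compiler intrinsics, so a hedged sketch of the two variants might look like this (the wrapper names are made up; __movsq is only available when compiling for x64):

#include <intrin.h>
#include <stddef.h>

// Any build: rep movsd, one 4-byte element per iteration.
void copy_movsd(unsigned long *dst, unsigned long *src, size_t count)
{
    __movsd(dst, src, count);
}

#if defined(_M_X64)
// 64-bit build only: rep movsq, one 8-byte element per iteration.
void copy_movsq(unsigned __int64 *dst, unsigned __int64 *src, size_t count)
{
    __movsq(dst, src, count);
}
#endif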

Answer by GodLikeMOuse

Here's an example of a memcpy routine geared specifically for 64 bit architecture.

#include <stdint.h>
#include <stddef.h>

/* Copies n bytes, 8 bytes at a time; assumes n is a multiple of 8 and
   both pointers are suitably aligned. */
void uint8copy(void *dest, void *src, size_t n){
    uint64_t * ss = (uint64_t *)src;    /* cast to pointer type, not integer */
    uint64_t * dd = (uint64_t *)dest;
    n = n * sizeof(uint8_t)/sizeof(uint64_t);   /* convert byte count to 64-bit word count */

    while(n--)
        *dd++ = *ss++;
}//end uint8copy()

The full article is here: http://www.godlikemouse.com/2008/03/04/optimizing-memcpy-routines/
