C++: Why are memcpy() and memmove() faster than pointer increments?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/7776085/
Why are memcpy() and memmove() faster than pointer increments?
Asked by wanderer
I am copying N bytes from pSrc to pDest. This can be done in a single loop:
for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;
Why is this slower than memcpy or memmove? What tricks do they use to speed it up?
Answered by onemasse
Because memcpy uses word pointers instead of byte pointers. Also, memcpy implementations are often written with SIMD instructions, which makes it possible to shuffle 128 bits at a time.
SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions.
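To make that concrete, here is a minimal sketch of the idea (my illustration, not code from this answer or any library): it moves 16 bytes per iteration with unaligned SSE2 loads and stores. It assumes n is a multiple of 16; a real memcpy would also handle the tail and prefer aligned accesses when it can.

#include <emmintrin.h>  /* SSE2 intrinsics; compile for x86/x86-64 */
#include <stddef.h>

/* Sketch only: copy 16 bytes per iteration with unaligned SSE2
 * loads/stores. Assumes n is a multiple of 16. */
void simd_copy(void* dst, const void* src, size_t n)
{
    char* d = (char*)dst;
    const char* s = (const char*)src;
    for (size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i*)(s + i));
        _mm_storeu_si128((__m128i*)(d + i), chunk);
    }
}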
Answered by Daemin
Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:
void simple_memory_copy(void* dst, void* src, unsigned int bytes)
{
    unsigned char* b_dst = (unsigned char*)dst;
    unsigned char* b_src = (unsigned char*)src;
    for (unsigned int i = 0; i < bytes; ++i)
        *b_dst++ = *b_src++;
}
Improvements
The first improvement one can make is to align one of the pointers on a word boundary (by word I mean native integer size, usually 32 bits/4 bytes, but it can be 64 bits/8 bytes on newer architectures) and use word-sized move/copy instructions. This requires a byte-by-byte copy until a pointer is aligned.
#include <stdint.h> /* for uintptr_t */

void aligned_memory_copy(void* dst, void* src, unsigned int bytes)
{
    unsigned char* b_dst = (unsigned char*)dst;
    unsigned char* b_src = (unsigned char*)src;

    // Copy bytes to align source pointer
    // (assumes bytes covers at least the alignment slack)
    while (((uintptr_t)b_src & 0x3) != 0)
    {
        *b_dst++ = *b_src++;
        bytes--;
    }

    unsigned int* w_dst = (unsigned int*)b_dst;
    unsigned int* w_src = (unsigned int*)b_src;
    while (bytes >= 4)
    {
        *w_dst++ = *w_src++;
        bytes -= 4;
    }

    // Copy trailing bytes
    if (bytes > 0)
    {
        b_dst = (unsigned char*)w_dst;
        b_src = (unsigned char*)w_src;
        while (bytes > 0)
        {
            *b_dst++ = *b_src++;
            bytes--;
        }
    }
}
Different architectures will perform differently based on whether the source or the destination pointer is appropriately aligned. For instance, on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.
To further improve performance, some loop unrolling can be done, so that more of the processor's registers are loaded with data; that means the load/store instructions can be interleaved and their latency hidden by additional instructions (such as loop counting, etc.). The benefit this brings varies quite a bit by processor, since load/store instruction latencies can be quite different.
At this stage the code ends up being written in assembly rather than C (or C++), since you need to manually place the load and store instructions to get the maximum benefit of latency hiding and throughput.
Generally a whole cache line of data should be copied in one iteration of the unrolled loop.
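As a rough sketch of those two points together (my illustration, with the cache line assumed to be 64 bytes): the loop below copies one 64-byte line per iteration by unrolling eight 8-byte moves, so the loop branch and counter update are paid once per line rather than once per word. It assumes both pointers are 8-byte aligned and bytes is a multiple of 64.

#include <stdint.h>
#include <stddef.h>

/* Sketch only: copy one assumed-64-byte cache line per iteration.
 * Assumes 8-byte-aligned pointers and bytes % 64 == 0. */
void cache_line_copy(void* dst, const void* src, size_t bytes)
{
    uint64_t* d = (uint64_t*)dst;
    const uint64_t* s = (const uint64_t*)src;
    while (bytes >= 64) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8;
        s += 8;
        bytes -= 64;
    }
}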
Which brings me to the next improvement: adding pre-fetching. These are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data is available just as it is to be copied, no sooner and no later.
This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the ones in the middle of the copy loop fetching data that will be copied several iterations later.
I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.
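A minimal sketch of the prefetching idea, using the GCC/Clang __builtin_prefetch builtin (an assumption on my part; other compilers spell this differently, and hand-written routines use the raw instructions): each iteration copies one 64-byte line and asks the cache to start loading data about four lines ahead. That distance is a guess that would need tuning per processor, as described above.

#include <stdint.h>
#include <stddef.h>

/* Sketch only: copy lines 64-byte lines, prefetching ahead of the
 * current read position. Assumes aligned pointers. */
void prefetch_copy(uint64_t* dst, const uint64_t* src, size_t lines)
{
    while (lines--) {
        __builtin_prefetch(src + 32);  /* read-prefetch ~4 lines (256 bytes) ahead */
        for (int j = 0; j < 8; ++j)    /* copy one 64-byte line */
            dst[j] = src[j];
        dst += 8;
        src += 8;
    }
}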
Factors
The main factors that affect how fast memory can be copied are:
- The latency between the processor, its caches, and main memory.
- The size and structure of the processor's cache lines.
- The processor's memory move/copy instructions (latency, throughput, register size, etc).
So if you want to write an efficient and fast memory copy routine, you'll need to know quite a lot about the processor and architecture you are writing for. Suffice it to say, unless you're writing on some embedded platform, it would be much easier to just use the built-in memory copy routines.
Answered by Mark Byers
memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.
From one example implementation:
/*
 * For speedy copying, optimize the common case where both pointers
 * and the length are word-aligned, and copy word-at-a-time instead
 * of byte-at-a-time.  Otherwise, copy by bytes.
 */
Answered by Danny Dulai
You can implement memcpy() using any of the following techniques, some dependent on your architecture for performance gains, and they will all be much faster than your code:
- Use larger units, such as 32-bit words instead of bytes. You can also (or may have to) deal with alignment here as well. On some platforms you can't read/write a 32-bit word at an odd memory location, and on other platforms you pay a massive performance penalty. To fix this, the address has to be divisible by 4. You can take this up to 64 bits for 64-bit CPUs, or even higher using SIMD (Single Instruction, Multiple Data) instructions (MMX, SSE, etc.).
- Use special CPU instructions that your compiler may not be able to generate from C. For example, on an 80386, you can use the "rep" prefix + "movsb" instruction to move N bytes, where N is placed in the count register (see the sketch after this list). Good compilers will just do this for you, but you may be on a platform that lacks a good compiler. Note that this example tends to be a bad demonstration of speed, but combined with alignment + larger-unit instructions, it can be faster than mostly everything else on certain CPUs.
- Loop unrolling: branches can be quite expensive on some CPUs, so unrolling the loops can lower the number of branches. This is also a good technique for combining with SIMD instructions and very large units.
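As a sketch of the second technique (my illustration using GCC/Clang extended inline assembly on x86/x86-64, not the form any particular compiler or library emits):

#include <stddef.h>

/* Sketch only: the destination goes in (r)di, the source in (r)si and
 * the count in (r)cx; the rep prefix makes movsb copy one byte per
 * repetition in microcode. */
static void rep_movsb_copy(void* dst, const void* src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}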
For example, http://www.agner.org/optimize/#asmlib has a memcpy implementation that beats most out there (by a very tiny amount). If you read the source code, it will be full of tons of inlined assembly code that pulls off all of the above three techniques, choosing among them based on what CPU you are running on.
Note, there are similar optimizations that can be made for finding bytes in a buffer too. strchr() and friends will often be faster than your hand-rolled equivalent. This is especially true for .NET and Java. For example, in .NET, the built-in String.IndexOf() is much faster than even a Boyer–Moore string search, because it uses the above optimization techniques.
Answered by moshbear
Short answer:
- cache fill
- word-sized transfers instead of byte ones where possible
- SIMD magic
Answered by NPE
I don't know whether it is actually used in any real-world implementations of memcpy, but I think Duff's Device deserves a mention here.
From Wikipedia:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
Note that the above isn't a memcpy since it deliberately doesn't increment the to pointer. It implements a slightly different operation: writing into a memory-mapped register. See the Wikipedia article for details.
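For comparison, here is my adaptation (not from the answer or the article) of the same switch-into-loop unrolling turned into an actual copy by incrementing both pointers:

/* My adaptation, for illustration only: Duff's Device behaving like a
 * copy of count shorts. Assumes count >= 1. */
void duff_copy(short* to, const short* from, int count)
{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}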
Answered by VoidStar
Like others say, memcpy copies in chunks larger than 1 byte. Copying in word-sized chunks is much faster. However, most implementations take it a step further and run several MOV (word) instructions before looping. The advantage of copying in, say, 8-word blocks per loop is that the loop itself is costly. This technique reduces the number of conditional branches by a factor of 8, optimizing the copy for giant blocks.
Answered by masoud
The answers are great, but if you still want to implement a fast memcpy yourself, there is an interesting blog post about it: Fast memcpy in C.
void *memcpy(void* dest, const void* src, size_t count)
{
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;

    if (count & 1) {            /* copy the odd byte first */
        dst8[0] = src8[0];
        dst8 += 1;
        src8 += 1;
    }

    count /= 2;
    while (count--) {           /* then copy two bytes per iteration */
        dst8[0] = src8[0];
        dst8[1] = src8[1];
        dst8 += 2;
        src8 += 2;
    }

    return dest;
}
It can be made even better by optimizing memory accesses.
Answered by BillThor
Because, like many library routines, it has been optimized for the architecture you are running on. Others have posted various techniques which can be used.
Given the choice, use library routines rather than rolling your own. This is a variation on DRY that I call DRO (Don't Repeat Others). Also, library routines are less likely to be wrong than your own implementation.
I have seen memory access checkers complain about out-of-bounds reads on memory or string buffers that were not a multiple of the word size. This is a result of the optimization being used.
Answered by gnasher729
You can look at the MacOS implementation of memset, memcpy and memmove.
At boot time, the OS determines which processor it's running on. It has built-in, specifically optimised code for each supported processor, and at boot time stores a jmp instruction to the right code in a fixed read-only location.
The C memset, memcpy and memmove implementations are just a jump to that fixed location.
The implementations use different code depending on the alignment of the source and destination for memcpy and memmove. They obviously use all available vector capabilities. They also use non-caching variants when you copy large amounts of data, and have instructions to minimise waits for page tables. It's not just assembler code; it's assembler code written by someone with extremely good knowledge of each processor architecture.
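As an illustration of the non-caching variants mentioned above (my sketch, not Apple's code): SSE2 non-temporal stores such as _mm_stream_si128 bypass the cache, so a huge copy does not evict everything else from it. The sketch assumes dst is 16-byte aligned and n is a multiple of 16.

#include <emmintrin.h>  /* SSE2; compile for x86/x86-64 */
#include <stddef.h>

/* Sketch only: non-temporal 16-byte stores that do not pollute the
 * cache. Assumes 16-byte-aligned dst and n % 16 == 0. */
void stream_copy(void* dst, const void* src, size_t n)
{
    char* d = (char*)dst;
    const char* s = (const char*)src;
    for (size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i*)(s + i));
        _mm_stream_si128((__m128i*)(d + i), chunk);
    }
    _mm_sfence();  /* make the streamed stores visible to other cores */
}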
Intel also added assembler instructions that can make string operations faster, for example an instruction to support strstr that does 256 byte compares in one cycle.