faster alternative to memcpy? (C)

Warning: the content below is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/2963898/
Asked by Tony Stark
I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
Answered by nos
memcpy is likely to be the fastest way you can copy bytes around in memory. If you need something faster - try figuring out a way of not copying things around, e.g. swap pointers only, not the data itself.
Answered by Serge Rogatch
This is an answer for x86_64 with AVX2 instruction set present. Though something similar may apply for ARM/AArch64 with SIMD.
On a Ryzen 1800X with a single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() with the MSVC++ 2017 compiler. If you fill both memory channels with 2 DDR4 modules, i.e. all 4 DDR4 slots are busy, you may get a further 2x speedup. For triple-(quad-)channel memory systems, you can get a further 1.5x (2.0x) speedup if the code is extended to analogous AVX512 code. Triple/quad-channel systems with only AVX2 and all slots busy are not expected to be faster, because to load them fully you need to load/store more than 32 bytes at once (48 bytes for triple- and 64 bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Though multithreading on some systems can alleviate this without AVX512 or even AVX2.
So here is the copy code that assumes you are copying a large block of memory whose size is a multiple of 32 and the block is 32-byte aligned.
For non-multiple sizes and non-aligned blocks, prologue/epilogue code can be written that reduces the width to 16 (SSE4.1), 8, 4, 2 and finally 1 byte at a time for the block head and tail. Also, in the middle, a local array of 2-3 __m256i values can be used as a proxy between aligned reads from the source and aligned writes to the destination.
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

/* ... */

void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {
  assert(nBytes % 32 == 0);
  assert((intptr_t(pvDest) & 31) == 0);
  assert((intptr_t(pvSrc) & 31) == 0);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);
  __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);
  int64_t nVects = nBytes / sizeof(*pSrc);
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
  _mm_sfence();
}
A key feature of this code is that it bypasses the CPU cache when copying: when the CPU cache is involved (i.e. AVX instructions without _stream_ are used), the copy speed drops severalfold on my system.
My DDR4 memory is 2.6 GHz CL13. So when copying 8 GB of data from one array to another I got the following speeds:
memcpy(): 17 208 004 271 bytes/sec.
Stream copy: 26 842 874 528 bytes/sec.
Note that in these measurements the total size of both input and output buffers is divided by the number of seconds elapsed, because for each byte of the array there are 2 memory accesses: one to read the byte from the input array, another to write the byte to the output array. In other words, when copying 8 GB from one array to another, you do 16 GB worth of memory-access operations.
Moderate multithreading can further improve performance by about 1.44 times, so the total increase over memcpy() reaches 2.55 times on my machine.
Here's how stream copy performance depends on the number of threads used on my machine:
Stream copy 1 threads: 27114820909.821 bytes/sec
Stream copy 2 threads: 37093291383.193 bytes/sec
Stream copy 3 threads: 39133652655.437 bytes/sec
Stream copy 4 threads: 39087442742.603 bytes/sec
Stream copy 5 threads: 39184708231.360 bytes/sec
Stream copy 6 threads: 38294071248.022 bytes/sec
Stream copy 7 threads: 38015877356.925 bytes/sec
Stream copy 8 threads: 38049387471.070 bytes/sec
Stream copy 9 threads: 38044753158.979 bytes/sec
Stream copy 10 threads: 37261031309.915 bytes/sec
Stream copy 11 threads: 35868511432.914 bytes/sec
Stream copy 12 threads: 36124795895.452 bytes/sec
Stream copy 13 threads: 36321153287.851 bytes/sec
Stream copy 14 threads: 36211294266.431 bytes/sec
Stream copy 15 threads: 35032645421.251 bytes/sec
Stream copy 16 threads: 33590712593.876 bytes/sec
The code is:
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
}

void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {
  assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);
  __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);
  const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);
  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    lldiv_t perWorker = lldiv((long long)nVects, (long long)nThreads);
    int64_t nextStart = 0;
    for (uint32_t i = 0; i < nThreads; i++) {
      const int64_t curStart = nextStart;
      nextStart += perWorker.quot;
      if ((long long)i < perWorker.rem) {
        nextStart++;
      }
      thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc + curStart, nextStart - curStart);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    _mm_sfence();
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);
    thrs.clear();
  }
}
Answered by INS
Please offer us more details. On the i386 architecture it is very possible that memcpy is the fastest way of copying. But on a different architecture for which the compiler doesn't have an optimized version, it is best that you rewrite your memcpy function. I did this on a custom ARM architecture using assembly language. If you transfer BIG chunks of memory, then DMA is probably the answer you are looking for.
Please offer more details - architecture, operating system (if relevant).
Answered by user2009004
Actually, memcpy is NOT the fastest way, especially if you call it many times. I also had some code that I really needed to speed up, and memcpy is slow because it has too many unnecessary checks. For example, it checks whether the destination and source memory blocks overlap and whether it should start copying from the back of the block rather than the front. If you do not care about such considerations, you can certainly do significantly better. I have some code, but here is perhaps an even better version:
Very fast memcpy for image processing?
If you search, you can find other implementations as well. But for true speed you need an assembly version.
Answered by sharptooth
Usually the standard library shipped with the compiler will already implement memcpy() in the fastest way possible for the target platform.
Answered by High Performance Mark
It's generally faster not to make a copy at all. Whether you can adapt your function to avoid copying I don't know, but it's worth looking into.
Answered by Patrick
Sometimes functions like memcpy, memset, ... are implemented in two different ways:
- once as a real function
- once as some assembly that's immediately inlined
Not all compilers use the inlined-assembly version by default; your compiler may use the function variant by default, causing some overhead because of the function call. Check your compiler to see how to use the intrinsic variant of the function (command-line option, pragmas, ...).
Edit: See http://msdn.microsoft.com/en-us/library/tzkfha43%28VS.80%29.aspx for an explanation of intrinsics on the Microsoft C compiler.
Answered by Yousf
Check your compiler/platform manual. For some micro-processors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.
Answered by Andrew McGregor
If your platform supports it, look into whether you can use the mmap() system call to leave your data in the file... generally the OS can manage that better. And, as everyone has been saying, avoid copying if at all possible; pointers are your friend in cases like this.
Answered by Dorin Lazăr
You should check the assembly code generated for your code. What you don't want is to have the memcpy call generate a call to the memcpy function in the standard library - what you want is a repeated call to the best ASM instruction to copy the largest amount of data - something like rep movsq.
How can you achieve this? Well, the compiler optimizes calls to memcpy by replacing them with simple movs as long as it knows how much data it should copy. You can see this if you write a memcpy with a well-determined (constexpr) value. If the compiler doesn't know the value, it will have to fall back to the byte-level implementation of memcpy - the issue being that memcpy has to respect the one-byte granularity. It will still move 128 bits at a time, but after each 128b it will have to check whether it has enough data to copy as 128b, or whether it has to fall back to 64 bits, then to 32 and 8 (I think that 16 might be suboptimal anyway, but I don't know for sure).
So what you want is to be able to tell memcpy the size of your data with const expressions that the compiler can optimize. This way no call to memcpy is performed. What you don't want is to pass to memcpy a variable that will only be known at run-time. That translates into a function call and tons of tests to check the best copy instruction. Sometimes, a simple for loop is better than memcpy for this reason (eliminating one function call). And what you really, really don't want is to pass to memcpy an odd number of bytes to copy.

