Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4707012/

Time: 2020-08-28 16:13:25  Source: igfitidea

Is it better to use std::memcpy() or std::copy() in terms of performance?

c++ performance optimization

Asked by user576670

Is it better to use memcpy as shown below, or is it better to use std::copy() in terms of performance? Why?

char *bits = NULL;
...

bits = new (std::nothrow) char[((int *) copyMe->bits)[0]];
if (bits == NULL)
{
    cout << "ERROR Not enough memory.\n";
    exit(1);
}

memcpy (bits, copyMe->bits, ((int *) copyMe->bits)[0]);

Answered by David Stone

I'm going to go against the general wisdom here that std::copy will have a slight, almost imperceptible performance loss. I just did a test and found that to be untrue: I did notice a performance difference. However, the winner was std::copy.

I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That 300 loop counter is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large of chunks as possible (many other implementations operate with char / char *, whereas I operate with T / T *, where T is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can use is central to the performance of my algorithm. These are my results:

Time (in seconds) to complete run of SHA-2 tests

std::copy   memcpy  % increase
6.11        6.29    2.86%
6.09        6.28    3.03%
6.10        6.29    3.02%
6.08        6.27    3.03%
6.08        6.27    3.03%

Total average increase in speed of std::copy over memcpy: 2.99%

My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.

Code for my SHA-2 implementations.

I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I'm guessing there was some sort of OS activity going on. I decided to start over.

Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.

These are my final 10 results:

Time (in seconds) to complete run of MD5 tests

std::copy   memcpy      % difference
5.52        5.56        +0.72%
5.56        5.55        -0.18%
5.57        5.53        -0.72%
5.57        5.52        -0.91%
5.56        5.57        +0.18%
5.56        5.57        +0.18%
5.56        5.53        -0.54%
5.53        5.57        +0.72%
5.59        5.57        -0.36%
5.57        5.56        -0.18%

Total average decrease in speed of std::copy over memcpy: 0.11%

Code for my MD5 implementation

These results suggest that there is some optimization that std::copy used in my SHA-2 tests that it could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called std::copy / memcpy. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.

I did a little bit more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link-time optimization. These are my results with LTO turned on (option -flto in gcc):

Time (in seconds) to complete run of MD5 tests with -flto

std::copy   memcpy      % difference
5.54        5.57        +0.54%
5.50        5.53        +0.54%
5.54        5.58        +0.72%
5.50        5.57        +1.26%
5.54        5.58        +0.72%
5.54        5.57        +0.54%
5.54        5.56        +0.36%
5.54        5.58        +0.72%
5.51        5.58        +1.25%
5.54        5.57        +0.54%

Total average increase in speed of std::copy over memcpy: 0.72%

In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.

Explanation of results

So why might std::copy give a performance boost?

First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real-world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".

However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first handling the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If it is all guaranteed to be aligned, then the code becomes simpler and faster, and easier for the branch predictor in your processor to get correct.

Premature optimization?

std::copy is in an interesting position. I expect it to never be slower than memcpy and sometimes faster with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy only works on pointers, whereas std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.

Answered by Peter Alexander

All compilers I know will replace a simple std::copy with a memcpy when it is appropriate, or even better, vectorize the copy so that it would be even faster than a memcpy.

In any case: profile and find out for yourself. Different compilers will do different things, and it's quite possible it won't do exactly what you ask.

See this presentation on compiler optimisations (pdf).

Here's what GCC does for a simple std::copy of a POD type.

#include <algorithm>

struct foo
{
  int x, y;    
};

void bar(foo* a, foo* b, size_t n)
{
  std::copy(a, a + n, b);
}

Here's the disassembly (with only -O optimisation), showing the call to memmove:

bar(foo*, foo*, unsigned long):
    salq    , %rdx
    sarq    , %rdx
    testq   %rdx, %rdx
    je  .L5
    subq    , %rsp
    movq    %rsi, %rax
    salq    , %rdx
    movq    %rdi, %rsi
    movq    %rax, %rdi
    call    memmove
    addq    , %rsp
.L5:
    rep
    ret

If you change the function signature to

void bar(foo* __restrict a, foo* __restrict b, size_t n)

then the memmove becomes a memcpy, for a slight performance improvement. Note that memcpy itself will be heavily vectorised.

Answered by Puppy

Always use std::copy, because memcpy is limited to only C-style POD structures, and the compiler will likely replace calls to std::copy with memcpy if the targets are in fact POD.

Plus, std::copy can be used with many iterator types, not just pointers. std::copy is more flexible for no performance loss and is the clear winner.

Answered by Charles Salvia

In theory, memcpy might have a slight, imperceptible, infinitesimal performance advantage, only because it doesn't have the same requirements as std::copy. From the man page of memcpy:

To avoid overflows, the size of the arrays pointed to by both the destination and source parameters shall be at least num bytes, and should not overlap (for overlapping memory blocks, memmove is a safer approach).

In other words, memcpy can ignore the possibility of overlapping data. (Passing overlapping arrays to memcpy is undefined behavior.) So memcpy doesn't need to explicitly check for this condition, whereas std::copy can be used as long as the OutputIterator parameter is not in the source range. Note this is not the same as saying that the source range and destination range can't overlap.

So since std::copy has somewhat different requirements, in theory it should be slightly (with an extreme emphasis on slightly) slower, since it probably will check for overlapping C-arrays, or else delegate the copying of C-arrays to memmove, which needs to perform the check. But in practice, you (and most profilers) probably won't even detect any difference.

Of course, if you're not working with PODs, you can't use memcpy anyway.

Answered by UmmaGumma

My rule is simple. If you are using C++, prefer C++ libraries, not C :)

Answered by einpoklum

If you really need maximum copying performance (which you might not), use neither of them.

There's a lot that can be done to optimize memory copying, and even more if you're willing to use multiple threads/cores for it. See, for example:

What's missing/sub-optimal in this memcpy implementation?

Both the question and some of the answers have suggested implementations or links to implementations.

Answered by Grumbel

Just a minor addition: the speed difference between memcpy() and std::copy() can vary quite a bit depending on whether optimizations are enabled or disabled. With g++ 6.2.0 and without optimizations, memcpy() clearly wins:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy            17 ns         17 ns   40867738
bm_stdcopy           62 ns         62 ns   11176219
bm_stdcopy_n         72 ns         72 ns    9481749

When optimizations are enabled (-O3), everything looks pretty much the same again:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy             3 ns          3 ns  274527617
bm_stdcopy            3 ns          3 ns  272663990
bm_stdcopy_n          3 ns          3 ns  274732792

The bigger the array, the less noticeable the effect gets, but even at N=1000, memcpy() is about twice as fast when optimizations aren't enabled.

Source code (requires Google Benchmark):

#include <string.h>
#include <algorithm>
#include <vector>
#include <benchmark/benchmark.h>

constexpr int N = 10;

void bm_memcpy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    memcpy(r.data(), a.data(), N * sizeof(int));
  }
}

void bm_stdcopy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy(a.begin(), a.end(), r.begin());
  }
}

void bm_stdcopy_n(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy_n(a.begin(), N, r.begin());
  }
}

BENCHMARK(bm_memcpy);
BENCHMARK(bm_stdcopy);
BENCHMARK(bm_stdcopy_n);

BENCHMARK_MAIN()

/* EOF */

Answered by imatveev13

Profiling shows that the statement "std::copy() is always as fast as memcpy(), or faster" is false.

My system:

HP-Compaq-dx7500-Microtower 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux.

gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

The code (C++):

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <algorithm>
#include <vector>

// DPROFILE is the author's scoped profiling macro (definition not shown);
// define it as a no-op so the code compiles stand-alone.
#ifndef DPROFILE
#define DPROFILE
#endif

const uint32_t arr_size = (1080 * 720 * 3); // HD image in rgb24
const uint32_t iterations = 100000;
uint8_t arr1[arr_size];
uint8_t arr2[arr_size];
std::vector<uint8_t> v;

int main(){
    {
        DPROFILE;
        memcpy(arr1, arr2, sizeof(arr1));
        printf("memcpy()\n");
    }

    v.reserve(sizeof(arr1));
    {
        DPROFILE;
        std::copy(arr1, arr1 + sizeof(arr1), v.begin());
        printf("std::copy()\n");
    }

    {
        time_t t = time(NULL);
        for(uint32_t i = 0; i < iterations; ++i)
            memcpy(arr1, arr2, sizeof(arr1));
        printf("memcpy()    elapsed %ld s\n", time(NULL) - t);
    }

    {
        time_t t = time(NULL);
        for(uint32_t i = 0; i < iterations; ++i)
            std::copy(arr1, arr1 + sizeof(arr1), v.begin());
        printf("std::copy() elapsed %ld s\n", time(NULL) - t);
    }
}

g++ -O0 -o test_stdcopy test_stdcopy.cpp

memcpy() profile: main:21: now:1422969084:04859 elapsed:2650 us
std::copy() profile: main:27: now:1422969084:04862 elapsed:2745 us
memcpy() elapsed 44 s std::copy() elapsed 45 s

g++ -O3 -o test_stdcopy test_stdcopy.cpp

memcpy() profile: main:21: now:1422969601:04939 elapsed:2385 us
std::copy() profile: main:28: now:1422969601:04941 elapsed:2690 us
memcpy() elapsed 27 s std::copy() elapsed 43 s

Red Alert pointed out that the code uses memcpy from array to array and std::copy from array to vector. That could be a reason for faster memcpy.

Since there is

v.reserve(sizeof(arr1));

there should be no difference between copying to the vector and copying to an array. (Strictly speaking, reserve() changes only the capacity, not the size, so writing through v.begin() is still out of bounds; but it does make the timings comparable.)

The code was fixed to use arrays in both cases. memcpy is still faster:

{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        memcpy(arr1, arr2, sizeof(arr1));
    printf("memcpy()    elapsed %ld s\n", time(NULL) - t);
}

{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        std::copy(arr1, arr1 + sizeof(arr1), arr2);
    printf("std::copy() elapsed %ld s\n", time(NULL) - t);
}

memcpy()    elapsed 44 s
std::copy() elapsed 48 s