Is it better to use std::memcpy() or std::copy() in terms of performance? (C++)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, note the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/4707012/
Asked by user576670
Is it better to use memcpy as shown below, or is it better to use std::copy(), in terms of performance? Why?
char *bits = NULL;
...
bits = new (std::nothrow) char[((int *) copyMe->bits)[0]];
if (bits == NULL)
{
    cout << "ERROR Not enough memory.\n";
    exit(1);
}
memcpy (bits, copyMe->bits, ((int *) copyMe->bits)[0]);
Answered by David Stone
I'm going to go against the general wisdom here that std::copy will have a slight, almost imperceptible performance loss. I just ran a test and found that to be untrue: I did notice a performance difference. However, the winner was std::copy.
I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.Timer. That 300-iteration loop count is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large chunks as possible: many other implementations operate on char / char *, whereas I operate on T / T *, where T is the largest type in the user's implementation that has correct overflow behavior. Fast memory access on the largest types I can use is therefore central to the performance of my algorithm. These are my results:
Time (in seconds) to complete run of SHA-2 tests
std::copy memcpy % increase
6.11 6.29 2.86%
6.09 6.28 3.03%
6.10 6.29 3.02%
6.08 6.27 3.03%
6.08 6.27 3.03%
Total average increase in speed of std::copy over memcpy: 2.99%
My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.
Code for my SHA-2 implementations.
I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I'm guessing there was some sort of OS activity going on. I decided to start over.
Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.
These are my final 10 results:
Time (in seconds) to complete run of MD5 tests
std::copy memcpy % difference
5.52 5.56 +0.72%
5.56 5.55 -0.18%
5.57 5.53 -0.72%
5.57 5.52 -0.91%
5.56 5.57 +0.18%
5.56 5.57 +0.18%
5.56 5.53 -0.54%
5.53 5.57 +0.72%
5.59 5.57 -0.36%
5.57 5.56 -0.18%
Total average decrease in speed of std::copy over memcpy: 0.11%
Code for my MD5 implementation
These results suggest that there is some optimization that std::copy could use in my SHA-2 tests but could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called std::copy / memcpy. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.
I did a little bit more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link-time optimization. These are my results with LTO turned on (option -flto in gcc):
Time (in seconds) to complete run of MD5 tests with -flto
std::copy memcpy % difference
5.54 5.57 +0.54%
5.50 5.53 +0.54%
5.54 5.58 +0.72%
5.50 5.57 +1.26%
5.54 5.58 +0.72%
5.54 5.57 +0.54%
5.54 5.56 +0.36%
5.54 5.58 +0.72%
5.51 5.58 +1.25%
5.54 5.57 +0.54%
Total average increase in speed of std::copy over memcpy: 0.72%
In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.
Explanation of results
So why might std::copy give a performance boost?
First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real-world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".
However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first handling the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If the whole range is guaranteed to be aligned, the code becomes simpler and faster, and easier for the branch predictor in your processor to get right.
Premature optimization?
std::copy is in an interesting position. I expect it to never be slower than memcpy, and sometimes faster, with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy only works on pointers; std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.
Answered by Peter Alexander
All compilers I know will replace a simple std::copy with a memcpy when it is appropriate, or, even better, vectorize the copy so that it is even faster than a memcpy.
In any case: profile and find out for yourself. Different compilers will do different things, and it's quite possible that yours won't do exactly what you expect.
See this presentation on compiler optimisations (pdf).
Here's what GCC does for a simple std::copy of a POD type.
#include <algorithm>
#include <cstddef>  // for size_t

struct foo
{
    int x, y;
};

void bar(foo* a, foo* b, size_t n)
{
    std::copy(a, a + n, b);
}
Here's the disassembly (with only -O optimisation), showing the call to memmove:
bar(foo*, foo*, unsigned long):
    salq    $3, %rdx
    sarq    $3, %rdx
    testq   %rdx, %rdx
    je      .L5
    subq    $8, %rsp
    movq    %rsi, %rax
    salq    $3, %rdx
    movq    %rdi, %rsi
    movq    %rax, %rdi
    call    memmove
    addq    $8, %rsp
.L5:
    rep
    ret
If you change the function signature to
void bar(foo* __restrict a, foo* __restrict b, size_t n)
then the memmove becomes a memcpy, for a slight performance improvement. Note that memcpy itself will be heavily vectorised.
Answered by Puppy
Always use std::copy, because memcpy is limited to C-style POD structures only, and the compiler will likely replace calls to std::copy with memcpy anyway if the targets are in fact POD.
Plus, std::copy can be used with many iterator types, not just pointers. std::copy is more flexible for no performance loss and is the clear winner.
Answered by Charles Salvia
In theory, memcpy might have a slight, imperceptible, infinitesimal performance advantage, but only because it doesn't have the same requirements as std::copy. From the man page of memcpy:
To avoid overflows, the size of the arrays pointed by both the destination and source parameters, shall be at least num bytes, and should not overlap(for overlapping memory blocks, memmove is a safer approach).
In other words, memcpy can ignore the possibility of overlapping data. (Passing overlapping arrays to memcpy is undefined behavior.) So memcpy doesn't need to explicitly check for this condition, whereas std::copy can be used as long as the OutputIterator parameter is not in the source range. Note this is not the same as saying that the source range and destination range can't overlap.
So since std::copy has somewhat different requirements, in theory it should be slightly (with an extreme emphasis on "slightly") slower, since it probably will check for overlapping C-arrays, or else delegate the copying of C-arrays to memmove, which needs to perform the check. But in practice, you (and most profilers) probably won't even detect any difference.
Of course, if you're not working with PODs, you can't use memcpy anyway.
Answered by UmmaGumma

My rule is simple. If you are using C++, prefer C++ libraries and not C :)
Answered by einpoklum

If you really need maximum copying performance (which you might not), use neither of them.
There's a lot that can be done to optimize memory copying, and even more if you're willing to use multiple threads/cores for it. See, for example:
What's missing/sub-optimal in this memcpy implementation?
Both the question and some of its answers have suggested implementations or links to implementations.
Answered by Grumbel
Just a minor addition: the speed difference between memcpy() and std::copy() can vary quite a bit depending on whether optimizations are enabled or disabled. With g++ 6.2.0 and without optimizations, memcpy() clearly wins:
Benchmark Time CPU Iterations
---------------------------------------------------
bm_memcpy 17 ns 17 ns 40867738
bm_stdcopy 62 ns 62 ns 11176219
bm_stdcopy_n 72 ns 72 ns 9481749
When optimizations are enabled (-O3), everything looks pretty much the same again:
Benchmark Time CPU Iterations
---------------------------------------------------
bm_memcpy 3 ns 3 ns 274527617
bm_stdcopy 3 ns 3 ns 272663990
bm_stdcopy_n 3 ns 3 ns 274732792
The bigger the array, the less noticeable the effect gets, but even at N=1000, memcpy() is about twice as fast when optimizations aren't enabled.
Source code (requires Google Benchmark):
#include <string.h>
#include <algorithm>
#include <vector>
#include <benchmark/benchmark.h>

constexpr int N = 10;

void bm_memcpy(benchmark::State& state)
{
    std::vector<int> a(N);
    std::vector<int> r(N);

    while (state.KeepRunning())
    {
        memcpy(r.data(), a.data(), N * sizeof(int));
    }
}

void bm_stdcopy(benchmark::State& state)
{
    std::vector<int> a(N);
    std::vector<int> r(N);

    while (state.KeepRunning())
    {
        std::copy(a.begin(), a.end(), r.begin());
    }
}

void bm_stdcopy_n(benchmark::State& state)
{
    std::vector<int> a(N);
    std::vector<int> r(N);

    while (state.KeepRunning())
    {
        std::copy_n(a.begin(), N, r.begin());
    }
}

BENCHMARK(bm_memcpy);
BENCHMARK(bm_stdcopy);
BENCHMARK(bm_stdcopy_n);

BENCHMARK_MAIN()

/* EOF */
Answered by imatveev13
Profiling shows that the statement "std::copy() is always as fast as memcpy() or faster" is false.
My system:

HP-Compaq-dx7500-Microtower 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
The code (language: C++):
// Note: DPROFILE is the author's own scope-profiling macro (not shown).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <vector>

const uint32_t arr_size = (1080 * 720 * 3); //HD image in rgb24
const uint32_t iterations = 100000;
uint8_t arr1[arr_size];
uint8_t arr2[arr_size];
std::vector<uint8_t> v;

int main(){
    {
        DPROFILE;
        memcpy(arr1, arr2, sizeof(arr1));
        printf("memcpy()\n");
    }
    v.reserve(sizeof(arr1));
    {
        DPROFILE;
        std::copy(arr1, arr1 + sizeof(arr1), v.begin());
        printf("std::copy()\n");
    }
    {
        time_t t = time(NULL);
        for(uint32_t i = 0; i < iterations; ++i)
            memcpy(arr1, arr2, sizeof(arr1));
        printf("memcpy() elapsed %ld s\n", time(NULL) - t);
    }
    {
        time_t t = time(NULL);
        for(uint32_t i = 0; i < iterations; ++i)
            std::copy(arr1, arr1 + sizeof(arr1), v.begin());
        printf("std::copy() elapsed %ld s\n", time(NULL) - t);
    }
}
g++ -O0 -o test_stdcopy test_stdcopy.cpp
memcpy() profile: main:21: now:1422969084:04859 elapsed:2650 us
std::copy() profile: main:27: now:1422969084:04862 elapsed:2745 us
memcpy() elapsed 44 s
std::copy() elapsed 45 s

g++ -O3 -o test_stdcopy test_stdcopy.cpp
memcpy() profile: main:21: now:1422969601:04939 elapsed:2385 us
std::copy() profile: main:28: now:1422969601:04941 elapsed:2690 us
memcpy() elapsed 27 s
std::copy() elapsed 43 s
Red Alert pointed out that the code uses memcpy from array to array and std::copy from array to vector. That could be a reason for the faster memcpy.
Since there is

v.reserve(sizeof(arr1));

there should be no difference between copying to the vector and copying to the array.
The code was fixed to use an array in both cases. memcpy is still faster:
{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        memcpy(arr1, arr2, sizeof(arr1));
    printf("memcpy() elapsed %ld s\n", time(NULL) - t);
}
{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        std::copy(arr1, arr1 + sizeof(arr1), arr2);
    printf("std::copy() elapsed %ld s\n", time(NULL) - t);
}
memcpy() elapsed 44 s
std::copy() elapsed 48 s