C++: Why are elementwise additions much faster in separate loops than in a combined loop?
Disclaimer: this page is a translation of a popular StackOverflow question, published under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/8547778/
Why are elementwise additions much faster in separate loops than in a combined loop?
Asked by Johannes Gerer
Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop.
const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
Compiled with MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit, on an Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (please refer to my rephrased question at the bottom)
PS: I am not sure if this helps:
Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):
movsd xmm0,mmword ptr [edx+18h]
addsd xmm0,mmword ptr [ecx+20h]
movsd mmword ptr [ecx+20h],xmm0
movsd xmm0,mmword ptr [esi+10h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [edx+20h]
addsd xmm0,mmword ptr [ecx+28h]
movsd mmword ptr [ecx+28h],xmm0
movsd xmm0,mmword ptr [esi+18h]
addsd xmm0,mmword ptr [eax+38h]
Each loop of the double loop example produces this code (the following block is repeated about three times):
addsd xmm0,mmword ptr [eax+28h]
movsd mmword ptr [eax+28h],xmm0
movsd xmm0,mmword ptr [ecx+20h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [ecx+28h]
addsd xmm0,mmword ptr [eax+38h]
movsd mmword ptr [eax+38h],xmm0
movsd xmm0,mmword ptr [ecx+30h]
addsd xmm0,mmword ptr [eax+40h]
movsd mmword ptr [eax+40h],xmm0
EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:
Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?
It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.
PPS: Here is the full code. It uses TBB Tick_Count for higher-resolution timing, which can be disabled by not defining the TBB_TIMING macro:
#include <cstdio>
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>

//#define TBB_TIMING

#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;

void allo(int cont, int n)
{
    switch (cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch (cont) {
      case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
        // fall through: a1 is freed in both cases
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont, n);
#endif

#ifdef TBB_TIMING
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif

    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }

    double ret;

#ifdef TBB_TIMING
    tick_count t1 = tick_count::now();
    ret = 2.0 * double(n) * double(m) / (t1 - t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0 * double(n) * double(m) / double(end - start) * double(CLOCKS_PER_SEC);
#endif

#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}

int main()
{
    freopen("C:\\test.csv", "w", stdout);

    const char *s = " ";
    string na[2] = {"new_cont", "new_sep"};

    cout << "n";
    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif
    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, (int)nmax);
#endif

    for (long long n = 1L; n < nmax; n = max(n + 1, (long long)(n * 1.2)))
    {
        const long long m = 10000000 / n;
        cout << n;
        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain((int)n, (int)m, j, i);
        cout << endl;
    }

    return 0;
}
(It shows FLOP/s for different values of n.)
Accepted answer by Mysticial
Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four pointers. This will cause some level of cache bank/way conflicts.
If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.
This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform: accessing 4 ways is still slower than, say, 2 ways.
EDIT: It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page boundary.
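One way to check this directly is to print each allocation's offset within a page. A minimal sketch, assuming a 4096-byte page size (a platform assumption) and the same malloc sizes as below:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main(){
    const int n = 100000;
    double *p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (double*)malloc(n * sizeof(double));

    // If the allocator hands back fresh pages, all four offsets within
    // a 4096-byte page will typically be identical.
    for (int i = 0; i < 4; i++)
        printf("p[%d] = %p, page offset = %zu\n",
               i, (void*)p[i], (size_t)((uintptr_t)p[i] % 4096));

    for (int i = 0; i < 4; i++)
        free(p[i]);
    return 0;
}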
Here's the test code:
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
using namespace std;

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    // Zero the data to prevent any chance of denormals.
    memset(a1, 0, n * sizeof(double));
    memset(b1, 0, n * sizeof(double));
    memset(c1, 0, n * sizeof(double));
    memset(d1, 0, n * sizeof(double));

    // Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){
#ifdef ONE_LOOP
        for (int j = 0; j < n; j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for (int j = 0; j < n; j++){
            a1[j] += b1[j];
        }
        for (int j = 0; j < n; j++){
            c1[j] += d1[j];
        }
#endif
    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}
Benchmark Results:
EDIT: Results on an actual Core 2 architecture machine:
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:
#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206
#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116
//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894
//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993
Observations:
- 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.
- In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.
- In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one, as you would normally expect.
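To see why identical page offsets are harmful, here is a small sketch that maps the four base addresses from the first test run above onto cache sets, assuming a 32 KiB, 8-way L1 data cache with 64-byte lines (64 sets); with that geometry, any two addresses a multiple of 4096 bytes apart land in the same set:

#include <cstdint>
#include <cstdio>

// Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets,
// so the set index is bits 6..11 of the address.
static unsigned l1_set(uintptr_t addr) {
    return (unsigned)((addr / 64) % 64);
}

int main() {
    // The four base addresses printed by the first test run above.
    uintptr_t addrs[4] = { 0x00600020, 0x006D0020, 0x007A0020, 0x00870020 };
    for (int i = 0; i < 4; i++)
        printf("%#010lx -> L1 set %u\n",
               (unsigned long)addrs[i], l1_set(addrs[i]));
    // All four addresses print the same set number.
    return 0;
}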
As @Stephen Cannon points out in the comments, there is a very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:
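If one wanted to keep separate allocations but avoid the identical page offsets, a possible workaround (my sketch, not part of the original answer) is to skew each array's base address by a different multiple of a cache line:

#include <cstdlib>

// Hypothetical helper: allocate n doubles but offset the usable start of
// each array by idx * 64 bytes (one cache line) so the four base addresses
// no longer share the same offset within a 4 KiB page. Note the skewed
// pointer must not be passed to free(); real code would keep the raw one.
static double* alloc_skewed(size_t n, int idx) {
    char* raw = (char*)malloc(n * sizeof(double) + 4096);
    return (double*)(raw + (size_t)idx * 64);
}

int main() {
    const size_t n = 100000;
    double *a1 = alloc_skewed(n, 0);
    double *b1 = alloc_skewed(n, 1);
    double *c1 = alloc_skewed(n, 2);
    double *d1 = alloc_skewed(n, 3);
    // ... run the one-loop benchmark with a1/b1/c1/d1 as before ...
    (void)a1; (void)b1; (void)c1; (void)d1;
    return 0;
}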
5 Regions - Explanations
Region 1:
This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.
Region 2:
Here, as the data sizes increase, the amount of relative overhead goes down and the performance "saturates". Here two loops are slower because there is twice as much loop and branching overhead.
I'm not sure exactly what's going on here... Alignment could still play an effect, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)
Region 3:
At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.
Region 4:
The performance drop in the single-loop version is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.
However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.
Region 5:
At this point, nothing fits in the cache. So you're bound by memory bandwidth.
Answered by Johannes Gerer
OK, the right answer definitely has to do something with the CPU cache. But using the cache argument can be quite difficult, especially without data.
There are many answers that led to a lot of discussion, but let's face it: cache issues can be very complex and are not one-dimensional. They depend heavily on the size of the data, so my question was unfair: it turned out to be at a very interesting point in the cache graph.
@Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.
That's why I combined his test (using a continuous vs. separate allocation) and @James' answer's advice.
The graphs below show that most of the answers, and especially the majority of comments to the question and answers, can be considered completely wrong or true depending on the exact scenario and parameters used.
Note that my initial question was at n = 100,000. This point (by accident) exhibits special behavior:
- It possesses the greatest discrepancy between the one- and two-loop versions (almost a factor of three).
- It is the only point where the one-loop version (namely with continuous allocation) beats the two-loop version. (This made Mysticial's answer possible at all.)
The result using initialized data:
The result using uninitialized data (this is what Mysticial tested):
And this is a hard-to-explain one: initialized data that is allocated once and reused for every following test case of a different vector size:
Proposal
Every low-level performance related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache relevant data sizes! It's a waste of everybody's time to think of answers and especially discuss them with others without this information.
Answered by Puppy
The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.
Answered by OldCurmudgeon
Imagine you are working on a machine where n was just the right value for it only to be possible to hold two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.
Assuming a simple LIFO caching policy, this code:
for (int j = 0; j < n; j++){
    a[j] += b[j];
}
for (int j = 0; j < n; j++){
    c[j] += d[j];
}
would first cause a and b to be loaded into RAM and then be worked on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.
The other loop
for (int j = 0; j < n; j++){
    a[j] += b[j];
    c[j] += d[j];
}
will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.
You are probably not seeing disk caching in your tests but you are probably seeing the side effects of some other form of caching.
There seems to be a little confusion/misunderstanding here so I will try to elaborate a little using an example.
Say n = 2 and we are working with bytes. In my scenario, we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).
Assuming a fairly dumb caching policy of "if the byte is not in the cache, put it there and get the following byte too while we are at it", you will get a scenario something like this:
With

for(int j=0;j<n;j++){ a[j] += b[j]; }
for(int j=0;j<n;j++){ c[j] += d[j]; }

- cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
- set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
- Repeat for c and d.
- Total cost = (100 + 100 + 1 + 1) * 2 = 404

With

for(int j=0;j<n;j++){ a[j] += b[j]; c[j] += d[j]; }

- cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
- eject a[0], a[1], b[0], b[1] from cache, then cache c[0] and c[1] then d[0] and d[1] and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
- I suspect you are beginning to see where I am going.
- Total cost = (100 + 100 + 100 + 100) * 2 = 800
This is a classic cache thrash scenario.
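For illustration, here is a minimal simulation of the toy machine above (a 4-byte cache, a 100-unit miss cost that also prefetches the following byte, a 1-unit hit cost, FIFO eviction); it reproduces the 404 vs. 800 totals. The cache model is the dumb policy described, not a real CPU:

#include <algorithm>
#include <cstdio>
#include <initializer_list>
#include <vector>

// Toy cache: holds at most 4 byte addresses; a miss costs 100 and also
// pulls in the following byte, a hit costs 1; eviction is FIFO.
struct ToyCache {
    std::vector<int> lines;  // cached byte addresses
    long cost = 0;

    void touch(int addr) {
        if (std::find(lines.begin(), lines.end(), addr) != lines.end()) {
            cost += 1;  // hit
            return;
        }
        cost += 100;  // miss: slow memory access
        for (int a : {addr, addr + 1}) {  // fetch the byte and its neighbor
            if (std::find(lines.begin(), lines.end(), a) == lines.end()) {
                if (lines.size() == 4) lines.erase(lines.begin());  // evict
                lines.push_back(a);
            }
        }
    }
};

int main() {
    const int n = 2;
    // Byte addresses: a = 0..1, b = 2..3, c = 4..5, d = 6..7.
    ToyCache two;  // two separate loops
    for (int j = 0; j < n; j++) { two.touch(0 + j); two.touch(2 + j); }
    for (int j = 0; j < n; j++) { two.touch(4 + j); two.touch(6 + j); }

    ToyCache one;  // one combined loop
    for (int j = 0; j < n; j++) {
        one.touch(0 + j); one.touch(2 + j);
        one.touch(4 + j); one.touch(6 + j);
    }

    printf("separate loops cost = %ld\n", two.cost);  // prints 404
    printf("combined loop cost  = %ld\n", one.cost);  // prints 800
    return 0;
}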
Answered by Emilio Garavaglia
It's not because of different code, but because of caching: RAM is slower than the CPU registers, and a cache memory is inside the CPU to avoid writing to RAM every time a variable changes. But the cache is not as big as the RAM; hence, it maps only a fraction of it.
The first code modifies distant memory addresses, alternating between them in each loop iteration, thus continually requiring the cache to be invalidated.
The second code doesn't alternate: it just flows over adjacent addresses twice. This makes all the work complete in the cache, invalidating it only after the second loop starts.
Answered by Emilio Garavaglia
I cannot replicate the results discussed here.
I don't know if poor benchmark code is to blame, or what, but the two methods are within 10% of each other on my machine using the following code, and one loop is usually just slightly faster than two - as you'd expect.
Array sizes ranged from 2^16 to 2^24, using eight loops. I was careful to initialize the source arrays so the += assignment wasn't asking the FPU to add memory garbage interpreted as a double.
I played around with various schemes, such as putting the assignment of b[j], d[j] to InitToZero[j] inside the loops, and also with using += b[j] = 1 and += d[j] = 1, and I got fairly consistent results.
As you might expect, initializing b and d inside the loop using InitToZero[j] gave the combined approach an advantage, as they were done back-to-back before the assignments to a and c, but still within 10%. Go figure.
Hardware is a Dell XPS 8500 with a generation 3 Core i7 @ 3.4 GHz and 8 GB memory. For 2^16 to 2^24, using eight loops, the cumulative time was 44.987 and 40.965 respectively. Visual C++ 2010, fully optimized.
PS: I changed the loops to count down to zero, and the combined method was marginally faster. Scratching my head. Note the new array sizing and loop counts.
// MemBufferMystery.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <cmath>
#include <string>
#include <time.h>

#define dbl    double
#define MAX_ARRAY_SZ    262145    //16777216    // AKA (2^24)
#define STEP_SZ           1024    //   65536    // AKA (2^16)

int _tmain(int argc, _TCHAR* argv[]) {
    long i, j, ArraySz = 0, LoopKnt = 1024;
    clock_t start, Cumulative_Combined = 0, Cumulative_Separate = 0;
    dbl *a = NULL, *b = NULL, *c = NULL, *d = NULL, *InitToOnes = NULL;

    a = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    b = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    c = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    d = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    InitToOnes = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));

    // Initialize array to 1.0.
    for (j = 0; j < MAX_ARRAY_SZ; j++) {
        InitToOnes[j] = 1.0;
    }

    // Increase size of arrays and time
    for (ArraySz = STEP_SZ; ArraySz < MAX_ARRAY_SZ; ArraySz += STEP_SZ) {
        a = (dbl *)realloc(a, ArraySz * sizeof(dbl));
        b = (dbl *)realloc(b, ArraySz * sizeof(dbl));
        c = (dbl *)realloc(c, ArraySz * sizeof(dbl));
        d = (dbl *)realloc(d, ArraySz * sizeof(dbl));
        // Outside the timing loop, initialize
        // b and d arrays to 1.0 for consistent += performance.
        memcpy((void *)b, (void *)InitToOnes, ArraySz * sizeof(dbl));
        memcpy((void *)d, (void *)InitToOnes, ArraySz * sizeof(dbl));

        start = clock();
        for (i = LoopKnt; i; i--) {
            for (j = ArraySz - 1; j >= 0; j--) {  // count down; j = ArraySz would be out of bounds
                a[j] += b[j];
                c[j] += d[j];
            }
        }
        Cumulative_Combined += (clock() - start);
        printf("\n %6li milliseconds for combined array sizes %li and %li loops",
               (long)(clock() - start), ArraySz, LoopKnt);

        start = clock();
        for (i = LoopKnt; i; i--) {
            for (j = ArraySz - 1; j >= 0; j--) {
                a[j] += b[j];
            }
            for (j = ArraySz - 1; j >= 0; j--) {
                c[j] += d[j];
            }
        }
        Cumulative_Separate += (clock() - start);
        printf("\n %6li milliseconds for separate array sizes %li and %li loops \n",
               (long)(clock() - start), ArraySz, LoopKnt);
    }
    printf("\n Cumulative combined array processing took %10.3f seconds",
           (dbl)(Cumulative_Combined / (dbl)CLOCKS_PER_SEC));
    printf("\n Cumulative separate array processing took %10.3f seconds",
           (dbl)(Cumulative_Separate / (dbl)CLOCKS_PER_SEC));
    getchar();

    free(a); free(b); free(c); free(d); free(InitToOnes);
    return 0;
}
I'm not sure why it was decided that MFLOPS was a relevant metric. I thought the idea was to focus on memory accesses, so I tried to minimize the amount of floating-point computation time. I left in the +=, but I am not sure why.
A straight assignment with no computation would be a cleaner test of memory access time and would create a test that is uniform irrespective of the loop count. Maybe I missed something in the conversation, but it is worth thinking twice about. If the plus is left out of the assignment, the cumulative times are almost identical at 31 seconds each.
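For reference, a self-contained sketch of that assignment-only variant (the array size and loop count are arbitrary choices, not the ones used above):

#include <cstdio>
#include <cstdlib>
#include <ctime>

// Assignment-only variant: times pure memory traffic with no FP add.
int main() {
    const long n = 262144, loops = 1024;
    double *a = (double*)calloc(n, sizeof(double));
    double *b = (double*)calloc(n, sizeof(double));
    double *c = (double*)calloc(n, sizeof(double));
    double *d = (double*)calloc(n, sizeof(double));

    clock_t start = clock();
    for (long i = loops; i; i--) {
        for (long j = n - 1; j >= 0; j--) {
            a[j] = b[j];  // straight assignment instead of a[j] += b[j]
            c[j] = d[j];
        }
    }
    printf("assignment-only: %.3f seconds\n",
           (double)(clock() - start) / CLOCKS_PER_SEC);

    free(a); free(b); free(c); free(d);
    return 0;
}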
Answered by James
It's because the CPU doesn't have so many cache misses (where it has to wait for the array data to come from the RAM chips). It would be interesting for you to adjust the sizes of the arrays continually so that you exceed the sizes of the level 1 cache (L1), and then the level 2 cache (L2), of your CPU, and plot the time taken for your code to execute against the sizes of the arrays. The graph shouldn't be a straight line like you'd expect.
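A minimal sketch of that experiment, assuming std::chrono for timing (the sizes, repetition counts, and output format are arbitrary choices):

#include <chrono>
#include <cstdio>
#include <vector>

// Sweep the working set across the cache-size boundaries and report MFLOP/s.
int main() {
    for (size_t n = 1 << 10; n <= ((size_t)1 << 24); n <<= 1) {
        std::vector<double> a(n, 1.0), b(n, 1.0);
        const size_t reps = ((size_t)1 << 26) / n;  // keep total work roughly constant

        auto t0 = std::chrono::steady_clock::now();
        for (size_t r = 0; r < reps; r++)
            for (size_t j = 0; j < n; j++)
                a[j] += b[j];
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        // Printing a[0] also keeps the loop from being optimized away.
        printf("n = %9zu : %10.2f MFLOP/s (a[0] = %g)\n",
               n, (double)n * reps / secs / 1e6, a[0]);
    }
    return 0;
}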
Answered by Guillaume Kiz
The first loop alternates writing in each variable. The second and third ones only make small jumps of element size.
Try writing two parallel lines of 20 crosses with a pen and paper, the lines separated by 20 cm. Try once finishing one line and then the other, and try another time by writing a cross on each line alternately.
Answered by Francis Cugler
The Original Question
Why is one loop so much slower than two loops?
Conclusion:
Case 1 is a classic interpolation problem that happens to be an inefficient one. I also think that this was one of the leading reasons why many machine architectures and developers ended up building and designing multi-core systems with the ability to do multi-threaded applications as well as parallel programming.
Looking at it from this kind of an approach, without involving how the hardware, OS, and compiler(s) work together to do heap allocations that involve working with RAM, cache, page files, etc., the mathematics that is at the foundation of these algorithms shows us which of these two is the better solution.
We can use an analogy of a Boss being a Summation that will represent a For Loop that has to travel between workers A & B.
We can easily see that Case 2 is at least half as fast, if not a little more than that, compared to Case 1, due to the difference in the distance that is needed to travel and the time taken between the workers. This math lines up almost perfectly with both the benchmark times as well as the number of differences in assembly instructions.
I will now begin to explain how all of this works below.
Assessing The Problem
The OP's code:
const int n = 100000;

for(int j = 0; j < n; j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
And
for(int j = 0; j < n; j++){
    a1[j] += b1[j];
}
for(int j = 0; j < n; j++){
    c1[j] += d1[j];
}
The Consideration
Considering the OP's original question about the two variants of the for loops, and his amended question regarding the behavior of caches, along with many of the other excellent answers and useful comments, I'd like to try to do something different here by taking a different approach to this situation and problem.
The Approach
Considering the two loops and all of the discussion about cache and page filing, I'd like to take another approach by looking at this from a different perspective: one that doesn't involve the cache and page files, nor the executions to allocate memory. In fact, this approach doesn't even concern the actual hardware or the software at all.
The Perspective
After looking at the code for a while, it became quite apparent what the problem is and what is generating it. Let's break this down into an algorithmic problem and look at it from the perspective of using mathematical notations, then apply an analogy to the math problems as well as to the algorithms.
What We Do Know
We know that this loop will run 100,000 times. We also know that a1, b1, c1, and d1 are pointers on a 64-bit architecture. Within C++ on a 32-bit machine, all pointers are 4 bytes, and on a 64-bit machine they are 8 bytes in size, since pointers are of a fixed length.
We know that we have 32 bytes in which to allocate for in both cases. The only difference is that we are allocating 32 bytes, or 2 sets of 2-8 bytes, on each iteration, whereas in the 2nd case we are allocating 16 bytes for each iteration for both of the independent loops.
Both loops still equal 32 bytes in total allocations. With this information let's now go ahead and show the general math, algorithms, and analogy of these concepts.
We do know the number of times that the same set or group of operations that will have to be performed in both cases. We do know the amount of memory that needs to be allocated in both cases. We can assess that the overall workload of the allocations between both cases will be approximately the same.
What We Don't Know
We do not know how long it will take for each case unless we set a counter and run a benchmark test. However, the benchmarks were already included in the original question and in some of the answers and comments as well; we can see a significant difference between the two, and this is the whole reasoning for this proposal regarding this problem.
Let's Investigate
It is already apparent that many have done this by looking at the heap allocations, benchmark tests, RAM, cache, and page files. Looking at specific data points and specific iteration indices was also included, and the various conversations about this specific problem have many people starting to question other related things about it. How do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.
Our Assertions:
- We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000 instead of starting with 0 as in the loops, for we don't need to worry about the 0-indexing scheme of memory addressing, since we are just interested in the algorithm itself.
- In both cases we have 4 functions to work with and 2 function calls, with 2 operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).
The Algorithms:
1st Case: Only one summation, but two independent function calls.
Sum n=1 : [1,100000] = F1(), F2();
F1() = { f(a) = f(a) + f(b); }
F2() = { f(c) = f(c) + f(d); }
2nd Case: Two summations, but each has its own function call.
Sum1 n=1 : [1,100000] = F1();
F1() = { f(a) = f(a) + f(b); }
Sum2 n=1 : [1,100000] = F1();
F1() = { f(c) = f(c) + f(d); }
If you noticed, F2() only exists in Sum from Case 1, whereas F1() is contained in Sum from Case 1 and in both Sum1 and Sum2 from Case 2. This will be evident later on when we begin to conclude that there is an optimization happening within the second algorithm.
The iterations through the first case's Sum call f(a), which adds f(b) to itself; then it calls f(c), which does the same but adds f(d) to itself, for each of the 100000 iterations. In the second case, we have Sum1 and Sum2, and both act the same as if they were the same function being called twice in a row.
In this case we can treat Sum1 and Sum2 as just plain old Sum, where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization, where we can just consider it to be the same function.
Summary with Analogy
With what we have seen in the second case, it almost appears as if there is optimization, since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c), and f(d). In both cases, and in the comparison between the two, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.
Think of the For Loops as being the Summations that do the iterations, as being a Boss that is giving orders to two people A & B, whose jobs are to meet C & D respectively, and to pick up some package from them and return it. In this analogy, the for loops or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss is not from the actual mathematical algorithms directly, but from the actual concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has 1 scope where the 2nd algorithm has 2 consecutive scopes.
Within the first case, on each call slip the Boss goes to A and gives the order, and A goes off to fetch B's package; then the Boss goes to C and gives the orders to do the same and receive the package from D, on each iteration.
Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.
Since we are working with an 8-byte pointer and dealing with heap allocation, let's consider the following problem. Let's say that the Boss is 100 feet from A, and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases, the Boss initially travels from A first, then to B. This analogy isn't to say that this distance is exact; it is just a useful test case scenario to show the workings of the algorithms.
In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.
The Test Cases:
First Case: On the first iteration the Boss has to initially go 100 feet to give the order slip to A, and A goes off and does his thing; but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration, and every other iteration after, the Boss has to go back and forth 500 feet between the two.
Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C, because C is 500 feet from A. Since this Boss (Summation, For Loop) is being called right after working with A, he then just waits there as he did with A until all of C's order slips are done.
The Difference In Distances Traveled
const n = 100000

distTraveledOfFirst = (100 + 500) + ((n-1) * (500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999 * 1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600;
// Distance Traveled On First Algorithm = 99,999,600 ft

distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600 ft
The Comparison of Arbitrary Values
We can easily see that 600 is far less than roughly 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM, or from which cache or page file, each call on each iteration is going to be, due to many other unseen variables. This is just an assessment of the situation to be aware of, looking at it from the worst-case scenario.
From these numbers it would almost appear as if Algorithm One should be 99% slower than Algorithm Two; however, this is only the Boss's part or responsibility of the algorithms, and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the loop. So the Boss's job only accounts for about 15-40% of the total work being done. The bulk of the work that is done through the workers has a slightly bigger impact towards keeping the ratio of the speed rate differences to about 50-70%.
The Observation: The differences between the two algorithms
In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient, both from the partial optimization of having a similar function declaration and definition (where it is only the variables that differ by name) and from the distance traveled.
We also see that the total distance traveled in Case 1 is much farther than it is in Case 2, and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerably more work to do than Case 2 does.
This is observable from the evidence of the ASM instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the Boss will have to wait for both A & C to get back before he can go back to A again for the next iteration. It also doesn't account for the fact that if A or B is taking an extremely long time, then both the Boss and the other worker(s) are idle, waiting to execute.
In Case 2, the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.
The OP's Amended Question(s)
EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:
Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?
It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.
Regarding These Questions
As I have demonstrated without a doubt, there is an underlying issue even before the hardware and software become involved.
Now as for the management of memory and caching along with page files, etc., which all work together in an integrated set of systems among the following:
- The Architecture { hardware, firmware, some embedded drivers, kernels and ASM instruction sets }.
- The OS { file and memory management systems, drivers and the registry }.
- The Compiler { translation units and optimizations of the source code }.
- And even the Source Code itself with its set(s) of distinctive algorithms.
We can already see that there is a bottleneck happening within the first algorithm before we even apply it to any machine with any arbitrary Architecture, OS, and Programmable Language, compared to the second algorithm. There already existed a problem before involving the intrinsics of a modern computer.
The Ending Results
However, it is not to say that these new questions are not of importance, because they themselves are and they do play a role after all. They do impact the procedures and the overall performance, and that is evident with the various graphs and assessments from many who have given their answer(s) and/or comment(s).
If you paid attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively, and consider the mathematical notations of the two algorithms in question, you can see that even without the involvement of the computer hardware and software, Case 2 is approximately 60% faster than Case 1.
When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.
If the Data set is fairly small, it may not seem all that bad of a difference at first. However, since Case 1 is about 60-70% slower than Case 2, we can look at the growth of this function in terms of the differences in time executions:
DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
//where
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)
This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.
When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than algorithm 2, which is evident when the Boss has to travel back and forth the maximum distance between A & C for every iteration after the first, while in Algorithm 2 the Boss has to travel to A once and then, after being done with A, travel the maximum distance only one time, when going from A to C.
Trying to have the Boss focus on doing two similar things at once and juggling them back and forth, instead of focusing on similar consecutive tasks, is going to make him quite angry by the end of the day, since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your boss get into an interpolated bottleneck, because the boss's spouse and children wouldn't appreciate it.
Amendment: Software Engineering Design Principles
-- The difference between Local Stack and Heap Allocated computations within iterative for loops, and the difference between their usages, efficiencies, and effectiveness --
The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.
- Consecutive Stack Operations:
  - If the loops are performing operations on data locally within a single code block or scope that is within the stack frame, it will still sort of apply, but the memory locations are much closer, where they are typically sequential, and the difference in distance traveled or execution time is almost negligible. Since there are no allocations being done within the heap, the memory isn't scattered, and the memory isn't being fetched through RAM. The memory is typically sequential and relative to the stack frame and stack pointer.
  - When consecutive operations are being done on the stack, a modern processor will cache repetitive values and addresses, keeping these values within local cache registers. The time of operations or instructions here is on the order of nanoseconds.
- Consecutive Heap Allocated Operations:
  - When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, then depending on the architecture of the CPU, the bus controller, and the RAM modules, the time of operations or execution can be on the order of micro- to milliseconds. In comparison to cached stack operations, these are quite slow.
  - The CPU will have to fetch the memory address from RAM, and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.
So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.
It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.
This is where Software Engineering and Software Architecture Design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.
You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant, just because of the above issue that is seen in the O(n) complexity of the algorithm when working with the heap.
From what I've noticed over the years many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.
If you want true optimization, yes, it might seem like code duplication, but generally it would be more efficient to have two variants of the same algorithm: one for stack operations, and the other for heap operations that are performed in iterative loops!
Here's a pseudo example: two simple structs, one algorithm.
struct A {
    int data;
    A() : data{0}{}
    A(int a) : data{a}{}
};

struct B {
    int data;
    B() : data{0}{}
    B(int b) : data{b}{}
};

template<typename T>
void Foo( T& t ) {
    // do something with t
}

// some looping operation: first stack then heap.

// stack data:
A dataSetA[10] = {};
B dataSetB[10] = {};

// For stack operations this is okay and efficient
for (int i = 0; i < 10; i++) {
    Foo(dataSetA[i]);
    Foo(dataSetB[i]);
}

// If the two data sets were on the heap instead, then performing
// the same algorithm on both within the same loop
// will create that bottleneck
A* heapSetA = new A[10];   // separate names from the stack arrays above
B* heapSetB = new B[10];
for (int i = 0; i < 10; i++) {
    Foo(heapSetA[i]); // heapSetA is on the heap here
    Foo(heapSetB[i]); // heapSetB is on the heap here
} // this will be inefficient.

// To improve the efficiency above, put them into separate loops...

for (int i = 0; i < 10; i++) {
    Foo(heapSetA[i]);
}
for (int i = 0; i < 10; i++) {
    Foo(heapSetB[i]);
}
// This will be much more efficient than above.
// This is still only a sketch to illustrate a point.
This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much; it's the looping structures that you will use them in that do.
Answered by mathengineer
It may be old C++ and optimizations. On my computer I obtained almost the same speed:
One loop: 1.577 ms
Two loops: 1.507 ms
I ran Visual Studio 2015 on an E5-1620 3.5 GHz processor with 16 GB RAM.