Warning: this Q&A is reproduced from StackOverflow under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/37361145/


Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

Tags: c++, optimization, x86, intel, cpu-architecture

Asked by Cowmoogun

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:


Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.

To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.


The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:


// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm>    // Needed for the "max" function
#include <cmath>
#include <cstdlib>      // rand(), RAND_MAX
#include <iostream>

// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in 
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
  double x = 0.0;
  double y = 0.0;
  double euclid_sq = 0.0;

  // Continue generating two uniform random variables
  // until the square of their "euclidean distance" 
  // is less than unity
  do {
    x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    euclid_sq = x*x + y*y;
  } while (euclid_sq >= 1.0);

  return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}

// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(S_cur - K, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(K - S_cur, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

int main(int argc, char **argv) {
  // First we create the parameter list                                                                               
  int num_sims = 10000000;   // Number of simulated asset paths                                                       
  double S = 100.0;  // Option price                                                                                  
  double K = 100.0;  // Strike price                                                                                  
  double r = 0.05;   // Risk-free rate (5%)                                                                           
  double v = 0.2;    // Volatility of the underlying (20%)                                                            
  double T = 1.0;    // One year until expiry                                                                         

  // Then we calculate the call/put values via Monte Carlo                                                                          
  double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
  double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

  // Finally we output the parameters and prices                                                                      
  std::cout << "Number of Paths: " << num_sims << std::endl;
  std::cout << "Underlying:      " << S << std::endl;
  std::cout << "Strike:          " << K << std::endl;
  std::cout << "Risk-Free Rate:  " << r << std::endl;
  std::cout << "Volatility:      " << v << std::endl;
  std::cout << "Maturity:        " << T << std::endl;

  std::cout << "Call Price:      " << call << std::endl;
  std::cout << "Put Price:       " << put << std::endl;

  return 0;
}

The changes I have made seem to have increased the running time by about a second, but I'm not entirely sure what I can change to stall the pipeline without adding code. A pointer in the right direction would be awesome; I appreciate any responses.




Update: the professor who gave this assignment posted some details


The highlights are:


  • It's a second-semester architecture class at a community college (using the Hennessy and Patterson textbook).
  • The lab computers have Haswell CPUs.
  • The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
  • Any compiler options are allowed, and so is inline asm.
  • Writing your own square-root algorithm was announced as being beyond the pale.

Cowmoogun's comments on the meta thread indicate that it wasn't clear that compiler optimizations could be part of this, that -O0 was assumed, and that a 17% increase in run-time was deemed reasonable.


So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.




Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.


Answer by Peter Cordes

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.


Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.




Problems with the assignment wording and code:


The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to the exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.


I'd love to see an answer that attempted to get a slowdown from re-arranging the expressions to change the dependencies, to reduce ILP just from dependencies (hazards). I haven't attempted it.


Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.


WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (Except for popcnt/lzcnt/tzcnt, which have a false dependency on their destination on Intel CPUs, even though it's write-only; i.e. WAW being handled as a RAW hazard + a write.) For memory ordering, modern CPUs use store queues to delay commit into cache until retirement, also avoiding WAR and WAW hazards.


Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? has more about register renaming and hiding FMA latency in an FP dot-product loop.




The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say "Core i7" when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridgeand later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls and ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)




Summary of the good ideas that diabolical incompetence can justify


Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.


  • Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
  • False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
  • Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
  • Time each iteration independently, with something even heavier than RDTSC, e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
  • Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
  • Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to the scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
  • Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.

Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.




Multi-thread badly


Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your Monte Carlo code has enough parallelism to actually get a speedup, though, especially if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end.) #pragma omp parallel for on that loop would probably be an optimization, not a pessimization.


Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies the use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock inc. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.

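A minimal sketch of that counter-sharing pattern (my own illustrative code, not from the assignment; names are made up): both threads do a contended read-modify-write on one static atomic every iteration, while the total iteration count stays correct.

// Hypothetical sketch: one shared atomic loop counter for all threads.
#include <atomic>
#include <cstdint>
#include <thread>

static std::atomic<uint64_t> shared_i{0};     // static and shared: the whole point

void worker(double* partial_sum, uint64_t num_sims) {
  // Every iteration is a contended RMW on the same cache line.
  while (shared_i.fetch_add(1, std::memory_order_relaxed) < num_sims)
    *partial_sum += 1.0;                      // stand-in for one Monte Carlo step
}

int main() {
  double s0 = 0.0, s1 = 0.0;
  std::thread t0(worker, &s0, 10000000ULL);
  std::thread t1(worker, &s1, 10000000ULL);
  t0.join(); t1.join();
  return 0;
}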

Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at.) There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.

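For illustration only (assumed layout, hypothetical names), this is the kind of struct that puts two threads' "private" RNG state into the same 64-byte cache line:

// Sketch: per-thread LCG state 4 bytes apart, i.e. falsely shared.
#include <thread>

struct RngStates {
  unsigned a;   // thread 0's state
  unsigned b;   // thread 1's state, in the same cache line as a
};

int main() {
  RngStates s{1u, 2u};
  auto spin = [](unsigned* st) {
    for (int i = 0; i < 10000000; i++)
      *st = *st * 1664525u + 1013904223u;   // LCG step: every store dirties the shared line
  };
  std::thread t0(spin, &s.a);
  std::thread t1(spin, &s.b);
  t0.join(); t1.join();
  return 0;
}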



Random uarch-specific ideas:


If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).




Dependency chains:


I think this was one of the intended parts of the assignment.


Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).


To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: the loops as written have very short loop-carried dependency chains: just an FP add (3 cycles). Multiple iterations can have their calculations in flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size = 192 fused-domain uops, and scheduler size = 60 unfused-domain uops.) As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. an independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput).


The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.




Use slower/more FP operations (esp. more division):


Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, with one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs. not pipelined at all on Nehalem (7-22c).)


The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div.)


As @Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating-point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.

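A sketch of both rewrites, applied to the S_adjust expression from the question (function names are mine; compile without -ffast-math so the compiler has to keep this order):

#include <cmath>

double s_adjust_mul(double S, double T, double r, double v) {
  return S * std::exp(T * (r - 0.5 * v * v));   // baseline: multiply by 0.5
}

double s_adjust_div(double S, double T, double r, double v) {
  // "For ease of reading": divide by 2.0 and distribute T. Same real-number
  // result, but now a divsd sits on the dependency chain, plus extra multiplies.
  return S * std::exp(T * r - T * v * v / 2.0);
}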

If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.




If you can use intrinsics (<immintrin.h>)


Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.


Use integer shuffles between FP math operations to cause bypass delays.

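A hedged sketch with SSE2 intrinsics (the shuffle is an identity permutation, so the value is untouched; only the FP->integer->FP domain crossings add latency):

#include <emmintrin.h>

double add_with_bypass_delay(double a, double b) {
  __m128d sum  = _mm_add_sd(_mm_set_sd(a), _mm_set_sd(b));   // FP domain
  __m128i bits = _mm_castpd_si128(sum);                      // reinterpret (free)
  bits = _mm_shuffle_epi32(bits, _MM_SHUFFLE(3, 2, 1, 0));   // integer-domain no-op shuffle
  sum  = _mm_castsi128_pd(bits);                             // back to FP domain
  return _mm_cvtsd_f64(_mm_add_sd(sum, _mm_set_sd(a)));      // FP op pays the bypass delay
}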

Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.

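A sketch of the missing-vzeroupper pattern described above (assumes -mavx and a libm built for legacy SSE; structure and names are illustrative):

#include <immintrin.h>
#include <cmath>

double avx_then_scalar_exp(const double* v) {
  __m256d x = _mm256_loadu_pd(v);
  x = _mm256_mul_pd(x, x);                                 // leaves the upper 128 bits dirty
  double lane0 = _mm_cvtsd_f64(_mm256_castpd256_pd128(x));
  // _mm256_zeroupper();                                   // the fix we're diabolically omitting
  return exp(lane0);                                       // legacy-SSE code inside libm: transition stall
}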

exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so it will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson.)


See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.




If targeting pre-IvB, or especially Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax.




With (inline) asm:


With (inline) asm, you could break the uop cache: a 32B chunk of code that doesn't fit in three 6-uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN using many single-byte nops instead of a couple of long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.


Use self-modifying code to trigger pipeline clears (aka machine-nukes).


LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28-uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.




A common idiom for timing is CPUID (to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up.)

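A sketch of that anti-idiom with GNU inline asm (x86-64 assumed; cpuid waits for everything in flight to complete before rdtsc reads the time-stamp counter):

#include <cstdint>

static inline uint64_t serialize_and_rdtsc() {
  unsigned lo, hi;
  asm volatile("cpuid\n\t"            // serializing instruction: drains the pipeline
               "rdtsc"                // then read the TSC
               : "=a"(lo), "=d"(hi)
               : "a"(0)               // cpuid leaf 0
               : "rbx", "rcx", "memory");
  return ((uint64_t)hi << 32) | lo;
}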



Cause lots of cache misses and other memory slowdowns


Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues.) e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A very good compiler targeting FP math in SSE registers may possibly compile this to an xorps with a constant in another xmm register, but the only way this isn't terrible for x87 is if the compiler realizes that it's negating the value and replaces the next add with a subtract.)

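A sketch of that byte-store sign flip (little-endian x86 assumed): the narrow store into a[7], followed by the full-width reload of d, is what triggers the store-forwarding stall.

union DoubleBytes { double d; unsigned char a[8]; };

double negate_badly(double x) {
  DoubleBytes u;
  u.d = x;
  u.a[7] ^= 0x80;   // flip the IEEE-754 sign bit with a 1-byte RMW
  return u.d;       // wide reload of data just stored narrowly: forwarding stall
}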



Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.


Replace local vars with members of a big struct, so you can control the memory layout.


Use arrays in the struct for padding (and storing random numbers, to justify their existence).


Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.


Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset, so the check can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see an answer from Stephen Canon, and also a section near the end of @Krazy Glew's answer on the same question. (Andy Glew was one of the architects of Intel's original P6 microarchitecture.)


Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache lines.) Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and page boundaries. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page-split loads, from 100 to 5 cycles (Section 2.1.3), perhaps related to being able to do two page walks in parallel.


A page-split on an atomic<uint64_t> should be just about the worst case, especially if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in an alignas(4096) struct __attribute__((packed)) (to save space, of course), including an array for storage of the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.

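One illustrative way to put those pieces together (GCC/Clang syntax assumed; my field sizes put the counter at offset 4093, so it spans the 4096B page boundary with a 3:5 split):

#include <cstdint>

struct alignas(4096) __attribute__((packed)) EverythingHolder {
  char     rng_results[4093];   // storage for RNG output, "to save space"
  uint64_t counter;             // bytes 4093..4100: crosses the page boundary
};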

If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].

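One way that might coax indexed addressing out of the compiler (a sketch under assumptions: my_data and the runtime index are my own names, and the compiler must not be able to prove the index constant):

double my_data[16];

double sum_step(double x, int idx) {   // idx only known at run time
  my_data[idx] += x;                   // memory operand of the form [base + idx*8]:
  return my_data[idx];                 // indexed mode, un-laminates micro-fused uops on SnB
}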

If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.




Traverse arrays in non-contiguous order


I think we can come up with incompetent justification for introducing an array in the first place: It lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).


For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).


For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles). e.g.


for (int i=0 ; i<512; i++)                      // i = offset within a 4096B page (in doubles)
    for (int j=i ; j<UPPER_BOUND ; j+=512)      // stride of 512 doubles = 4096 bytes
        monte_carlo_step(rng_array[j]);

So the access pattern is 0, 4096, 8192, ...,
8, 4104, 8200, ...
16, 4112, 8208, ...


This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of over columns within a row in the inner loop, as suggested by @JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden-variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.


Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.


This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).


Instead of an array to store the list of results, you could use a linked list. Then every iteration would require a pointer-chasing load (a RAW true-dependency hazard for the load address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a diabolically incompetent allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).

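A sketch of that diabolically incompetent allocator (POSIX mmap assumed; error handling omitted): one page per node, so every next-pointer dereference touches a fresh page.

#include <sys/mman.h>

struct Node { double value; Node* next; };

Node* alloc_node_on_its_own_page() {
  void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return static_cast<Node*>(p);   // node sits at the start of its very own page
}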



These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).


Somewhat off-topic: make the compiler generate worse code / do more work:


Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.


-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh].)


-march=i386 will also pessimize (thanks @Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.


Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses an 8-byte long double, equivalent to double.) Either way, load/store of 10-byte (80-bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32 / fst. Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.


If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.


atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10-byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.




At -O0, breaking up a big expression by assigning parts to temporary vars will cause more stores/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.

-O0,通过将部分分配给临时变量来分解大表达式将导致更多的存储/重新加载。没有volatile或其他东西,这与真实代码的真实构建将使用的优化设置无关。

C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)

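For illustration (hypothetical names): a byte store through a char* inside an FP loop forces the accumulator back to memory every iteration, because the compiler must assume tag could alias *acc:

double accumulate(double* acc, const double* xs, char* tag, int n) {
  for (int i = 0; i < n; i++) {
    *acc += xs[i];
    tag[i & 7] = 1;   // char store may alias *acc: store + reload around it, even at -O3
  }
  return *acc;
}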

Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.




Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have greater-than-one latency, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)

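A sketch of that conversion churn (every call pays cvttsd2si + cvtsi2ss + cvtss2sd latency, plus the false dependency on the xmm destination of cvtsi2ss):

double convert_churn(double x) {
  int   i = static_cast<int>(x);       // cvttsd2si
  float f = static_cast<float>(i);     // cvtsi2ss: merges into its xmm destination
  return static_cast<double>(f) + x;   // cvtss2sd, then addsd
}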



Frequently set your CPU affinity to a different CPU (suggested by @Egwor). Diabolical reasoning: you don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system.) Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.


Introducing frequent unnecessary system calls can slow you down no matter what they are. Although some important but simple ones like gettimeofday may be implemented in user-space, with no transition to kernel mode. (glibc on Linux does this with the kernel's help, since the kernel exports code in the vdso.)


For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.


Answer by Jesper Juhl

A few things that you can do to make things perform as badly as possible:


  • Compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.

  • Use std::atomic variables everywhere. This will make them very expensive due to the compiler being forced to insert memory barriers all over the place. And this is something an incompetent person might plausibly do to "ensure thread safety".

  • Make sure to access memory in the worst possible way for the prefetcher to predict (column major vs. row major).

  • To make your variables extra expensive you could make sure they all have 'dynamic storage duration' (heap allocated) by allocating them with new rather than letting them have 'automatic storage duration' (stack allocated).

  • Make sure that all memory you allocate is very oddly aligned, and by all means avoid allocating huge pages, since doing so would be much too TLB efficient.

  • Whatever you do, don't build your code with the compiler's optimizer enabled. And make sure to enable the most expressive debug symbols you can (won't make the code run slower, but it'll waste some extra disk space).


Note: This answer basically just summarizes my comments that @Peter Cordes already incorporated into his very good answer. Suggest he gets your upvote if you only have one to spare :)


Answer by Michas

You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy x87 FPU has support for this.


A few shortcomings of the x87 FPU:


  1. Lack of SIMD, may need more instructions.
  2. Stack based, problematic for superscalar and pipelined architectures.
  3. Separate and quite small set of registers, may need more conversion from other registers and more memory operations.
  4. On the Core i7 there are 3 ports for SSE and only 2 for x87, so the processor can execute fewer instructions in parallel.

Answer by Surt

Late answer, but I don't feel we have abused linked lists and the TLB enough.


Use mmap to allocate your nodes, such that you mostly use the MSBs of the address. This should result in long TLB lookup chains: a page is 12 bits, leaving 52 bits for the translation, or around 5 levels it must traverse each time. With a bit of luck they must go to memory each time for the 5-level lookup, plus 1 memory access to get to your node; the top level will most likely be in cache somewhere, so we can hope for 5× memory accesses. Place each node so that it straddles the worst boundary, so that reading the next pointer causes another 3-4 translation lookups. This might also totally wreck the cache due to the massive amount of translation lookups. Also, the size of the virtual tables might cause most of the user data to be paged to disk for extra time.


When reading from the singly-linked list, make sure to read from the start of the list each time, to cause maximum delay in reading a single number.

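That is, something like this hypothetical accessor, where fetching element k costs k serialized pointer-chasing loads:

struct ListNode { double value; ListNode* next; };

double read_kth(const ListNode* head, int k) {
  const ListNode* p = head;   // restart from the head every single time
  for (int i = 0; i < k; i++)
    p = p->next;              // each load's address depends on the previous load (RAW)
  return p->value;
}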