Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40354978/
C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
Asked by jeffer son
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They use the same brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with
g++ p14.cpp -o p14
Assembly, p14.asm
section .data
fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even
    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2
    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx
    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret
C++, p14.cpp
#include <iostream>
using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    cout << maxi << endl;
}
I know about the compiler optimizations to improve speed and everything, but I don't see many ways to optimize my assembly solution further (speaking programmatically not mathematically).
The C++ code has a modulus every term and a division every even term, whereas the assembly does only one division per even term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on a 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
- g++ (unoptimized): avg 1272 ms
- g++ -O3: avg 578 ms
- original asm (div): avg 2650 ms
- asm (shr): avg 679 ms
- @johnfound asm, assembled with NASM: avg 501 ms
- @hidefromkgb asm: avg 200 ms
- @hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms
- @Veedrac C++: avg 81 ms with -O3, 305 ms with -O0
Answered by Peter Cordes
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide with specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code? TL;DR: yes, if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. See also: is assembly faster than compiled languages? One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" is in a similar vein.
even:
    mov rbx, 2
    xor rdx, rdx
    div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts: idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above Godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative-inverse code is only slightly larger but much faster.
Helping the compiler
(Summary for this case: use uint64_t n.)
First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using a new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory, which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic shift "round" differently for negative inputs; see the SAR insn set ref manual entry.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And it's then portable to systems where long is only 32-bit (e.g. x86-64 Windows).
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx = count = 1
# rax = uint64_t n
.L9:                        # do{
    lea rcx, [rax+1+rax*2]  # rcx = 3*n + 1
    mov rdi, rax
    shr rdi                 # rdi = n>>1;
    test al, 1              # set flags based on n%2 (aka n&1)
    mov rax, rcx
    cmove rax, rdi          # n = (n%2) ? 3*n+1 : n/2;
    add edx, 1              # ++count;
    cmp rax, 1
    jne .L9                 # }while(n!=1)

cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
- 3-component LEA (3 cycles)
- cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9:                        # do{
    lea rcx, [rax+1+rax*2]  # rcx = 3*n + 1
    shr rax, 1              # n>>=1;  CF = n&1 = n%2
    cmovc rax, rcx          # n = (n&1) ? 3*n+1 : n/2;
    inc edx                 # ++count;
    cmp rax, 1
    jne .L9                 # }while(n!=1)
See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict.) It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs: 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops; otherwise 3-component LEA would be only 2c, like on Bulldozer. (3-component LEA is slower on AMD as well, just not by as much.)
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c of latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on Godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple of registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just the one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data; otherwise you'd have to conditionally increment its counter.
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with a 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:              count vector
.inner_loop:
    vpaddq    ymm1, ymm0, ymm0
    vpaddq    ymm1, ymm1, ymm0
    vpaddq    ymm1, ymm1, set1_epi64(1)  # ymm1 = 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63       # shift the low bit up to the sign bit
    vpsrlq    ymm0, ymm0, 1        # n /= 2

    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vpblendvpd ymm0, ymm0, ymm1, ymm3  # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.
    # ymm0 = updated n in each element.

    vpcmpeqq  ymm1, ymm0, set1_epi64(1)
    vpandn    ymm4, ymm1, ymm4     # zero out elements of ymm4 where the compare was true
    vpaddq    ymm5, ymm5, ymm4     # count++ in elements where n has never been == 1

    vptest    ymm4, ymm4
    jnz .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 xmm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer).
@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry;  // C++ structured like the asm, for illustration only
do {
    n = n*3 + 1;
  loop_entry:
    shift = _tzcnt_u64(n);
    n >>= shift;
    count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF.)
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF based on the input being zero rather than the output. Some compilers do this when you use __builtin_ctzll, even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination, when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake, where Intel removed the false dep for LZCNT/TZCNT, they "un-laminate" indexed addressing modes, while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested + working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes the impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
@Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
Code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium).
Answered by johnfound
Claiming that the C++ compiler can produce more optimal code than a competent assembly-language programmer is a very bad mistake. And especially in this case. A human can always make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
您看到的时间差异是因为问题中的汇编代码在内部循环中远非最佳。
(The below code is 32-bit, but can be easily converted to 64-bit)
(以下代码为 32 位,但可以轻松转换为 64 位)
For example, the sequence function can be optimized to only 5 instructions:
例如,序列函数可以优化为只有 5 条指令:
.seq:
    inc esi                 ; counter
    lea edx, [3*eax+1]      ; edx = 3*n+1
    shr eax, 1              ; eax = n/2
    cmovc eax, edx          ; if CF eax = edx
    jnz .seq                ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
    InitializeAll
    mov ecx, 999999
    xor edi, edi            ; max
    xor ebx, ebx            ; max i

.main_loop:
    xor esi, esi
    mov eax, ecx

.seq:
    inc esi                 ; counter
    lea edx, [3*eax+1]      ; edx = 3*n+1
    shr eax, 1              ; eax = n/2
    cmovc eax, edx          ; if CF eax = edx
    jnz .seq                ; jmp if n<>1

    cmp edi, esi
    cmovb edi, esi
    cmovb ebx, ecx
    dec ecx
    jnz .main_loop

    OutputValue "Max sequence: ", edi, 10, -1
    OutputValue "Max index: ", ebx, 10, -1
    FinalizeAll
    stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.
Answered by gnasher729
For more performance: a simple improvement comes from observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
    n = (3*n + 1) / 2;
    if (n % 2 == 0) {
        do n /= 2; while (n % 2 == 0);
        if (n == 1) break;
    }
}
Here's a big win: if you look at the lowest 8 bits of n, all the steps until you have divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is, in binary your number is ???? 0000 0001, then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
所以所有这些步骤都可以预测,256k + 1 替换为 81k + 1。所有组合都会发生类似的事情。所以你可以用一个大的 switch 语句做一个循环:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 1; break;
...
case 155: n = 729 * k + 445; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.
运行循环直到 n ≤ 128,因为在那个点上,n 可能用少于八次除以 2 就变成 1,而一次执行八步或更多步会使您错过第一次到达 1 的时刻。然后继续"正常"循环,或者准备一张表格,告诉您还需要多少步才能到达 1。
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
附注。我强烈怀疑 Peter Cordes 的建议会使它更快。除了一个条件分支之外,根本没有条件分支,并且除非循环实际结束,否则将正确预测该分支。所以代码会是这样的
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, or 12 bits of n at a time would be faster. For each additional bit, the number of entries in the table doubles, and I expect a slowdown once the tables no longer fit into the L1 cache.
在实践中,您需要测量一次处理 n 的最后 9、10、11 或 12 位是否会更快。每增加一位,表中的条目数就会翻倍,当表不再适合 L1 缓存时,我预计速度会变慢。
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
再注:如果您需要操作次数:在每次迭代中,我们正好进行八次除以 2,以及可变数量的 (3n + 1) 操作,因此计算操作数的一个明显方法是再用一个数组。但我们实际上可以(根据循环的迭代次数)计算出步数。
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
我们可以稍微重新定义这个问题:如果是奇数,则将 n 替换为 (3n + 1) / 2,如果是偶数,则将 n 替换为 n / 2。然后每次迭代都会执行 8 个步骤,但您可以考虑作弊:-) 所以假设有 r 个操作 n <- 3n+1 和 s 个操作 n <- n/2。结果将完全是 n' = n * 3^r / 2^s,因为 n <- 3n+1 意味着 n <- 3n * (1 + 1/3n)。取对数我们发现 r = (s + log2 (n' / n)) / log2 (3)。
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
如果我们执行循环直到 n ≤ 1,000,000 并且有一个预先计算的表格,从任何起点 n ≤ 1,000,000 开始需要多少次迭代,那么如上所述计算 r,四舍五入到最接近的整数,将给出正确的结果,除非 s 真的很大。
回答by hidefromkgb
On a rather unrelated note: more performance hacks!
在一个相当不相关的说明上:更多的性能黑客!
[the first "conjecture" has been finally debunked by @ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
- [even] [odd]
- [odd] [even]
- [even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1:     (N >> 1) + N + 1:
       b10                   b1
        b1                    b
     +   1                +   1
      ----                  ---
      bBb0                  bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
[第一个"猜想"终于被 @ShreevatsaR 揭穿了;已删除]
在遍历序列时,当前元素 N 的 2-邻域中只可能出现 3 种情况(N 在前):
- [偶] [奇]
- [奇] [偶]
- [偶] [偶]
要跳过这 2 个元素,分别需要计算 (N >> 1) + N + 1、((N << 1) + N + 1) >> 1 和 N >> 2。
让我们证明对于情况 (1) 和 (2),都可以使用第一个公式 (N >> 1) + N + 1。
情况 (1) 是显而易见的。情况 (2) 意味着 (N & 1) == 1,所以如果我们(不失一般性地)假设 N 是 2 位长,其位从最高位到最低位依次为 ba,那么 a = 1,并且以下成立:
(N << 1) + N + 1:     (N >> 1) + N + 1:
       b10                   b1
        b1                    b
     +   1                +   1
      ----                  ---
      bBb0                  bBb
其中 B = !b。把第一个结果右移一位,正好得到我们想要的结果。
证毕:(N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1。
如上所证,我们可以用单个三元运算一次遍历序列中的 2 个元素。时间又减少了 2 倍。
The resulting algorithm looks like this:
生成的算法如下所示:
#include <stdio.h>
#include <stdint.h>

uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
Here we compare n > 2
because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
在这里我们与 n > 2 比较,因为如果序列的总长度是奇数,过程可能会停在 2 而不是 1。
[EDIT:]
[编辑:]
Let`s translate this into assembly!
让我们把它翻译成汇编!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
@main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
@loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA @loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA @main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
@itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE @itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE @itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
@outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE @outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
使用这些命令进行编译:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)
请参阅 Peter Cordes在 Godbolt 上的 C 和改进/修正版本的 asm 。(编者注:抱歉把我的东西放在你的答案中,但我的答案达到了 Godbolt 链接 + 文本的 30k 字符限制!)
回答by Mangu Singh Rajpurohit
C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
在从源代码生成机器代码的过程中,C++ 程序被转换为汇编程序。说汇编比 C++ 慢实际上是错误的。此外,生成的二进制代码因编译器而异。因此,智能 C++ 编译器可能会生成比愚蠢的汇编程序代码更优化和更高效的二进制代码。
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
但是,我相信您的分析方法存在某些缺陷。以下是概要分析的一般准则:
- Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
- Your data set must be sufficiently large.
- Your test must run for something more than 5-10 seconds.
- Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.
- 确保您的系统处于正常/空闲状态。停止所有您启动或大量使用 CPU(或通过网络轮询)的正在运行的进程(应用程序)。
- 您的数据集必须足够大。
- 您的测试必须运行超过 5-10 秒。
- 不要只依赖一个样本。进行 N 次测试。收集结果并计算结果的平均值或中位数。
回答by Emanuel Landeholm
For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
对于 Collatz 问题,您可以通过缓存“尾部”来显着提高性能。这是一个时间/内存权衡。请参阅:备忘录(https://en.wikipedia.org/wiki/Memoization)。您还可以研究其他时间/内存权衡的动态编程解决方案。
Example python implementation:
示例python实现:
import sys
inner_loop = 0
def collatz_sequence(N, cache):
global inner_loop
l = [ ]
stop = False
n = N
tails = [ ]
while not stop:
inner_loop += 1
tmp = n
l.append(n)
if n <= 1:
stop = True
elif n in cache:
stop = True
elif n % 2:
n = 3*n + 1
else:
n = n // 2
tails.append((tmp, len(l)))
for key, offset in tails:
if not key in cache:
cache[key] = l[offset:]
return l
def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            # the cached tail is the rest of the sequence, so stop here
            # (was `raise StopIteration`, which PEP 479 turns into a
            # RuntimeError inside generators on Python 3.7+)
            return
if __name__ == "__main__":
le_cache = {}
for n in range(1, 4711, 5):
l = collatz_sequence(n, le_cache)
print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
print("inner_loop = {}".format(inner_loop))
回答by Ped7g
From comments:
来自评论:
But, this code never stops (because of integer overflow) !?! Yves Daoust
但是,这段代码永远不会停止(因为整数溢出)!?!伊夫·达乌斯特
For many numbers it will not overflow.
对于许多数字,它不会溢出。
If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.
如果它会溢出 - 对于那些不幸的初始种子之一,溢出的数字很可能会收敛到 1 而不会再次溢出。
Still this poses interesting question, is there some overflow-cyclic seed number?
这仍然提出了一个有趣的问题,是否有一些溢出循环种子数?
Any simple final converging series starts with a power-of-two value (obvious enough?).
任何简单的最终收敛序列都从 2 的幂开始(这足够明显吧?)。
2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only with 1), but the most optimal solution in the answer will finish due to shr rax producing ZF=1.
2^64 将溢出为零,按照算法这是未定义的无限循环(只有到 1 才结束),但由于 shr rax 会产生 ZF=1,答案中最优化的解决方案仍会结束。
Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number; the next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically that is an undefined state of the algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheapo" through the undefined 0 number).
我们能产生 2^64 吗?如果起始数为 0x5555555555555555,它是奇数,下一个数就是 3n+1,即 0xFFFFFFFFFFFFFFFF + 1 = 0。理论上这是算法的未定义状态,但 johnfound 的优化答案会因在 ZF=1 上退出而恢复。而 Peter Cordes 的 cmp rax,1 则会陷入无限循环(QED 变体 1,经由未定义的 0 这个"廉价"情形)。
How about some more complex number, which will create a cycle without 0?
Frankly, I'm not sure; my Math theory is too hazy to get any serious idea of how to deal with it in a serious way. But intuitively I would say the series will converge to 1 for every number 0 < number, as the 3n+1 formula will sooner or later slowly turn every non-2 prime factor of the original number (or an intermediate) into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.
一些更复杂的数字呢?它会不经过 0 而形成循环吗?
坦率地说,我不确定,我的数学理论太模糊,无法认真地想出如何严肃处理它。但直觉上我会说,对于每个数字 0 < number,该级数都会收敛到 1,因为 3n+1 公式迟早会慢慢地把原始数字(或中间数)的每个非 2 质因子变成 2 的某个幂。所以我们不需要担心原始级数的无限循环,只有溢出会阻碍我们。
So I just put a few numbers into a sheet and took a look at 8-bit truncated numbers.
所以我只是将几个数字放入表格中并查看了 8 位截断的数字。
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).
有三个值会溢出到 0:227、170 和 85(85 直接溢出到 0,另外两个先到达 85)。
But there is no value creating a cyclic overflow seed.
但是没有哪个值会形成循环的溢出种子。
Funnily enough I did a check which is the first number to suffer from 8-bit truncation, and already 27 is affected! It does reach value 9232 in the proper non-truncated series (the first truncated value is 322, in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself); the maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to be counted, etc...).
有趣的是,我检查了哪个是第一个受 8 位截断影响的数字,结果 27 就已经受影响了!它在正确的非截断序列中确实达到了 9232(第一个被截断的值是第 12 步的 322),而 2-255 的任何输入数字在非截断方式下达到的最大值是 13120(对于 255 本身);收敛到 1 的最大步数大约是 128(+-2,不确定"1"是否要计数,等等......)。
Interestingly enough (for me) the number 9232 is the maximum for many other source numbers; what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
有趣的是(对我来说),9232 这个数字对于许多其他源数字也是最大值,它有什么特别之处?:-O 9232 = 0x2410 ... 嗯......不知道。
Unfortunately I can't get any deep grasp of this series: why does it converge, and what are the implications of truncating it to k bits? But with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value ending as 0 after truncation.
不幸的是,我无法深入理解这个级数:它为什么收敛,以及把它截断为 k 位意味着什么?但是在 cmp number,1 终止条件下,当然有可能让算法对某个截断后变为 0 的特定输入值陷入无限循环。
But the value 27 overflowing in the 8-bit case is sort of alarming: this looks like, if you count the number of steps to reach value 1, you will get a wrong result for the majority of numbers from the total k-bit set of integers. For the 8-bit integers, 146 numbers out of 256 have their series affected by truncation (some of them may still hit the correct number of steps by accident, maybe; I'm too lazy to check).
但是 27 在 8 位情况下会溢出这一点有点令人警觉:这看起来像是,如果你计算达到 1 所需的步数,对于 k 位整数全集中的大多数数字你都会得到错误的结果。对于 8 位整数,256 个数字中有 146 个的序列受到截断影响(其中一些也许仍会碰巧得到正确的步数,我懒得检查了)。
回答by Damon
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
你没有发布编译器生成的代码,所以这里有一些猜测,但即使没有看到,也可以这样说:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come expensive.
... 有 50% 的概率错误预测分支,而且代价高昂。
The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.
编译器几乎可以肯定会进行这两种计算(由于 div/mod 的延迟很长,因此乘加是“免费的”,因此成本高得可以忽略不计)并跟进 CMOV。当然,其中被错误预测的可能性为零。
回答by Dmitry Rubanovich
Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>= 1, and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
即使不看汇编,最明显的原因也是 /= 2 很可能被优化为 >>= 1,而许多处理器都有非常快的移位操作。但即使处理器没有移位操作,整数除法也比浮点除法快。
Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup this thread's question asks about, then the compiler optimizing /= 2 as >>= 1 would be the best first place to look.
编辑:对于上面"整数除法比浮点除法快"的说法,实际情况可能有所不同。下面的评论表明,现代处理器优先优化了浮点除法而不是整数除法。所以如果有人在寻找这个帖子的问题所问的加速的最可能原因,那么编译器把 /= 2 优化为 >>= 1 将是最值得首先考虑的地方。
On an unrelated note: if n is odd, the expression n*3 + 1 will always be even. So there is no need to check. You can change that branch to
在一个不相关的注释上:如果 n 是奇数,表达式 n*3 + 1 将始终是偶数,所以没有必要检查。您可以把该分支改成
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
所以整个语句将是
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}
回答by gnasher729
As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
作为通用答案,并非专门针对此任务:在许多情况下,您可以通过在较高级别进行改进来显着加快任何程序的速度。比如一次而不是多次计算数据,完全避免不必要的工作,以最好的方式使用缓存,等等。这些事情在高级语言中要容易得多。
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language; inline assembly is often useful in these cases and still lets you use a high level language.
编写汇编代码,可以改进优化编译器的功能,但这是一项艰巨的工作。而且一旦完成,您的代码将更难修改,因此添加算法改进要困难得多。有时处理器具有高级语言无法使用的功能,内联汇编在这些情况下通常很有用,并且仍然可以让您使用高级语言。
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.
在欧拉问题中,大多数情况下,您通过构建某些东西、找出它慢的原因、构建更好的东西、找到它慢的原因等等来取得成功。使用汇编程序非常非常困难。以可能速度一半的更好算法通常会以全速击败更差的算法,并且在汇编程序中获得全速并非易事。