Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/9601427/

Time: 2020-08-27 13:03:00  Source: igfitidea

Is inline assembly language slower than native C++ code?

c++ c performance assembly

Asked by user957121

I tried to compare the performance of inline assembly language and C++ code, so I wrote a function that adds two arrays of size 2000, repeated 100000 times. Here's the code:

#define TIMES 100000
void calcuC(int *x,int *y,int length)
{
    for(int i = 0; i < TIMES; i++)
    {
        for(int j = 0; j < length; j++)
            x[j] += y[j];
    }
}


void calcuAsm(int *x,int *y,int lengthOfArray)
{
    __asm
    {
        mov edi,TIMES
        start:
        mov esi,0
        mov ecx,lengthOfArray
        label:
        mov edx,x
        push edx
        mov eax,DWORD PTR [edx + esi*4]
        mov edx,y
        mov ebx,DWORD PTR [edx + esi*4]
        add eax,ebx
        pop edx
        mov [edx + esi*4],eax
        inc esi
        loop label
        dec edi
        cmp edi,0
        jnz start
    };
}

Here's main():

int main() {
    bool errorOccured = false;
    setbuf(stdout,NULL);
    int *xC,*xAsm,*yC,*yAsm;
    xC = new int[2000];
    xAsm = new int[2000];
    yC = new int[2000];
    yAsm = new int[2000];
    for(int i = 0; i < 2000; i++)
    {
        xC[i] = 0;
        xAsm[i] = 0;
        yC[i] = i;
        yAsm[i] = i;
    }
    clock_t start = clock();
    calcuC(xC,yC,2000);

    //    calcuAsm(xAsm,yAsm,2000);
    //    for(int i = 0; i < 2000; i++)
    //    {
    //        if(xC[i] != xAsm[i])
    //        {
    //            cout<<"xC["<<i<<"]="<<xC[i]<<" "<<"xAsm["<<i<<"]="<<xAsm[i]<<endl;
    //            errorOccured = true;
    //            break;
    //        }
    //    }
    //    if(errorOccured)
    //        cout<<"Error occurs!"<<endl;
    //    else
    //        cout<<"Works fine!"<<endl;

    clock_t end = clock();

    //    cout<<"time = "<<(float)(end - start) / CLOCKS_PER_SEC<<"\n";

    cout<<"time = "<<end - start<<endl;
    return 0;
}

Then I ran the program five times to get the clock ticks reported by clock(), which can be treated as time. In each run I called only one of the functions mentioned above.
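As a rough illustration of this kind of measurement, here is a minimal timing sketch. It is not the asker's harness: the helper name bestTimeMs and the choice of std::chrono::steady_clock are assumptions made for this example. It runs a callable several times and keeps the fastest run, which tends to be less noisy than a single reading.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the original post): runs `fn` `runs` times
// and returns the fastest wall-clock time in milliseconds.
template <typename Fn>
double bestTimeMs(Fn&& fn, int runs = 5) {
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        fn();  // the code under test
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = t1 - t0;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

It could be called as bestTimeMs([&] { calcuC(xC, yC, 2000); }), once per function under test.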

And here comes the result.

Function of assembly version:

Debug   Release
---------------
732        668
733        680
659        672
667        675
684        694
Average:   677

Function of C++ version:

Debug     Release
-----------------
1068      168
 999      166
1072      231
1002      166
1114      183
Average:  182

The C++ code in release mode is almost 3.7 times faster than the assembly code. Why?

I guess that the assembly code I wrote is not as efficient as the code generated by GCC. It's hard for an ordinary programmer like me to write code faster than what a compiler generates. Does that mean I should not trust the performance of assembly language written by my own hands, focus on C++, and forget about assembly language?

Answered by Adriano Repetti

Yes, most times.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). That's not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it's a fictional example or a single routine, not a true program of 500,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time, and only rarely will you need to write assembly code for a few short, heavily used, performance-critical routines, or when you have to access features your favorite high-level language does not expose. Do you want a taste of this complexity? Read this awesome answer here on SO.

Why this?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they will do them in seconds (when we may need days).

When you code in assembly you have to write well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization such as register allocation, constant propagation, common subexpression elimination, instruction scheduling, and other complex, non-obvious optimizations (the Polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.

For some complex microcontrollers even system libraries are written in C instead of assembly because their compilers produce better (and easier to maintain) final code.

Compilers can sometimes use MMX/SIMDx instructions automatically, and if you don't use them you simply can't compare (other answers have already reviewed your assembly code very well). For loops alone, this is a short list of loop optimizations that a compiler commonly checks for (do you think you could do all of it yourself when your schedule has been decided for a C# program?). If you write something in assembly, I think you have to consider at least some simple optimizations. The textbook example for arrays is to unroll the loop (its size is known at compile time). Do it and run your test again.
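To make the unrolling idea concrete, here is a small C++ sketch of a 4x manually unrolled array add. The function name addUnrolled4 and the unroll factor are assumptions chosen for illustration, not something taken from the question; an optimizing compiler may well do this (or better) on its own.

```cpp
#include <cstddef>

// Illustrative only: adds y into x, four elements per loop iteration,
// then handles any leftover elements one at a time.
void addUnrolled4(int* x, const int* y, std::size_t length) {
    std::size_t j = 0;
    for (; j + 4 <= length; j += 4) {
        x[j]     += y[j];
        x[j + 1] += y[j + 1];
        x[j + 2] += y[j + 2];
        x[j + 3] += y[j + 3];
    }
    for (; j < length; ++j) {  // remainder when length is not a multiple of 4
        x[j] += y[j];
    }
}
```

The unrolled body does the same work with a quarter of the loop-control overhead; whether that is measurably faster depends on the CPU and the compiler settings.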

These days it's also really uncommon to need assembly language for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you may use PGO, but in assembly you will need deep knowledge of that specific architecture (and you must rethink and redo everything for another architecture). For small tasks the compiler usually does it better, and for complex tasks the work usually isn't repaid (and the compiler may do better anyway).

If you sit down and take a look at your code, you'll probably see that you gain more by redesigning your algorithm than by translating it to assembly (read this great post here on SO). There are high-level optimizations (and hints to the compiler) you can effectively apply before you need to resort to assembly language. It's probably worth mentioning that by using intrinsics you will often get the performance gain you're looking for, and the compiler will still be able to perform most of its optimizations.
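As a hedged sketch of what "using intrinsics" could look like for the array add in the question, the following uses the SSE2 intrinsics from <emmintrin.h> when they are available and falls back to plain scalar code otherwise. The function name addArraysSimd and the HAVE_SSE2 macro are hypothetical names introduced for this example.

```cpp
#include <cstddef>

// HAVE_SSE2 is a hypothetical macro for this sketch: 1 when SSE2 is enabled.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Adds y into x. With SSE2, four 32-bit ints are added per instruction.
void addArraysSimd(int* x, const int* y, std::size_t length) {
    std::size_t j = 0;
#if HAVE_SSE2
    for (; j + 4 <= length; j += 4) {
        // Unaligned loads/stores, so no alignment requirement on x and y.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + j));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + j));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(x + j), _mm_add_epi32(a, b));
    }
#endif
    for (; j < length; ++j) {  // scalar tail (or the whole array without SSE2)
        x[j] += y[j];
    }
}
```

Unlike inline assembly, the compiler still handles register allocation and scheduling around these calls.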

All this said, even when you can produce assembly code that is 5 to 10 times faster, you should ask your customers whether they prefer to pay for one week of your time or to buy a faster CPU that costs $50 more. Extreme optimization, more often than not (and especially in LOB applications), is simply not required of most of us.

Answered by Gunther Piez

Your assembly code is suboptimal and may be improved:

  • You are pushing and popping a register (EDX) in your inner loop. This should be moved out of the loop.
  • You reload the array pointers in every iteration of the loop. This should be moved out of the loop.
  • You use the loop instruction, which is known to be dead slow on most modern CPUs (possibly a result of using an ancient assembly book*).
  • You take no advantage of manual loop unrolling.
  • You don't use available SIMD instructions.
So unless you vastly improve your skill-set regarding assembler, it doesn't make sense for you to write assembler code for performance.

*Of course I don't know if you really got the loop instruction from an ancient assembly book. But you almost never see it in real-world code, as every compiler out there is smart enough not to emit loop; you only see it in (IMHO) bad and outdated books.

Answered by Matthieu M.

Even before delving into assembly, there are code transformations that exist at a higher level.

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
  for (int i = 0; i < TIMES; i++) {
    for (int j = 0; j < length; j++) {
      x[j] += y[j];
    }
  }
}

can be transformed via Loop Rotation (loop interchange) into:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
    for (int j = 0; j < length; ++j) {
      for (int i = 0; i < TIMES; ++i) {
        x[j] += y[j];
      }
    }
}

which is much better as far as memory locality goes.

This could be optimized further: doing a += b X times is equivalent to doing a += X * b, so we get:

static int const TIMES = 100000;

void calcuC(int *x, int *y, int length) {
    for (int j = 0; j < length; ++j) {
      x[j] += TIMES * y[j];
    }
}

however it seems my favorite optimizer (LLVM) does not perform this transformation.

[edit] I found that the transformation is performed if we add the restrict qualifier to x and y. Indeed, without this restriction, x[j] and y[j] could alias the same location, which makes this transformation erroneous. [end edit]
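A small sketch may help show why aliasing matters here. The two functions below mirror the repeated-add and multiplied versions above, but take the repeat count as a `times` parameter (an assumption made so the example stays small; the original hard-codes TIMES). The names addRepeated and addMultiplied are hypothetical.

```cpp
// Repeated-add version: the same shape as the original calcuC.
void addRepeated(int* x, const int* y, int length, int times) {
    for (int i = 0; i < times; ++i) {
        for (int j = 0; j < length; ++j) {
            x[j] += y[j];
        }
    }
}

// Multiplied version: only equivalent when x and y do not overlap.
void addMultiplied(int* x, const int* y, int length, int times) {
    for (int j = 0; j < length; ++j) {
        x[j] += times * y[j];
    }
}
```

With distinct arrays both give the same result. If the same array is passed as both x and y, the repeated version doubles each element on every pass (1 becomes 8 after three passes), while the multiplied version computes 1 + 3 * 1 = 4. The compiler cannot rule that case out unless it is told the pointers do not alias, which is what restrict (or the common __restrict extension in C++ compilers) promises.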

Anyway, this is, I think, the optimized C version. Already it is much simpler. Based on this, here is my crack at ASM (I let Clang generate it; I am useless at it):

calcuAsm:                               # @calcuAsm
.Ltmp0:
    .cfi_startproc
# BB#0:
    testl   %edx, %edx
    jle .LBB0_2
    .align  16, 0x90
.LBB0_1:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
    imull   $100000, (%rsi), %eax   # imm = 0x186A0
    addl    %eax, (%rdi)
    addq    $4, %rsi
    addq    $4, %rdi
    decl    %edx
    jne .LBB0_1
.LBB0_2:                                # %._crit_edge
    ret
.Ltmp1:
    .size   calcuAsm, .Ltmp1-calcuAsm
.Ltmp2:
    .cfi_endproc

I am afraid I don't understand where all those instructions come from; however, you can always have fun and see how it compares... but I'd still use the optimized C version rather than the assembly one in real code: it's much more portable.

Answered by Oliver Charlesworth

Short answer: yes.

Long answer: yes, unless you really know what you're doing, and have a reason to do so.

Answered by sasha

I have fixed my asm code:

  __asm
{   
    mov ebx,TIMES
 start:
    mov ecx,lengthOfArray
    mov esi,x
    shr ecx,1
    mov edi,y
label:
    movq mm0,QWORD PTR[esi]
    paddd mm0,QWORD PTR[edi]
    add edi,8
    movq QWORD PTR[esi],mm0
    add esi,8
    dec ecx 
    jnz label
    dec ebx
    jnz start
    emms                ; clear MMX state so later x87 floating-point code works
};

Results for Release version:

 Function of assembly version: 81
 Function of C++ version: 161

The assembly code in release mode is almost 2 times faster than the C++.

Answered by jalf

Does that mean I should not trust the performance of assembly language written by my hands

Yes, that is exactly what it means, and it is true for every language. If you don't know how to write efficient code in language X, then you should not trust your ability to write efficient code in X. And so, if you want efficient code, you should use another language.

Assembly is particularly sensitive to this, because, well, what you see is what you get. You write the specific instructions that you want the CPU to execute. With high-level languages, there is a compiler in between, which can transform your code and remove many inefficiencies. With assembly, you're on your own.

Answered by fortran

The only reason to use assembly language nowadays is to use some features not accessible by the language.

This applies to:

  • Kernel programming that needs access to certain hardware features such as the MMU.
  • High-performance programming that uses very specific vector or multimedia instructions not supported by your compiler.
But current compilers are quite smart: they can even replace two separate statements like d = a / b; r = a % b; with a single instruction that calculates the division and remainder in one go if one is available, even though C has no such operator.
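As a small illustration of the pattern described here, the sketch below writes the quotient and remainder as two separate statements and, for comparison, obtains the same pair from the standard library's std::div. The QuotRem struct and both function names are hypothetical. On x86 the integer divide instruction produces both values at once, so an optimizing compiler will typically emit a single divide for the first function; that is an expectation about typical compilers, not a guarantee.

```cpp
#include <cstdlib>

// Hypothetical helper type holding both results of an integer division.
struct QuotRem {
    int quot;
    int rem;
};

// Two separate statements, as in the text above.
QuotRem divmodPlain(int a, int b) {
    int d = a / b;
    int r = a % b;
    return {d, r};
}

// The same pair obtained through the standard library.
QuotRem divmodStd(int a, int b) {
    std::div_t qr = std::div(a, b);
    return {qr.quot, qr.rem};
}
```

Both functions truncate toward zero, so for example -17 divided by 5 gives a quotient of -3 and a remainder of -2.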

Answered by fortran

It is true that a modern compiler does an amazing job at code optimization, yet I would still encourage you to keep on learning assembly.

First of all, you are clearly not intimidated by it, and that's a great, great plus. Next, you're on the right track by profiling in order to validate or discard your speed assumptions, you are asking for input from experienced people, and you have the greatest optimizing tool known to mankind: a brain.

As your experience increases, you'll learn when and where to use it (usually the tightest, innermost loops in your code, after you have deeply optimized at an algorithmic level).

For inspiration I would recommend you look up Michael Abrash's articles (if you haven't heard of him, he is an optimization guru; he even collaborated with John Carmack on the optimization of the Quake software renderer!).

"there ain't no such thing as the fastest code" - Michael Abrash

Answered by sasha

I have changed asm code:

 __asm
{ 
    mov ebx,TIMES
 start:
    mov ecx,lengthOfArray
    mov esi,x
    shr ecx,2           ; ecx = length/4, yet the loop below adds only one int per pass
    mov edi,y
label:
    mov eax,DWORD PTR [esi]
    add eax,DWORD PTR [edi]
    add edi,4   
    dec ecx 
    mov DWORD PTR [esi],eax
    add esi,4
    test ecx,ecx
    jnz label
    dec ebx
    test ebx,ebx
    jnz start
};

Results for Release version:

 Function of assembly version: 41
 Function of C++ version: 161

The assembly code in release mode is almost 4 times faster than the C++. IMHO, the speed of assembly code depends on the programmer.

Answered by salaoshi

This is a very interesting topic!
I have replaced MMX with SSE in Sasha's code.
Here are my results:

Function of C++ version:      315
Function of assembly(simply): 312
Function of assembly  (MMX):  136
Function of assembly  (SSE):  62

The assembly code with SSE is 5 times faster than the C++
