C++: floating point vs integer calculations on modern hardware

Disclaimer: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2550281/

Time: 2020-08-27 23:53:09  Source: igfitidea

Floating point vs integer calculations on modern hardware

Tags: c++, x86, floating-point, x86-64

Asked by maxpenguin

I am doing some performance critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.

Now, I remember reading about how floating point calculations were very slow around the 386 days, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware, but what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?

The programs I used are as follows; they are not identical by any means:

Program 1:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{

    float accum = 0;
    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}

Thanks in advance!

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2 (pasted from a comment below): We have an extensive code base currently. Really, I have come up against the generalization that we "must not use float since integer calculation is faster", and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).

Accepted answer by Dan

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables to performance...especially between integer & floating point math. It varies strongly from processor to processor (even within the same family such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, and others (such as division) take much, much longer.

The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.

I assume that you are asking this question because you are working on a performance critical application. If you are developing for the x86 architecture, and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math).

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.

Answer by vladr

For example (lower numbers are faster):

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

As Dan pointed out, even once you normalize for clock frequency (which can be misleading in itself in pipelined designs), results will vary wildly based on CPU architecture (individual ALU/FPU performance, as well as the actual number of ALUs/FPUs available per core in superscalar designs, which influences how many independent operations can execute in parallel -- the latter factor is not exercised by the code below, as all operations below are sequentially dependent).

Poor man's FPU/ALU operation benchmark:

#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

double
mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("oops");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v  = 0;
  // Do not use constants or repeating values
  //  to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  //  impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);
  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
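
A possible alternative to the hand-rolled mygettime above, assuming a C++11 compiler: std::chrono::steady_clock is monotonic and portable, so the _WIN32 branch is not needed. The names time_seconds and time_add_sub are hypothetical and not part of the original benchmark; this is only a sketch of the same dependent add/sub chain with a different timer. An optimizer may still collapse the integer loop into a closed form, so check the generated code before trusting the numbers.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock seconds, measured with a monotonic clock.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double elapsed = std::chrono::duration<double>(stop - start).count();
    assert(elapsed >= 0.0);  // steady_clock never goes backwards
    return elapsed;
}

// Same kind of dependent add/sub chain as the benchmark above.
// `iterations` is a parameter only to keep the sketch easy to try out.
template <typename Type>
double time_add_sub(Type v0, Type v1, unsigned long iterations, Type& result) {
    Type v = 0;
    const double seconds = time_seconds([&] {
        for (unsigned long i = 0; i < iterations; ++i) {
            v += v0;
            v -= v1;
        }
    });
    result = v;  // hand the value back so the loop is not dead code
    return seconds;
}
```

Calling time_add_sub<float>(v0, v1, 100000000UL, out) for each type of interest would give figures comparable in spirit to the tables above, though not necessarily the same values.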

Answer by Ben Voigt

There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs FPU is completely irrelevant. Instead, the number of integer and floating-point registers (real registers, not register names) on your architecture which are not otherwise used by your computation (e.g. for loop control), the number of elements of each type which fit in a cache line, optimizations possible considering the different semantics for integer vs. floating point math -- these effects will dominate. The data dependencies of your algorithm play a significant role here, so that no general comparison will predict the performance gap on your problem.

For example, integer addition is associative, so if the compiler sees a loop like the one you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them when the loop terminates. But with floating point, the compiler has to do the operations in the same order you requested (floating-point addition is not associative, so reordering could change the result, and the compiler has to guarantee the same result), so there's a strong dependency of each addition on the result of the previous one.

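To illustrate the dependency argument, here is a minimal sketch (both function names are hypothetical). sum_sequential adds in source order, so each addition waits for the previous one. sum_four_lanes does by hand what a compiler may do on its own for integers: four independent partial sums, combined at the end. For float the two can differ in the last bits, which is why a compiler will not normally make this transformation by itself unless it is told that reassociation is acceptable (for example with GCC's -ffast-math).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the previous result.
float sum_sequential(const std::vector<float>& data) {
    float total = 0.0f;
    for (float x : data) total += x;
    return total;
}

// Four independent accumulators, combined after the loop.
float sum_four_lanes(const std::vector<float>& data) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const std::size_t n = data.size();
    const std::size_t blocked = n - n % 4;  // largest multiple of 4 <= n
    assert(blocked % 4 == 0);
    for (std::size_t i = 0; i < blocked; i += 4) {
        lane[0] += data[i];
        lane[1] += data[i + 1];
        lane[2] += data[i + 2];
        lane[3] += data[i + 3];
    }
    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (std::size_t i = blocked; i < n; ++i) total += data[i];  // leftover tail
    return total;
}
```

With small whole-number inputs both functions return the same value; with arbitrary data the results may differ slightly because the rounding happens in a different order.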

You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.

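As a rough illustration of the cache argument, the sketch below counts how many operands fit in one cache line. The 64-byte line size is an assumption (common on current x86 parts, but not guaranteed), and the names are hypothetical. Note that the advantage only appears when the fixed-point type is narrower than the floating-point type it replaces: a 32-bit integer takes the same space as a float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kAssumedCacheLineBytes = 64;

// Number of values of type T that fit in one assumed cache line.
template <typename T>
constexpr std::size_t elements_per_cache_line() {
    return kAssumedCacheLineBytes / sizeof(T);
}

static_assert(elements_per_cache_line<std::int16_t>() == 32, "16-bit fixed point");
static_assert(elements_per_cache_line<std::int32_t>() == 16, "32-bit fixed point");
```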

Answer by Potatoswatter

Addition is much faster than rand, so your program is (especially) useless.

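One way to keep the random number generator out of the measurement is to fill the input before the clock starts and time only the accumulation. The sketch below assumes C++11; make_inputs and time_accumulate are hypothetical names, and std::mt19937 is swapped in for rand() purely for illustration. For very large inputs the loop becomes limited by memory traffic rather than arithmetic, so a buffer that fits in cache, summed repeatedly, is probably a fairer test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Fill the input before timing so generation cost is not measured.
std::vector<int> make_inputs(std::size_t count, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 364);  // same range as rand() % 365
    std::vector<int> values(count);
    for (int& v : values) v = dist(gen);
    return values;
}

// Times only the accumulation; Accum is int or float, as in the two programs.
template <typename Accum>
double time_accumulate(const std::vector<int>& values, Accum& accum_out) {
    Accum accum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int v : values) accum += static_cast<Accum>(v);
    const auto stop = std::chrono::steady_clock::now();
    accum_out = accum;  // hand the sum back so the loop is not dead code
    const double seconds = std::chrono::duration<double>(stop - start).count();
    assert(seconds >= 0.0);
    return seconds;
}
```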

You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?

Generally, attempting FP jobs with integer arithmetic is a recipe for slow.

Answer by MrMesees

TIL: this varies (a lot). Here are some results using the GNU compiler (btw, I also checked by compiling on the machines; GNU g++ 5.4 from xenial is a hell of a lot faster than 4.6.3 from linaro on precise).

Intel i7 4700MQ xenial

short add: 0.822491
short sub: 0.832757
short mul: 1.007533
short div: 3.459642
long add: 0.824088
long sub: 0.867495
long mul: 1.017164
long div: 5.662498
long long add: 0.873705
long long sub: 0.873177
long long mul: 1.019648
long long div: 5.657374
float add: 1.137084
float sub: 1.140690
float mul: 1.410767
float div: 2.093982
double add: 1.139156
double sub: 1.146221
double mul: 1.405541
double div: 2.093173

Intel i3 2370M has similar results

short add: 1.369983
short sub: 1.235122
short mul: 1.345993
short div: 4.198790
long add: 1.224552
long sub: 1.223314
long mul: 1.346309
long div: 7.275912
long long add: 1.235526
long long sub: 1.223865
long long mul: 1.346409
long long div: 7.271491
float add: 1.507352
float sub: 1.506573
float mul: 2.006751
float div: 2.762262
double add: 1.507561
double sub: 1.506817
double mul: 1.843164
double div: 2.877484

Intel(R) Celeron(R) 2955U (Acer C720 Chromebook running xenial)

short add: 1.999639
short sub: 1.919501
short mul: 2.292759
short div: 7.801453
long add: 1.987842
long sub: 1.933746
long mul: 2.292715
long div: 12.797286
long long add: 1.920429
long long sub: 1.987339
long long mul: 2.292952
long long div: 12.795385
float add: 2.580141
float sub: 2.579344
float mul: 3.152459
float div: 4.716983
double add: 2.579279
double sub: 2.579290
double mul: 3.152649
double div: 4.691226

DigitalOcean 1GB Droplet Intel(R) Xeon(R) CPU E5-2630L v2 (running trusty)

short add: 1.094323
short sub: 1.095886
short mul: 1.356369
short div: 4.256722
long add: 1.111328
long sub: 1.079420
long mul: 1.356105
long div: 7.422517
long long add: 1.057854
long long sub: 1.099414
long long mul: 1.368913
long long div: 7.424180
float add: 1.516550
float sub: 1.544005
float mul: 1.879592
float div: 2.798318
double add: 1.534624
double sub: 1.533405
double mul: 1.866442
double div: 2.777649

AMD Opteron(tm) Processor 4122 (precise)

short add: 3.396932
short sub: 3.530665
short mul: 3.524118
short div: 15.226630
long add: 3.522978
long sub: 3.439746
long mul: 5.051004
long div: 15.125845
long long add: 4.008773
long long sub: 4.138124
long long mul: 5.090263
long long div: 14.769520
float add: 6.357209
float sub: 6.393084
float mul: 6.303037
float div: 17.541792
double add: 6.415921
double sub: 6.342832
double mul: 6.321899
double div: 15.362536

This uses code from http://pastebin.com/Kx8WGUfg as benchmark-pc.c

g++ -fpermissive -O3 -o benchmark-pc benchmark-pc.c

I've run multiple passes, and the general numbers seem to stay the same.

One notable exception seems to be ALU mul vs FPU mul. Addition and subtraction seem trivially different.

Here is the above in chart form (lower is faster and preferable):

[Figure: chart of the above data]

Update to accommodate @Peter Cordes

https://gist.github.com/Lewiscowles1986/90191c59c9aedf3d08bf0b129065cccc

i7 4700MQ Linux Ubuntu Xenial 64-bit (all patches up to 2018-03-13 applied)
    short add: 0.773049
    short sub: 0.789793
    short mul: 0.960152
    short div: 3.273668
      int add: 0.837695
      int sub: 0.804066
      int mul: 0.960840
      int div: 3.281113
     long add: 0.829946
     long sub: 0.829168
     long mul: 0.960717
     long div: 5.363420
long long add: 0.828654
long long sub: 0.805897
long long mul: 0.964164
long long div: 5.359342
    float add: 1.081649
    float sub: 1.080351
    float mul: 1.323401
    float div: 1.984582
   double add: 1.081079
   double sub: 1.082572
   double mul: 1.323857
   double div: 1.968488
AMD Opteron(tm) Processor 4122 (precise, DreamHost shared hosting)
    short add: 1.235603
    short sub: 1.235017
    short mul: 1.280661
    short div: 5.535520
      int add: 1.233110
      int sub: 1.232561
      int mul: 1.280593
      int div: 5.350998
     long add: 1.281022
     long sub: 1.251045
     long mul: 1.834241
     long div: 5.350325
long long add: 1.279738
long long sub: 1.249189
long long mul: 1.841852
long long div: 5.351960
    float add: 2.307852
    float sub: 2.305122
    float mul: 2.298346
    float div: 4.833562
   double add: 2.305454
   double sub: 2.307195
   double mul: 2.302797
   double div: 5.485736
Intel Xeon E5-2630L v2 @ 2.4GHz (trusty 64-bit, DigitalOcean VPS)
    short add: 1.040745
    short sub: 0.998255
    short mul: 1.240751
    short div: 3.900671
      int add: 1.054430
      int sub: 1.000328
      int mul: 1.250496
      int div: 3.904415
     long add: 0.995786
     long sub: 1.021743
     long mul: 1.335557
     long div: 7.693886
long long add: 1.139643
long long sub: 1.103039
long long mul: 1.409939
long long div: 7.652080
    float add: 1.572640
    float sub: 1.532714
    float mul: 1.864489
    float div: 2.825330
   double add: 1.535827
   double sub: 1.535055
   double mul: 1.881584
   double div: 2.777245

Answer by jcoder

Two points to consider -

Modern hardware can overlap instructions, execute them in parallel and reorder them to make best use of the hardware. Also, any significant floating point program is likely to have significant integer work too, even if it's only calculating indices into arrays, loop counters, etc., so even if you have a slow floating point instruction, it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point is that even if the floating point instructions are slower than the integer ones, your overall program may run faster because it can make use of more of the hardware.

As always, the only way to be sure is to profile your actual program.

The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating point values at the same time. For example, you can load 4 floats into a single SSE register and then perform 4 multiplications on them all in parallel. If you can rewrite parts of your code to use SSE instructions, then it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this; see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.

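A minimal sketch of the "4 floats in one SSE register" idea, assuming an x86 or x86-64 target and a compiler that ships the <xmmintrin.h> header (GCC, Clang and Visual C++ all do). The function name multiply4 is hypothetical. Unaligned loads and stores are used so that the caller does not have to align the arrays.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86/x86-64 target

// Multiplies four pairs of floats with a single packed multiply.
// Each pointer must refer to at least four floats.
void multiply4(const float* a, const float* b, float* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    const __m128 va = _mm_loadu_ps(a);  // unaligned load of 4 floats
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  // 4 products at once
}
```

In practice an optimizing compiler will often generate similar packed instructions from a plain loop over float arrays, so it is worth checking the generated code before writing intrinsics by hand.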

Answer by Goran D

The floating point version will be much slower if there is no remainder operation. Since all the adds are sequential, the CPU will not be able to parallelise the summation, so latency will be critical. FPU add latency is typically 3 cycles, while integer add is 1 cycle. However, the divider for the remainder operator will probably be the critical part, as it is not fully pipelined on modern CPUs. So, assuming the divide/remainder instruction consumes the bulk of the time, the difference due to add latency will be small.

Answer by gnasher729

Today, integer operations are usually a little bit faster than floating point operations. So if you can do a calculation with the same operations in integer and floating point, use integer. HOWEVER you are saying "This causes a whole lot of annoying problems and adds a lot of annoying code". That sounds like you need more operations because you use integer arithmetic instead of floating point. In that case, floating point will run faster because

  • as soon as you need more integer operations, you probably need a lot more, so the slight speed advantage is more than eaten up by the additional operations

  • the floating-point code is simpler, which means it is faster to write the code, which means that if it is speed critical, you can spend more time optimising the code.

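To make the "more operations" point concrete, here is a sketch of a multiply in a hypothetical Q16.16 fixed-point format next to the float equivalent. The format, the type alias and the function names are assumptions chosen for illustration; the code base in the question may use something quite different.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Q16.16 format: 16 integer bits, 16 fractional bits.
using fixed_t = std::int32_t;
constexpr int kFractionBits = 16;

fixed_t to_fixed(double x) { return static_cast<fixed_t>(x * (1 << kFractionBits)); }
double to_double(fixed_t x) { return static_cast<double>(x) / (1 << kFractionBits); }

// One multiply in float becomes widen + multiply + shift + narrow here,
// and the caller still has to think about overflow and rounding.
// Right-shifting a negative value is implementation-defined before C++20;
// mainstream compilers perform an arithmetic shift.
fixed_t fixed_mul(fixed_t a, fixed_t b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fixed_t>(wide >> kFractionBits);
}

// The floating-point counterpart is a single operation.
float float_mul(float a, float b) { return a * b; }
```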

Answer by Artem Sokolov

Unless you're writing code that will be called millions of times per second (e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.

The usual first step for efficiency questions is to profile your code to see where the run time is really spent. The Linux command for this is gprof.

Edit:

Though I suppose you can always implement the line drawing algorithm using integers and floating-point numbers, call it a large number of times and see if it makes a difference:

http://en.wikipedia.org/wiki/Bresenham's_algorithm

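A sketch of what such a comparison could look like, restricted for brevity to lines with x0 <= x1 and a slope between 0 and 1. line_integer is the integer-only Bresenham form; line_float steps y by the slope and rounds, which is a simpler floating-point approach and not Bresenham itself. The names and the Point alias are hypothetical. The two versions can pick different pixels where the ideal line passes exactly halfway between two rows, so only the endpoints are guaranteed to agree.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Integer-only Bresenham, restricted to x0 <= x1 and 0 <= slope <= 1.
std::vector<Point> line_integer(int x0, int y0, int x1, int y1) {
    assert(x0 <= x1 && y0 <= y1 && (y1 - y0) <= (x1 - x0));
    std::vector<Point> points;
    const int dx = x1 - x0;
    const int dy = y1 - y0;
    int error = 2 * dy - dx;  // decision variable
    int y = y0;
    for (int x = x0; x <= x1; ++x) {
        points.emplace_back(x, y);
        if (error > 0) {  // the line has moved closer to the next row
            ++y;
            error -= 2 * dx;
        }
        error += 2 * dy;
    }
    return points;
}

// Floating-point version with the same restriction: add the slope, then round.
std::vector<Point> line_float(int x0, int y0, int x1, int y1) {
    assert(x0 <= x1 && y0 <= y1 && (y1 - y0) <= (x1 - x0));
    std::vector<Point> points;
    const float slope =
        (x1 == x0) ? 0.0f : static_cast<float>(y1 - y0) / static_cast<float>(x1 - x0);
    float y = static_cast<float>(y0);
    for (int x = x0; x <= x1; ++x) {
        points.emplace_back(x, static_cast<int>(std::lround(y)));
        y += slope;
    }
    return points;
}
```

Timing a large number of calls to each function, with the output buffer reused rather than reallocated, would show whether the arithmetic type matters for this workload on a given machine.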

Answer by dan04

I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were:

  • short: 4.260s
  • int: 4.020s
  • long long: 3.350s
  • float: 7.330s
  • double: 7.210s
  • 双倍:7.210s