Is using double faster than float in C++?

Warning: this page is an English/Chinese rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3426165/


Is using double faster than float?

c++ performance x86 intel osx-snow-leopard

Asked by Brent Faust

Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats?

That is, are double operations just as fast or faster than float operations for +, -, *, and /?

Does the answer change for 64-bit architectures?

Answered by Alex Martelli

There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others! But most of them, at the CPU level (specifically within the FPU), are such that the answer to your question:

are double operations just as fast or faster than float operations for +, -, *, and /?

is "yes" -- within the CPU, except for division and sqrt which are somewhat slower for doublethan for float. (Assuming your compiler uses SSE2 for scalar FP math, like all x86-64 compilers do, and some 32-bit compilers depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for double).

是“是”——在 CPU 中,除了比 for慢一些的doublefloat除法和 sqrt 。(假设您的编译器使用 SSE2 进行标量 FP 数学运算,就像所有 x86-64 编译器一样,以及一些 32 位编译器取决于选项。传统 x87 在寄存器中没有不同的宽度,仅在内存中(它在加载/存储时转换) ),所以从历史上看,即使是 sqrt 和 Division 也一样慢double)。

For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv has a throughput of one per 8 to 18 cycles. (Numbers from https://agner.org/optimize/. For division, latency correlates with throughput, but is higher than the throughput numbers.)

The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right. They can use polynomial approximations with fewer terms to get full precision for float vs. double.
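For illustration only, here is a minimal sketch (mine, not from the answer) of how one might time the float overload of std::sin against the double one; whether float wins, and by how much, depends on the libm implementation and compiler flags:

#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 10'000'000;
    volatile float  facc = 0.0f;  // volatile defeats dead-code elimination
    volatile double dacc = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    // Float argument selects the float overload (equivalent to sinf).
    for (int i = 0; i < n; ++i) facc = facc + std::sin(0.5f + i * 1e-7f);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) dacc = dacc + std::sin(0.5 + i * 1e-7);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("sin(float): %.3fs  sin(double): %.3fs\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count());
}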



However, taking up twice the memory for each number clearly implies a heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about the performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
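As a hedged illustration of this point (a sketch of mine with illustrative sizes, not the answer's code): summing N doubles streams twice as many bytes through the memory hierarchy as summing N floats, so for arrays much larger than the caches the float version is often faster even though the arithmetic itself runs at the same speed:

#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

template <typename T>
static double sum_seconds(const std::vector<T>& v, T& out) {
    auto t0 = std::chrono::steady_clock::now();
    out = std::accumulate(v.begin(), v.end(), T(0));  // stream the whole array
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t n = 10'000'000;    // ~40 MB as float, ~80 MB as double
    std::vector<float>  f(n, 1.0f);
    std::vector<double> d(n, 1.0);
    float fs = 0; double ds = 0;
    std::printf("float : %.3fs (sum %.0f)\n", sum_seconds(f, fs), double(fs));
    std::printf("double: %.3fs (sum %.0f)\n", sum_seconds(d, ds), ds);
}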

@Richard's answer points out that there are also other ways to perform FP operations (the SSE/SSE2 instructions; good old MMX was integers-only), especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large ;-) benchmarks, you'll find an advantage in sticking with single precision (assuming, of course, that you don't need the extra bits of precision!-).

Answered by Daniel Trebbien

If all floating-point calculations are performed within the FPU, then, no, there is no difference between a double calculation and a float calculation, because the floating-point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating-point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.

If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.
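For illustration (a minimal sketch of mine, not from the answer), the SSE/SSE2 intrinsics show the 4-vs-2 packing directly: one 128-bit XMM register holds 4 floats but only 2 doubles, so each vector add processes twice as many single-precision elements:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>

int main() {
    alignas(16) float  fa[4] = {1.f, 2.f, 3.f, 4.f};
    alignas(16) float  fb[4] = {5.f, 6.f, 7.f, 8.f};
    alignas(16) float  fr[4];
    alignas(16) double da[2] = {1.0, 2.0};
    alignas(16) double db[2] = {3.0, 4.0};
    alignas(16) double dr[2];

    // One instruction adds 4 floats; the double version adds only 2.
    _mm_store_ps(fr, _mm_add_ps(_mm_load_ps(fa), _mm_load_ps(fb)));
    _mm_store_pd(dr, _mm_add_pd(_mm_load_pd(da), _mm_load_pd(db)));

    std::printf("floats: %g %g %g %g | doubles: %g %g\n",
                fr[0], fr[1], fr[2], fr[3], dr[0], dr[1]);
}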

Answered by Miley

Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to help further speed up the processing. CUDA GPUs need a special package to support double, and the amount of local RAM on a GPU is quite fast, but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.

Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to cache to disk versus pure RAM, your difference will be huge.

So for the application I am working with, the difference is quite important.

因此,对于我正在使用的应用程序,差异非常重要。

Answered by bobobobo

I just want to add to the already existing great answers that the __m256? family of single-instruction-multiple-data (SIMD) C++ intrinsic functions operates on either 4 doubles in parallel (e.g. _mm256_add_pd) or 8 floats in parallel (e.g. _mm256_add_ps).

I'm not sure if this can translate to an actual speed-up, but it seems possible to process 2x as many floats per instruction when SIMD is used.
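A minimal sketch of those intrinsics (mine, not the answer's; compile with AVX enabled, e.g. -mavx): one 256-bit YMM register holds 8 floats or 4 doubles:

#include <immintrin.h>  // AVX intrinsics
#include <cstdio>

int main() {
    __m256  f = _mm256_set1_ps(1.5f);    // 8 single-precision lanes
    __m256d d = _mm256_set1_pd(1.5);     // 4 double-precision lanes

    __m256  fsum = _mm256_add_ps(f, f);  // one instruction, 8 float adds
    __m256d dsum = _mm256_add_pd(d, d);  // one instruction, 4 double adds

    alignas(32) float  fout[8];
    alignas(32) double dout[4];
    _mm256_store_ps(fout, fsum);
    _mm256_store_pd(dout, dsum);
    std::printf("%g %g\n", fout[0], dout[0]);  // both print 3
}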

Answered by Akash Agrawal

In an experiment of adding 3.3 two billion (2,000,000,000) times, the results are:

Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double
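A minimal sketch of the kind of loop presumably being timed here (a reconstruction under assumptions, not the author's actual code). Note that the float total saturates near 6.7e7 (about 2^26) because float runs out of precision there, which matches the "summed value" column above:

#include <chrono>
#include <cstdio>

template <typename T>
static void run(const char* label) {
    T sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < 2'000'000'000LL; ++i)
        sum += static_cast<T>(3.3);      // the repeated addition being timed
    auto t1 = std::chrono::steady_clock::now();
    std::printf("Summation time in s: %g summed value: %g // %s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                static_cast<double>(sum), label);
}

int main() {
    run<float>("float");
    run<double>("double");
    run<long double>("long double");
}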

So double is faster and the default in C and C++. It's more portable and the default across all C and C++ library functions. Also, double has significantly higher precision than float.

Even Stroustrup recommends double over float:

"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."

Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.

Answered by Richard

The only really useful answer is: only you can tell. You need to benchmark your scenarios. Small changes in instruction and memory patterns could have a significant impact.

It will certainly matter whether you are using FPU- or SSE-type hardware (the former does all its work with 80-bit extended precision, so double will be closer; the latter is natively 32-bit, i.e. float).

Update: s/MMX/SSE/ as noted in another answer.

Answered by doron

Floating point is normally an extension to a general-purpose CPU. The speed will therefore depend on the hardware platform used. If the platform has floating-point support, I will be surprised if there is any difference.

Answered by Jedzia

In addition, here is some real benchmark data to give a glimpse:

For Intel 3770k, GCC 9.3.0 -O2 [3]
Run on (8 X 3503 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FloatCreation               0.281 ns        0.281 ns   1000000000
BM_DoubleCreation              0.284 ns        0.281 ns   1000000000
BM_Vector3FCopy                0.558 ns        0.562 ns   1000000000
BM_Vector3DCopy                 5.61 ns         5.62 ns    100000000
BM_Vector3F_CopyDefault        0.560 ns        0.546 ns   1000000000
BM_Vector3D_CopyDefault         5.57 ns         5.56 ns    112178768
BM_Vector3F_Copy123            0.841 ns        0.817 ns    897430145
BM_Vector3D_Copy123             5.59 ns         5.42 ns    112178768
BM_Vector3F_Add                0.841 ns        0.834 ns    897430145
BM_Vector3D_Add                 5.59 ns         5.46 ns    100000000
BM_Vector3F_Mul                0.842 ns        0.782 ns    897430145
BM_Vector3D_Mul                 5.60 ns         5.56 ns    112178768
BM_Vector3F_Compare            0.840 ns        0.800 ns    897430145
BM_Vector3D_Compare             5.61 ns         5.62 ns    100000000
BM_Vector3F_ARRAY_ADD           3.25 ns         3.29 ns    213673844        
BM_Vector3D_ARRAY_ADD           3.13 ns         3.06 ns    224357536        

where operations on 3 floats (F) or 3 doubles (D) are compared, and

  • BM_Vector3XCopy is the pure copy of a (1,2,3)-initialized vector, not repeated before the copy,
  • BM_Vector3X_CopyDefault repeats default initialization on every copy,
  • BM_Vector3X_Copy123 repeats initialization of (1,2,3),

  • Add/Mul: each initializes 3 vectors (1,2,3) and adds/multiplies the first and second into the third,
  • Compare: checks two initialized vectors for equality,
  • ARRAY_ADD: sums up vector(1,2,3) + vector(3,4,5) + vector(6,7,8) via std::valarray, which in my case leads to SSE instructions.

Remember that these are isolated tests, and the results differ with compiler settings, from machine to machine and from architecture to architecture. With caching (issues) and real-world use cases, this may be completely different. So theory can differ greatly from reality. The only way to find out is a practical test, such as with google-benchmark [1], combined with checking the compiler's output for your particular problem's solution [2].
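For reference, a minimal google-benchmark skeleton in the style of the table above (a sketch; the Vector3F type and the loop body are illustrative stand-ins, not the author's actual code from [3]):

#include <benchmark/benchmark.h>

// Illustrative stand-in for the 3-float vector type measured above.
struct Vector3F { float x, y, z; };

static void BM_Vector3F_Add(benchmark::State& state) {
    for (auto _ : state) {
        Vector3F a{1.f, 2.f, 3.f}, b{1.f, 2.f, 3.f};
        Vector3F c{a.x + b.x, a.y + b.y, a.z + b.z};
        benchmark::DoNotOptimize(c);  // keep the result alive
    }
}
BENCHMARK(BM_Vector3F_Add);

BENCHMARK_MAIN();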

  1. https://github.com/google/benchmark
  2. https://sourceware.org/binutils/docs/binutils/objdump.html → objdump -S
  3. https://github.com/Jedzia/oglTemplate/blob/dd812b72d846ae888238d6f726d503485b796b68/benchmark/Playground/BM_FloatingPoint.cpp