C++: What is the difference between float and double?

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2386772/

Time: 2020-08-27 23:15:58  Source: igfitidea

What is the difference between float and double?

c++ c floating-point precision

Asked by VaioIsBorn

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?

Answered by kennytm

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits is calculated:

double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) = 7.22 digits

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.



Of course, sometimes even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.



Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.



[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).

Answered by Gregory Pakosz

Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

The C++ standard adds:

The value representation of floating-point types is implementation-defined.

I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The absolute precision of the floating point representation increases as the magnitude decreases (the gap between adjacent representable values shrinks), hence floating point numbers between -1 and 1 are those with the most precision.

Answered by Alok Singhal

Given a quadratic equation: x^2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are r1 = 2.000316228 and r2 = 1.999683772.

Using float and double, we can write a test program:

#include <stdio.h>
#include <math.h>

void dbl_solve(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double sd = sqrt(d);
    double r1 = (-b + sd) / (2.0*a);
    double r2 = (-b - sd) / (2.0*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

void flt_solve(float a, float b, float c)
{
    float d = b*b - 4.0f*a*c;
    float sd = sqrtf(d);
    float r1 = (-b + sd) / (2.0f*a);
    float r2 = (-b - sd) / (2.0f*a);
    printf("%.5f\t%.5f\n", r1, r2);
}   

int main(void)
{
    float fa = 1.0f;
    float fb = -4.0000000f;
    float fc = 3.9999999f;
    double da = 1.0;
    double db = -4.0000000;
    double dc = 3.9999999;
    flt_solve(fa, fb, fc);
    dbl_solve(da, db, dc);
    return 0;
}  

Running the program gives me:

2.00000 2.00000
2.00032 1.99968

Note that the numbers aren't large, but still you get cancellation effects using float.

(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)

Answered by graham.reeds

  • A double is 64 bits and a single-precision float is 32 bits.
  • The double has a bigger mantissa (the significand, i.e. the bits that hold the significant digits of the number).
  • Any inaccuracies will be smaller in the double.

Answered by Dolbz

The size of the numbers involved in the floating-point calculations is not the most relevant thing. It's the calculation being performed that is relevant.

In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using (in binary, even a terminating decimal such as 0.1 recurs). Since double is twice the size of float, the rounding error will be a lot smaller.

The tests may specifically use numbers which would cause this kind of error, and therefore test that you've used the appropriate type in your code.

Answered by Elliscope Fang

I just ran into an error that took me forever to figure out, and it may give you a good example of float precision.

#include <iostream>
#include <iomanip>

int main(){
  for(float t=0;t<1;t+=0.01){
     std::cout << std::fixed << std::setprecision(6) << t << std::endl;
  }
}

The output is

0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999

As you can see after 0.83, the precision runs down significantly.

However, if I declare t as a double, the issue does not show up.

It took me five hours to realize this minor error, which ruined my program.

Answered by Zain Ali

Type float, 32 bits long, has a precision of 7 digits. While it can store values of very large or very small magnitude (up to about +/- 3.4 * 10^38, and down to about 1.2 * 10^-38 for normalized values), it has only 7 significant digits.

Type double, 64 bits long, has a bigger range (magnitudes from about 10^-308 to 10^308) and 15 digits of precision.

Type long double is nominally 80 bits (on x86), though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent range that is just ridiculously huge and should have about 19 digits of precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.

Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions are treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecasts if you just use double.

Answered by N 1.1

Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.

Answered by Tuomas Pelkonen

When using floating point numbers you cannot trust that your local tests will behave exactly the same as the tests that are run on the server side. The environment and the compiler are probably different on your local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions, especially if you try to compare two floating point numbers.

Answered by Johnathan Lau

The built-in comparison operations also behave differently: when you compare two floating point numbers, the difference in data type (i.e. float or double) may result in different outcomes.