C++ floating-point division and precision

Question

Asked by Nick Gotch

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is, when I do this math in C++ (GCC) the result I get is 0.998047, which is a rounded value. I'd prefer to just get the truncated value of 0.998046, how can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;

Answer 1

Answered by Dietrich Epp

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

A float is stored not as a decimal number but as a binary one. If you divide a number by a power of 2, such as 512, the result will almost always be exact. The precision of a float is not simply 7 digits; it is really 23 bits of precision (24 if you count the implicit leading bit).

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Answer 2

Answered by AProgrammer

I also know that the precision of floats is 7 digits.

No. The most common floating-point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility of using truncation instead of rounding. One way would be to ask for one more digit and ignore it.

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.

Answer 3

Answered by Clifford

That 'rounded' value is most likely what is displayed through some output method rather than what is actually stored. Check the actual value in your debugger.

With iostream and stdio, you can specify the precision of the output. If you specify 7 significant digits, convert the value to a string, then truncate the string before display, you will get the output without rounding.

I can't think of a reason why you would want to do that, however, and given the subsequent explanation of the application, you'd be better off using double precision, though that will most likely simply shove the problems somewhere else.

Answer 4

Answered by Olof Forshell

Your question is not unique; it has been answered numerous times before. This is not a simple topic, and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512, you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, or implicit bit set (to 1) multiplied by 2^0, i.e. equal to 1. Since 511/512 is less than 1, you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get 9 ones, 111111111, or in floating point with implicit bit i.11111111, which we can divide by 512 and put together with the exponent of -1, giving i.1111111100... * 2^-1.

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've represented the ninth bit (eighth explicit). Sum them up and you get 0.998046875. From the i.11111111 we find that this number represents 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get a fraction, (2^24 - 1)/(2^24), with 24 binary and 24 decimal digits of precision. Given appropriate printf formatting, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2, which comes out to 7.875. i11 is the integer part and 111 the fraction part (binary 111/1000, or 7/8ths). That is 6 binary digits of precision and 4 decimal.

Thinking decimal when doing floating-point is utterly detrimental to understanding it. Free yourself!

Answer 5

Answered by Shamim Hafiz

If you are just interested in the value, you could use double and then multiply the result by 10^6 and floor it. Divide again by 10^6 and you will get the truncated value.
