
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverFlow. Original question: http://stackoverflow.com/questions/5766882/

Date: 2020-08-28 18:50:37  Source: igfitidea

Why is there no 2-byte float and does an implementation already exist?

Tags: c++, floating-point, 16-bit, half-precision-float

Asked by Samaursa

Assuming I am really pressed for memory and want a smaller range (similar to short vs int). Shader languages already support half for a floating-point type with half the precision (not just convert back and forth for the value to be between -1 and 1, that is, return a float like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?


I am also interested in any (historical?) reasons why there is no 2-byte float.


Accepted answer by T.J. Crowder

Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c


Re: Why float is four bytes: probably because below that, its precision is very limited.


Answered by Kira M. Backes

If you are low on memory, did you consider dropping the float concept? Floats use up a lot of bits just for saving where the decimal point is. You can work around this if you know where you need the decimal point; say you want to save a dollar value, you could just save it in cents:


uint16_t cash = 50000;  // $500.00 stored as an integer number of cents
std::cout << "Cash: $" << (cash / 100) << "." << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;

That is of course only an option if it's possible for you to predetermine the position of the decimal point. But if you can, always prefer it, because this also speeds up all calculations!


rgds, Kira :-)


Answered by phuclv

TL;DR: 16-bit floats do exist, and there are various software as well as hardware implementations.


There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, anyone who knows the spec can obviously write an implementation. Some examples:

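As a sketch of what such an implementation involves, here is a minimal decoder for the IEEE-754 binary16 bit layout (1 sign bit, 5-bit exponent, 10-bit fraction). This is illustrative only; real libraries also implement the encode direction with correct rounding and are far more optimized:

```cpp
#include <cstdint>
#include <cmath>

// Decode an IEEE-754 binary16 bit pattern into a float.
float half_to_float(uint16_t h) {
    uint32_t sign = (h >> 15) & 0x1;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t frac = h & 0x3FF;
    float value;
    if (exp == 0) {
        // Zero or subnormal: frac * 2^-24 (== frac/1024 * 2^-14)
        value = std::ldexp(static_cast<float>(frac), -24);
    } else if (exp == 31) {
        // Infinity or NaN
        value = frac ? std::nanf("") : INFINITY;
    } else {
        // Normal: (1 + frac/1024) * 2^(exp - 15), bias is 15
        value = std::ldexp(1.0f + frac / 1024.0f, static_cast<int>(exp) - 15);
    }
    return sign ? -value : value;
}
```

For example, the bit pattern 0x3C00 decodes to 1.0 and 0x7BFF decodes to 65504, the largest finite binary16 value.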

Or if you don't want to use them, you can also design a different 16-bit float format and implement it.




2-byte floats are generally not used, because even float's precision is not enough for normal operations, and double should always be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when used without a suffix in C and C-like languages.

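The literal rule can be verified at compile time; this snippet is just an illustration of it:

```cpp
#include <type_traits>

// An unsuffixed floating-point literal is double; the f suffix gives
// float, and the L suffix gives long double.
static_assert(std::is_same_v<decltype(1.0),  double>,      "unsuffixed literal is double");
static_assert(std::is_same_v<decltype(1.0f), float>,       "f suffix gives float");
static_assert(std::is_same_v<decltype(1.0L), long double>, "L suffix gives long double");
```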

However, less-than-32-bit floats do exist. They're mainly used for storage purposes, like in graphics where 96 bits per pixel (32 bits per channel * 3 channels) would be far too wasteful, and are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0 as well as some GPUs like the Radeon R300 and R420 have a 24-bit float format. A 24-bit float is also supported by compilers in some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful but, due to their simplicity, they're often taught in computer science curricula. Besides, a small float is also used in ARM's instruction encoding for small floating-point immediates.


The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a 5-bit exponent and an 11-bit significand (10 bits explicitly stored plus an implicit leading bit).


Some compilers had support for IEEE-754 binary16, but mainly for conversion or vectorized operations and not for computation (because they're not precise enough). For example, ARM's toolchain has __fp16, which comes in 2 variants, IEEE and alternative, depending on whether you want more range or NaN/inf representations. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64


Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which is a simple truncation of the top 16 bits of IEEE-754 binary32, became common.

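The truncation is simple enough to sketch in a few lines. Note that this is an illustration of the idea only; production implementations typically round to nearest even rather than truncate, and the function names here are made up:

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 encode: keep the sign, the full 8-bit exponent, and the top
// 7 mantissa bits of a binary32 value by dropping the low 16 bits.
uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// bfloat16 decode: restore the dropped low mantissa bits as zeros.
float bfloat16_to_float(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Because bfloat16 keeps binary32's 8-bit exponent, the round trip preserves range (no overflow to infinity for large values) at the cost of mantissa precision.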

The motivation behind the reduced mantissa is derived from Google's experiments, which showed that it is fine to reduce the mantissa so long as it's still possible to represent tiny values close to zero as part of the summation of small differences during training. A smaller mantissa brings a number of other advantages, such as reducing the multiplier power and physical silicon area.

  • float32: 24² = 576 (100%)
  • float16: 11² = 121 (21%)
  • bfloat16: 8² = 64 (11%)


Many compilers like GCC and ICC now also gained the ability to support bfloat16.


More information about bfloat16:


Answered by dan04

There is an IEEE 754 standard for 16-bit floats.


It's a new format, having been standardized in 2008 based on a GPU released in 2002.


Answered by Phil H

To go a bit further than Kiralein on switching to integers, we could define a range and permit the integer values of a short to represent equal divisions over the range, with some symmetry if straddling zero:


short mappedval = (short)(val / range * 32767);  // scale [-range, range] onto the short range

Differences between these integer versions and using half precision floats:


  1. Integers are equally spaced over the range, whereas floats are more densely packed near zero
  2. Using integers will use integer math in the CPU rather than floating-point. That is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions etc to retrieve the value at the end.
  3. The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
  4. There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int. This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing where 32 bits are acted on in parallel, which is used in crypto.
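A minimal sketch of such an integer mapping for a range symmetric around zero (the scale factor and function names are illustrative assumptions, not from the original answer):

```cpp
#include <cstdint>
#include <cmath>

// Map a value in [-range, range] onto the full int16_t span.
int16_t quantize(float val, float range) {
    return static_cast<int16_t>(std::lround(val / range * 32767.0f));
}

// Recover an approximation of the original value.
float dequantize(int16_t q, float range) {
    return q / 32767.0f * range;
}
```

The round-trip error is bounded by half a step, i.e. about range / 65534, uniformly over the whole range; this is point 3 above, the predictable absolute precision loss.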

Answered by Alan Corey

There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only true right now, but it probably won't get smaller. Terms like half and long mostly become meaningless with time. With 128- or 256-bit computers they could come to mean anything.


I'm working with images (1+1+1 byte/pixel) and I want to express each pixel's value relative to the average. So floating point or carefully chosen fixed point, but please not 4 times as big as the raw data. A 16-bit float sounds about right.


This GCC 7.3 doesn't know "half", maybe in a C++ context.


Answered by robthebloke

If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:


// needs to be compiled with -mf16c enabled
#include <immintrin.h>
#include <cstdint>
#include <istream>  // for std::istream used in operator >>

struct float16
{
private:
  uint16_t _value;
public:

  inline float16() : _value(0) {}
  inline float16(const float16&) = default;
  inline float16(float16&&) = default;
  inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

  inline float16& operator = (const float16&) = default;
  inline float16& operator = (float16&&) = default;
  inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

  inline operator float () const 
    { return _cvtsh_ss(_value); }

  inline friend std::istream& operator >> (std::istream& input, float16& h) 
  { 
    float f = 0;
    input >> f;
    h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
    return input;
  }
};

Maths is still performed using 32-bit floats (the F16C extension only provides conversions between 16/32-bit floats - no instructions exist to compute arithmetic with 16-bit floats).
