C++ 32-bit to 16-bit Floating Point Conversion

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the CC BY-SA license, cite the original address, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1659440/

Date: 2020-08-27 20:48:02  Source: igfitidea

32-bit to 16-bit Floating Point Conversion

Tags: c++, networking, ieee-754

Asked by Matt Fichman

I need a cross-platform library/algorithm that will convert between 32-bit and 16-bit floating point numbers. I don't need to perform math with the 16-bit numbers; I just need to decrease the size of the 32-bit floats so they can be sent over the network. I am working in C++.


I understand how much precision I would be losing, but that's OK for my application.


The IEEE 16-bit format would be great.


Accepted answer by Alex Martelli

std::frexp extracts the significand and exponent from normal floats or doubles -- then you need to decide what to do with exponents that are too large to fit in a half-precision float (saturate...?), adjust accordingly, and put the half-precision number together. This article has C source code showing how to perform the conversion.

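A rough sketch of that route (the helper name is invented here; it handles normal finite values only, saturating overflow to Inf and flushing tiny values to zero -- NaN inputs and subnormal outputs would need extra cases):

```cpp
#include <cmath>   // frexp, fabs, lround, signbit
#include <cstdint> // uint16_t

// Hypothetical helper sketching the frexp approach; not fully IEEE-correct.
uint16_t half_from_float_frexp(float f) {
    uint16_t sign = std::signbit(f) ? 0x8000 : 0;
    if (f == 0.0f) return sign;
    int e;
    float m = std::frexp(std::fabs(f), &e); // f = m * 2^e, with m in [0.5, 1)
    int biased = e + 14;                    // half's bias is 15: (e - 1) + 15
    if (biased >= 31) return sign | 0x7c00; // too large: saturate to Inf
    if (biased <= 0)  return sign;          // too small: flush to zero (no subnormals)
    uint16_t mant = (uint16_t)std::lround((2.0f * m - 1.0f) * 1024.0f);
    if (mant == 1024) {                     // rounding overflowed the mantissa
        mant = 0;
        if (++biased >= 31) return sign | 0x7c00;
    }
    return sign | (uint16_t)(biased << 10) | mant;
}
```

For example, 1.0f encodes to 0x3C00 and values beyond half's range saturate to the Inf pattern 0x7C00.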

Answer by Phernost

Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it's branch-less. It makes use of the fact that -true == ~0 to perform branchless selections (GCC converts if statements into an unholy mess of conditional jumps, while Clang just converts them to conditional moves.)


Update (2019-11-04): reworked to support single and double precision values with fully correct rounding. I also put a corresponding if statement above each branchless select as a comment for clarity. All incoming NaNs are converted to the base quiet NaN for speed and sanity, as there is no way to reliably convert an embedded NaN message between formats.


#include <cstdint>     // uint32_t, uint64_t, etc.
#include <cstring>     // memcpy
#include <climits>     // CHAR_BIT
#include <limits>      // numeric_limits
#include <type_traits> // is_integral_v, is_floating_point_v
#include <utility>     // forward

// Pre-C++20 stand-in for std::bit_cast (note: adding names to namespace std
// is technically undefined behavior, but this shim works on major compilers).
namespace std
{
  template< typename T , typename U >
  T bit_cast( U&& u ) {
    static_assert( sizeof( T ) == sizeof( U ) );
    union { T t; }; // prevent construction
    std::memcpy( &t, &u, sizeof( t ) );
    return t;
  }
} // namespace std

template< typename T > struct native_float_bits;
template<> struct native_float_bits< float >{ using type = std::uint32_t; };
template<> struct native_float_bits< double >{ using type = std::uint64_t; };
template< typename T > using native_float_bits_t = typename native_float_bits< T >::type;

static_assert( sizeof( float ) == sizeof( native_float_bits_t< float > ) );
static_assert( sizeof( double ) == sizeof( native_float_bits_t< double > ) );

template< typename T, int SIG_BITS, int EXP_BITS >
struct raw_float_type_info {
  using raw_type = T;

  static constexpr int sig_bits = SIG_BITS;
  static constexpr int exp_bits = EXP_BITS;
  static constexpr int bits = sig_bits + exp_bits + 1;

  static_assert( std::is_integral_v< raw_type > );
  static_assert( sig_bits >= 0 );
  static_assert( exp_bits >= 0 );
  static_assert( bits <= sizeof( raw_type ) * CHAR_BIT );

  static constexpr int exp_max = ( 1 << exp_bits ) - 1;
  static constexpr int exp_bias = exp_max >> 1;

  static constexpr raw_type sign = raw_type( 1 ) << ( bits - 1 );
  static constexpr raw_type inf = raw_type( exp_max ) << sig_bits;
  static constexpr raw_type qnan = inf | ( inf >> 1 );

  static constexpr auto abs( raw_type v ) { return raw_type( v & ( sign - 1 ) ); }
  static constexpr bool is_nan( raw_type v ) { return abs( v ) > inf; }
  static constexpr bool is_inf( raw_type v ) { return abs( v ) == inf; }
  static constexpr bool is_zero( raw_type v ) { return abs( v ) == 0; }
};
using raw_flt16_type_info = raw_float_type_info< std::uint16_t, 10, 5 >;
using raw_flt32_type_info = raw_float_type_info< std::uint32_t, 23, 8 >;
using raw_flt64_type_info = raw_float_type_info< std::uint64_t, 52, 11 >;
//using raw_flt128_type_info = raw_float_type_info< uint128_t, 112, 15 >;

template< typename T, int SIG_BITS = std::numeric_limits< T >::digits - 1,
  int EXP_BITS = sizeof( T ) * CHAR_BIT - SIG_BITS - 1 >
struct float_type_info 
: raw_float_type_info< native_float_bits_t< T >, SIG_BITS, EXP_BITS > {
  using flt_type = T;
  static_assert( std::is_floating_point_v< flt_type > );
};

template< typename E >
struct raw_float_encoder
{
  using enc = E;
  using enc_type = typename enc::raw_type;

  template< bool DO_ROUNDING, typename F >
  static auto encode( F value )
  {
    using flt = float_type_info< F >;
    using raw_type = typename flt::raw_type;
    static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
    static constexpr auto bit_diff = flt::bits - enc::bits;
    static constexpr auto do_rounding = DO_ROUNDING && sig_diff > 0;
    static constexpr auto bias_mul = raw_type( enc::exp_bias ) << flt::sig_bits;
    if constexpr( !do_rounding ) { // fix exp bias
      // when not rounding, fix exp first to avoid mixing float and binary ops
      value *= std::bit_cast< F >( bias_mul );
    }
    auto bits = std::bit_cast< raw_type >( value );
    auto sign = bits & flt::sign; // save sign
    bits ^= sign; // clear sign
    auto is_nan = flt::inf < bits; // compare before rounding!!
    if constexpr( do_rounding ) {
      static constexpr auto min_norm = raw_type( flt::exp_bias - enc::exp_bias + 1 ) << flt::sig_bits;
      static constexpr auto sub_rnd = enc::exp_bias < sig_diff
        ? raw_type( 1 ) << ( flt::sig_bits - 1 + enc::exp_bias - sig_diff )
        : raw_type( enc::exp_bias - sig_diff ) << flt::sig_bits;
      static constexpr auto sub_mul = raw_type( flt::exp_bias + sig_diff ) << flt::sig_bits;
      bool is_sub = bits < min_norm;
      auto norm = std::bit_cast< F >( bits );
      auto subn = norm;
      subn *= std::bit_cast< F >( sub_rnd ); // round subnormals
      subn *= std::bit_cast< F >( sub_mul ); // correct subnormal exp
      norm *= std::bit_cast< F >( bias_mul ); // fix exp bias
      bits = std::bit_cast< raw_type >( norm );
      bits += ( bits >> sig_diff ) & 1; // add tie breaking bias
      bits += ( raw_type( 1 ) << ( sig_diff - 1 ) ) - 1; // round up to half
      //if( is_sub ) bits = std::bit_cast< raw_type >( subn );
      bits ^= -is_sub & ( std::bit_cast< raw_type >( subn ) ^ bits );
    }
    bits >>= sig_diff; // truncate
    //if( enc::inf < bits ) bits = enc::inf; // fix overflow
    bits ^= -( enc::inf < bits ) & ( enc::inf ^ bits );
    //if( is_nan ) bits = enc::qnan;
    bits ^= -is_nan & ( enc::qnan ^ bits );
    bits |= sign >> bit_diff; // restore sign
    return enc_type( bits );
  }

  template< typename F >
  static F decode( enc_type value )
  {
    using flt = float_type_info< F >;
    using raw_type = typename flt::raw_type;
    static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
    static constexpr auto bit_diff = flt::bits - enc::bits;
    static constexpr auto bias_mul = raw_type( 2 * flt::exp_bias - enc::exp_bias ) << flt::sig_bits;
    raw_type bits = value;
    auto sign = bits & enc::sign; // save sign
    bits ^= sign; // clear sign
    auto is_norm = bits < enc::inf;
    bits = ( sign << bit_diff ) | ( bits << sig_diff );
    auto val = std::bit_cast< F >( bits ) * std::bit_cast< F >( bias_mul );
    bits = std::bit_cast< raw_type >( val );
    //if( !is_norm ) bits |= flt::inf;
    bits |= -!is_norm & flt::inf;
    return std::bit_cast< F >( bits );
  }
};

using flt16_encoder = raw_float_encoder< raw_flt16_type_info >;

template< typename F >
auto quick_encode_flt16( F && value )
{ return flt16_encoder::encode< false >( std::forward< F >( value ) ); }

template< typename F >
auto encode_flt16( F && value )
{ return flt16_encoder::encode< true >( std::forward< F >( value ) ); }

template< typename F = float, typename X >
auto decode_flt16( X && value )
{ return flt16_encoder::decode< F >( std::forward< X >( value ) ); }

Of course full IEEE support isn't always needed. If your values don't require logarithmic resolution approaching zero, then linearizing them to a fixed point format is much faster, as was already mentioned.


Answer by user2459387

Half to float (quick and lossy: no Inf/NaN/subnormal handling):

uint32_t bits = ((uint32_t)(h & 0x8000) << 16) | (((h & 0x7c00) + 0x1C000) << 13) | ((h & 0x03FF) << 13);
float f;
std::memcpy(&f, &bits, sizeof f); // bit copy; the original type-punning cast is UB

Float to half:

uint32_t x;
std::memcpy(&x, &f, sizeof f);
uint16_t h = ((x >> 16) & 0x8000) | ((((x & 0x7f800000) - 0x38000000) >> 13) & 0x7c00) | ((x >> 13) & 0x03ff);

Answer by Artelius

Given your needs (-1000, 1000), perhaps it would be better to use a fixed-point representation.


// change 20000 to SHRT_MAX if you don't mind whole numbers
// being turned into fractional ones
const int compact_range = 20000;

short compactFloat(double input) {
    return round(input * compact_range / 1000);
}
double expandToFloat(short input) {
    return ((double)input) * 1000 / compact_range;
}

This will give you accuracy to the nearest 0.05. If you change 20000 to SHRT_MAX you'll get a bit more accuracy, but some whole numbers will end up as fractions on the other end.

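That 0.05 bound can be checked with a quick round-trip (the codec is reproduced from above with std:: qualification and an explicit cast; roundTripError is a name invented here):

```cpp
#include <cmath> // round, fabs

// Fixed-point codec from the answer above.
const int compact_range = 20000;

short compactFloat(double input) {
    return (short)std::round(input * compact_range / 1000);
}
double expandToFloat(short input) {
    return ((double)input) * 1000 / compact_range;
}

// Worst-case round-trip error is half a quantization step:
// step = 1000.0 / compact_range = 0.05, so |error| <= 0.025.
double roundTripError(double v) {
    return std::fabs(expandToFloat(compactFloat(v)) - v);
}
```

For instance, 123.456 quantizes to 2469 and comes back as 123.45, well within the 0.025 bound.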

Answer by tsalter

If you're sending a stream of information across, you could probably do better than this, especially if everything is in a consistent range, as your application seems to have.


Send a small header that consists of just a float32 minimum and maximum; then you can send your values across as 16-bit interpolation values between the two. Since you also say precision isn't much of an issue, you could even send 8 bits at a time.


At reconstruction time, your value would be something like:


float t = _t / numeric_limits<unsigned short>::max();  // With casting, naturally ;)
float val = h.min + t * (h.max - h.min);
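A minimal end-to-end sketch of that scheme -- the Header struct and the pack/unpack names are invented here for illustration, since the answer only describes the idea:

```cpp
#include <cstdint>
#include <limits>

// Sent once, ahead of the quantized stream (hypothetical layout).
struct Header { float min, max; };

uint16_t pack(float v, const Header& h) {
    float t = (v - h.min) / (h.max - h.min);      // normalize to [0, 1]
    if (t < 0.0f) t = 0.0f;                       // clamp out-of-range values
    if (t > 1.0f) t = 1.0f;
    return (uint16_t)(t * std::numeric_limits<uint16_t>::max() + 0.5f);
}

float unpack(uint16_t q, const Header& h) {
    float t = (float)q / std::numeric_limits<uint16_t>::max();
    return h.min + t * (h.max - h.min);
}
```

With a (0, 1000) range, the worst-case quantization error is about 1000 / 65535 / 2, roughly 0.008.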

Hope that helps.


-Tom


Answer by Christian Rau

This question is already a bit old, but for the sake of completeness, you might also take a look at this paper for half-to-float and float-to-half conversion.


They use a branchless table-driven approach with relatively small look-up tables. It is completely IEEE-conformant and even beats Phernost's IEEE-conformant branchless conversion routines in performance (at least on my machine). But of course his code is much better suited to SSE and is not that prone to memory latency effects.


Answer by awdz9nld

This conversion for 16-to-32-bit floating point is quite fast for cases where you do not have to account for infinities or NaNs, and can accept denormals-as-zero (DAZ). I.e. it is suitable for performance-sensitive calculations, but you should beware of division by zero if you expect to encounter denormals.


Note that this is most suitable for x86 or other platforms that have conditional moves or "set if" equivalents.


  1. Strip the sign bit off the input
  2. Align the most significant bit of the mantissa to the 22nd bit
  3. Adjust the exponent bias
  4. Set bits to all-zero if the input exponent is zero
  5. Re-insert sign bit

The reverse applies for single-to-half-precision, with some additions.


#include <cstdint> // uint16_t, uint32_t
#include <cstring> // memcpy

void float32(float* __restrict out, const uint16_t in) {
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = in & 0x7fffu;                      // Non-sign bits
    t2 = in & 0x8000u;                      // Sign bit
    t3 = in & 0x7c00u;                      // Exponent

    t1 <<= 13;                              // Align mantissa on MSB
    t2 <<= 16;                              // Shift sign bit into position

    t1 += 0x38000000;                       // Adjust bias

    t1 = (t3 == 0 ? 0 : t1);                // Denormals-as-zero

    t1 |= t2;                               // Re-insert sign bit

    std::memcpy(out, &t1, sizeof t1);       // Type-pun safely via memcpy
}

void float16(uint16_t* __restrict out, const float in) {
    uint32_t inu;
    std::memcpy(&inu, &in, sizeof inu);    // Type-pun safely via memcpy
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = inu & 0x7fffffffu;                // Non-sign bits
    t2 = inu & 0x80000000u;                // Sign bit
    t3 = inu & 0x7f800000u;                // Exponent

    t1 >>= 13;                             // Align mantissa on MSB
    t2 >>= 16;                             // Shift sign bit into position

    t1 -= 0x1c000;                         // Adjust bias

    t1 = (t3 < 0x38800000u) ? 0 : t1;      // Flush-to-zero (below half's min normal)
    t1 = (t3 > 0x47000000u) ? 0x7bff : t1; // Clamp-to-max (above half's max finite)
    t1 = (t3 == 0 ? 0 : t1);               // Denormals-as-zero

    t1 |= t2;                              // Re-insert sign bit

    *out = (uint16_t)t1;
}

Note that you can change the constant 0x7bff to 0x7c00 for it to overflow to infinity.


See GitHub for the source code.


Answer by Ian Ollmann

Most of the approaches described in the other answers here either do not round correctly on conversion from float to half, throw away subnormals (a problem, since 2**-14 becomes your smallest non-zero number), or do unfortunate things with Inf/NaN. Inf is also a problem because the largest finite number in half is a bit less than 2^16. OpenEXR was unnecessarily slow and complicated, last I looked at it. A fast, correct approach will use the FPU to do the conversion, either as a direct instruction or by using the FPU rounding hardware to make the right thing happen. Any half-to-float conversion should be no slower than a 2^16-element lookup table.

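That 2^16-entry table idea can be sketched as follows (names are invented here; a plain portable reference converter fills the table once, then every conversion is a single load; assumes the standard binary16 layout, including Inf/NaN and subnormals):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Straightforward reference half -> float converter used only to build the table.
static float half_to_float_ref(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                            // +/- zero
        } else {                                    // subnormal: renormalize
            int e = -1;
            do { ++e; mant <<= 1; } while (!(mant & 0x400));
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | ((mant & 0x3ff) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7f800000 | (mant << 13);    // Inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // normal: rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Precompute every possible half value once; conversion is then one table load.
static const std::vector<float> half_table = [] {
    std::vector<float> t(65536);
    for (uint32_t h = 0; h < 65536; ++h) t[h] = half_to_float_ref((uint16_t)h);
    return t;
}();

inline float half_to_float(uint16_t h) { return half_table[h]; }
```

The table costs 256 KiB, which is why the paper's smaller split tables (and the hardware instructions below) usually win in practice.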

The following are hard to beat:


On OS X / iOS, you can use vImageConvert_PlanarFtoPlanar16F and vImageConvert_Planar16FtoPlanarF. See Accelerate.framework.


Intel Ivy Bridge added instructions for this (F16C). See f16cintrin.h. Similar instructions were added to the ARM ISA for NEON. See vcvt_f32_f16 and vcvt_f16_f32 in arm_neon.h. On iOS you will need to use the arm64 or armv7s arch to get access to them.


Answer by Ondřej Čertík

This code converts a 32-bit floating point number to 16-bits and back.


#include <x86intrin.h>
#include <iostream>

int main()
{
    float f32;
    unsigned short f16;
    f32 = 3.14159265358979323846;
    f16 = _cvtss_sh(f32, 0);
    std::cout << f32 << std::endl;
    f32 = _cvtsh_ss(f16);
    std::cout << f32 << std::endl;
    return 0;
}

I tested with the Intel icpc 16.0.2:


$ icpc a.cpp

g++ 7.3.0:


$ g++ -march=native a.cpp

and clang++ 6.0.0:


$ clang++ -march=native a.cpp

It prints:


$ ./a.out
3.14159
3.14062

Documentation about these intrinsics is available at:


https://software.intel.com/en-us/node/524287


https://clang.llvm.org/doxygen/f16cintrin_8h.html


Answer by ErmIg

I have found an implementation of conversion from half-float to single-float format and back using AVX2. It is much faster than a software implementation of these algorithms. I hope it will be useful.


32-bit float to 16-bit float conversion:


#include <immintrin.h> // _mm256_cvtps_ph (F16C)
#include <cassert>     // assert
#include <cstdint>     // uint16_t
#include <cstddef>     // size_t

inline void Float32ToFloat16(const float * src, uint16_t * dst)
{
    _mm_storeu_si128((__m128i*)dst, _mm256_cvtps_ph(_mm256_loadu_ps(src), 0));
}

void Float32ToFloat16(const float * src, size_t size, uint16_t * dst)
{
    assert(size >= 8);

    size_t fullAlignedSize = size&~(32-1);
    size_t partialAlignedSize = size&~(8-1);

    size_t i = 0;
    for (; i < fullAlignedSize; i += 32)
    {
        Float32ToFloat16(src + i + 0, dst + i + 0);
        Float32ToFloat16(src + i + 8, dst + i + 8);
        Float32ToFloat16(src + i + 16, dst + i + 16);
        Float32ToFloat16(src + i + 24, dst + i + 24);
    }
    for (; i < partialAlignedSize; i += 8)
        Float32ToFloat16(src + i, dst + i);
    if(partialAlignedSize != size)
        Float32ToFloat16(src + size - 8, dst + size - 8);
}

16-bit float to 32-bit float conversion:


#include <immintrin.h> // _mm256_cvtph_ps (F16C)
#include <cassert>     // assert
#include <cstdint>     // uint16_t
#include <cstddef>     // size_t

inline void Float16ToFloat32(const uint16_t * src, float * dst)
{
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)src)));
}

void Float16ToFloat32(const uint16_t * src, size_t size, float * dst)
{
    assert(size >= 8);

    size_t fullAlignedSize = size&~(32-1);
    size_t partialAlignedSize = size&~(8-1);

    size_t i = 0;
    for (; i < fullAlignedSize; i += 32)
    {
        Float16ToFloat32(src + i + 0, dst + i + 0);
        Float16ToFloat32(src + i + 8, dst + i + 8);
        Float16ToFloat32(src + i + 16, dst + i + 16);
        Float16ToFloat32(src + i + 24, dst + i + 24);
    }
    for (; i < partialAlignedSize; i += 8)
        Float16ToFloat32(src + i, dst + i);
    if (partialAlignedSize != size)
        Float16ToFloat32(src + size - 8, dst + size - 8);
}