Half-precision floating-point in Java (translated StackOverflow Q&A)
Source: http://stackoverflow.com/questions/6162651/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Half-precision floating-point in Java
Asked by finnw
Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers, or convert them to and from double precision?
Either of these approaches would be suitable:
- Keep the numbers in half-precision format and compute using integer arithmetic & bit-twiddling (as MicroFloat does for single and double precision)
- Perform all computations in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested conversion functions.)
Edit: conversion needs to be 100% accurate - there are lots of NaNs, infinities and subnormals in the input files.
Related question but for JavaScript: Decompressing Half Precision Floats in Javascript
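Since every approach below manipulates the raw bit layout, it may help to state it up front: IEEE 754 binary16 consists of 1 sign bit, 5 exponent bits (bias 15), and 10 mantissa bits. The following classifier is my own illustrative sketch (class and method names are hypothetical), not from any of the answers:

```java
// Illustrates the IEEE 754 binary16 layout:
// bit 15 = sign, bits 14..10 = exponent (bias 15), bits 9..0 = mantissa
public class HalfLayout {
    public static String describe(int hbits) {
        int sign = (hbits >>> 15) & 0x1;
        int exp  = (hbits >>> 10) & 0x1F;
        int mant = hbits & 0x3FF;
        if (exp == 0x1F) return mant == 0 ? "Inf" : "NaN"; // all-ones exponent
        if (exp == 0)    return mant == 0 ? "zero" : "subnormal";
        return "normal";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x3C00)); // 1.0 -> normal
        System.out.println(describe(0x7C00)); // +Inf
        System.out.println(describe(0x7C01)); // NaN
        System.out.println(describe(0x0001)); // smallest subnormal
    }
}
```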
Answered by x4u
You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert them to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion can be implemented with just a few bit shifts.
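As an illustration of the "few bit shifts" remark, here is a minimal truncating sketch (my own, with hypothetical names). It assumes normal, in-range values only; zero, subnormals, Inf and NaN need the extra handling shown in the full code that follows:

```java
public class HalfTruncate {
    // half -> float: re-bias the exponent (15 -> 127 means adding 112 << 10
    // in half layout, i.e. 0x1C000) and shift the mantissa up 13 bits
    static float toFloatTruncated(int h) {
        return Float.intBitsToFloat(
                ((h & 0x8000) << 16)                 // sign
              | (((h & 0x7C00) + 0x1C000) << 13)     // exponent re-bias
              | ((h & 0x03FF) << 13));               // mantissa
    }

    // float -> half: undo the bias shift and drop the low 13 mantissa bits
    // (truncation, no rounding)
    static int fromFloatTruncated(float f) {
        int bits = Float.floatToIntBits(f);
        return ((bits >>> 16) & 0x8000)
             | (((bits & 0x7FFFFFFF) - 0x38000000) >>> 13);
    }

    public static void main(String[] args) {
        System.out.println(toFloatTruncated(0x3C00)); // 1.0
        System.out.println(fromFloatTruncated(1.5f) == 0x3E00);
    }
}
```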
I have now put a little more effort into it, and it turned out not quite as simple as I expected at the beginning. This version is now tested and verified in every aspect I could imagine, and I'm very confident that it produces the exact results for all possible input values. It supports exact rounding and subnormal conversion in either direction.
// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp = hbits & 0x7c00;             // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}
// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value
    if( val >= 0x47800000 )                    // might be or become NaN/Inf
    {                                          // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                      // is or must become NaN/Inf
            if( val < 0x7f800000 )             // was value but too large
                return sign | 0x7c00;          // make it +/-Inf
            return sign | 0x7c00 |             // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;                  // unrounded not quite Inf
    }
    if( val >= 0x38800000 )                    // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                     // too small for subnormal
        return sign;                           // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;       // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )           // round depending on cut off
        >>> 126 - val );                       // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
I implemented two small extensions compared to the book, because the general precision of 16 bit floats is rather low, which can make the inherent anomalies of floating point formats visually perceivable, whereas in larger floating point types they usually go unnoticed thanks to the ample precision.
The first one is these two lines in the toFloat() function:
if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );
Floating point numbers in the normal range of the type adapt the exponent, and thus the precision, to the magnitude of the value. But this adaptation is not smooth; it happens in steps: switching to the next higher exponent halves the precision. The precision then remains the same for all values of the mantissa until the next jump to the next higher exponent. The extension code above makes these transitions smoother by returning a value that lies in the geometric center of the covered 32 bit float range for the particular half float value. Every normal half float value maps to exactly 8192 32 bit float values, and the returned value is supposed to be exactly in the middle of those values. But at a transition of the half float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space that is only half as large as on the other side. All 8192 of these 32 bit float values map to the same half float value, so converting a half float to 32 bit and back yields the same half float value regardless of which of the 8192 intermediate 32 bit values was chosen. The extension turns the sharp step by a factor of two at the transition into a smoother half step by a factor of sqrt(2), as the diagram below is meant to visualize. You can safely remove these two lines from the code to get the standard behavior.
covered number space on either side of the returned value:
6.0E-8 ####### ##########
4.5E-8 | #
3.0E-8 ######### ########
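The effect of the `| 0x3ff` midpoint trick can be observed directly at the bit level. This throwaway check (my own, not part of the answer) shows what the half value 1.0 (bits 0x3c00) decodes to with and without the extension:

```java
public class MidpointDemo {
    public static void main(String[] args) {
        // With the smooth-transition extension, half 0x3c00 decodes to the
        // middle of its 8192-value 32-bit range: the bits of 1.0f with the
        // low 10 bits of the mantissa filled in (| 0x3ff after the << 13).
        float mid = Float.intBitsToFloat(0x3F800000 | 0x3FF);
        System.out.println(mid); // ~1.000122, as noted in a later answer

        // Without the extension the same half value decodes exactly to 1.0:
        float exact = Float.intBitsToFloat(0x3F800000);
        System.out.println(exact);
    }
}
```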
The second extension is in the fromFloat() function:
{ // avoid Inf due to rounding
    if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        ...
    return sign | 0x7bff; // unrounded not quite Inf
}
This extension slightly extends the number range of the half float format by saving some 32 bit values from getting promoted to Infinity. The affected values are those that would have been smaller than Infinity without rounding and would become Infinity only due to the rounding. You can safely remove the lines shown above if you don't want this extension.
I tried to optimize the path for normal values in the fromFloat() function as much as possible, which made it a bit less readable due to the use of precomputed and unshifted constants. I didn't put as much effort into toFloat() since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use that table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
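The lookup-table idea can be sketched as follows. This is my own sketch (class and method names hypothetical); it uses a copy of the toFloat() above with the two smooth-transition lines removed (i.e. the standard behavior), so the table maps every 16 bit pattern to its exact float value:

```java
public class HalfLookup {
    // one float per 16-bit pattern: 0x10000 entries, filled once at class load
    private static final float[] TABLE = new float[0x10000];
    static {
        for (int i = 0; i < 0x10000; i++) TABLE[i] = toFloat(i);
    }

    // table-based conversion: one array read instead of bit arithmetic
    public static float lookup(int hbits) {
        return TABLE[hbits & 0xFFFF];
    }

    // standard (extension-free) version of the toFloat() from the answer
    private static float toFloat(int hbits) {
        int mant = hbits & 0x03ff;                // 10 bits mantissa
        int exp = hbits & 0x7c00;                 // 5 bits exponent
        if (exp == 0x7c00) {                      // NaN/Inf
            exp = 0x3fc00;
        } else if (exp != 0) {                    // normalized value
            exp += 0x1c000;                       // exp - 15 + 127
        } else if (mant != 0) {                   // subnormal
            exp = 0x1c400;
            do {
                mant <<= 1;
                exp -= 0x400;
            } while ((mant & 0x400) == 0);
            mant &= 0x3ff;
        }                                         // else +/-0 -> +/-0
        return Float.intBitsToFloat((hbits & 0x8000) << 16 | (exp | mant) << 13);
    }

    public static void main(String[] args) {
        System.out.println(lookup(0x3C00)); // 1.0
        System.out.println(lookup(0xC000)); // -2.0
        System.out.println(lookup(0x7C00)); // Infinity
    }
}
```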
I hereby put the code into the public domain.
Answered by buttonius
The code by x4u encodes the value 1 correctly as 0x3c00 (ref: https://en.wikipedia.org/wiki/Half-precision_floating-point_format), but the decoder with the smoothness improvement decodes it to 1.000122. The Wikipedia entry says that the integer values 0..2048 can be represented exactly. Not nice...
Removing the "| 0x3ff" from the toFloat code ensures that toFloat(fromFloat(k)) == k for integer k in the range -2048..2048, probably at the cost of a bit less smoothness.
Answered by dgestep
I created a Java class called HalfPrecisionFloat, which uses x4u's solution. The class has convenience methods and error checking. It goes further and has methods for returning a Double and a Float from the 2 byte half-precision value.
Hopefully this will help someone.
import java.nio.ByteBuffer;

/**
 * Accepts various forms of a half-precision (2 byte) floating point number
 * and contains methods to convert it to a full-precision Float or Double
 * instance.
 * <p>
 * This implementation was inspired by x4u, a user contributing
 * to stackoverflow.com (https://stackoverflow.com/users/237321/x4u).
 *
 * @author dougestep
 */
public class HalfPrecisionFloat {
    private short halfPrecision;
    private Float fullPrecision;

    /**
     * Creates an instance of the class from the supplied byte array.
     * The byte array must be exactly two bytes in length.
     *
     * @param bytes the two-byte byte array.
     */
    public HalfPrecisionFloat(byte[] bytes) {
        if (bytes.length != 2) {
            throw new IllegalArgumentException("The supplied byte array " +
                    "must be exactly two bytes in length");
        }
        final ByteBuffer buffer = ByteBuffer.wrap(bytes);
        this.halfPrecision = buffer.getShort();
    }

    /**
     * Creates an instance of this class from the supplied short number.
     *
     * @param number the number defined as a short.
     */
    public HalfPrecisionFloat(final short number) {
        this.halfPrecision = number;
        this.fullPrecision = toFullPrecision();
    }

    /**
     * Creates an instance of this class from the supplied
     * full-precision floating point number.
     *
     * @param number the float number.
     */
    public HalfPrecisionFloat(final float number) {
        if (number > Short.MAX_VALUE) {
            throw new IllegalArgumentException("The supplied float is too "
                    + "large for a two byte representation");
        }
        if (number < Short.MIN_VALUE) {
            throw new IllegalArgumentException("The supplied float is too "
                    + "small for a two byte representation");
        }
        final int val = fromFullPrecision(number);
        this.halfPrecision = (short) val;
        this.fullPrecision = number;
    }

    /**
     * Returns the half-precision float as a number defined as a short.
     *
     * @return the short.
     */
    public short getHalfPrecisionAsShort() {
        return halfPrecision;
    }

    /**
     * Returns a full-precision floating point number from the
     * half-precision value assigned on this instance.
     *
     * @return the full-precision floating point number.
     */
    public float getFullFloat() {
        if (fullPrecision == null) {
            fullPrecision = toFullPrecision();
        }
        return fullPrecision;
    }

    /**
     * Returns a full-precision double floating point number from the
     * half-precision value assigned on this instance.
     *
     * @return the full-precision double floating point number.
     */
    public double getFullDouble() {
        return (double) getFullFloat();
    }

    /**
     * Returns the full-precision float number from the half-precision
     * value assigned on this instance.
     *
     * @return the full-precision floating point number.
     */
    private float toFullPrecision() {
        int mantissa = halfPrecision & 0x03ff;
        int exponent = halfPrecision & 0x7c00;
        if (exponent == 0x7c00) {
            exponent = 0x3fc00;
        } else if (exponent != 0) {
            exponent += 0x1c000;
            if (mantissa == 0 && exponent > 0x1c400) {
                return Float.intBitsToFloat(
                        (halfPrecision & 0x8000) << 16 | exponent << 13 | 0x3ff);
            }
        } else if (mantissa != 0) {
            exponent = 0x1c400;
            do {
                mantissa <<= 1;
                exponent -= 0x400;
            } while ((mantissa & 0x400) == 0);
            mantissa &= 0x3ff;
        }
        return Float.intBitsToFloat(
                (halfPrecision & 0x8000) << 16 | (exponent | mantissa) << 13);
    }

    /**
     * Returns the integer representation of the supplied
     * full-precision floating point number.
     *
     * @param number the full-precision floating point number.
     * @return the integer representation.
     */
    private int fromFullPrecision(final float number) {
        int fbits = Float.floatToIntBits(number);
        int sign = fbits >>> 16 & 0x8000;
        int val = (fbits & 0x7fffffff) + 0x1000;
        if (val >= 0x47800000) {
            if ((fbits & 0x7fffffff) >= 0x47800000) {
                if (val < 0x7f800000) {
                    return sign | 0x7c00;
                }
                return sign | 0x7c00 | (fbits & 0x007fffff) >>> 13;
            }
            return sign | 0x7bff;
        }
        if (val >= 0x38800000) {
            return sign | val - 0x38000000 >>> 13;
        }
        if (val < 0x33000000) {
            return sign;
        }
        val = (fbits & 0x7fffffff) >>> 23;
        return sign | ((fbits & 0x7fffff | 0x800000)
                + (0x800000 >>> val - 102) >>> 126 - val);
    }
}
And here are the unit tests:
import org.junit.Assert;
import org.junit.Test;

import java.nio.ByteBuffer;

public class TestHalfPrecision {
    private byte[] simulateBytes(final float fullPrecision) {
        HalfPrecisionFloat halfFloat = new HalfPrecisionFloat(fullPrecision);
        short halfShort = halfFloat.getHalfPrecisionAsShort();
        ByteBuffer buffer = ByteBuffer.allocate(2);
        buffer.putShort(halfShort);
        return buffer.array();
    }

    @Test
    public void testHalfPrecisionToFloatApproach() {
        final float startingValue = 1.2f;
        final float closestValue = 1.2001953f;
        final short shortRepresentation = (short) 15565;

        byte[] bytes = simulateBytes(startingValue);
        HalfPrecisionFloat halfFloat = new HalfPrecisionFloat(bytes);
        final float retFloat = halfFloat.getFullFloat();
        Assert.assertEquals(new Float(closestValue), new Float(retFloat));

        HalfPrecisionFloat otherWay = new HalfPrecisionFloat(retFloat);
        final short shrtValue = otherWay.getHalfPrecisionAsShort();
        Assert.assertEquals(new Short(shortRepresentation), new Short(shrtValue));

        HalfPrecisionFloat backAgain = new HalfPrecisionFloat(shrtValue);
        final float backFlt = backAgain.getFullFloat();
        Assert.assertEquals(new Float(closestValue), new Float(backFlt));

        HalfPrecisionFloat dbl = new HalfPrecisionFloat(startingValue);
        final double retDbl = dbl.getFullDouble();
        Assert.assertEquals(new Double(startingValue), new Double(retDbl));
    }

    @Test(expected = IllegalArgumentException.class)
    public void testInvalidByteArray() {
        ByteBuffer buffer = ByteBuffer.allocate(4);
        buffer.putFloat(Float.MAX_VALUE);
        byte[] bytes = buffer.array();
        new HalfPrecisionFloat(bytes);
    }

    @Test(expected = IllegalArgumentException.class)
    public void testInvalidMaxFloat() {
        new HalfPrecisionFloat(Float.MAX_VALUE);
    }

    @Test(expected = IllegalArgumentException.class)
    public void testInvalidMinFloat() {
        new HalfPrecisionFloat(-35000);
    }

    @Test
    public void testCreateWithShort() {
        HalfPrecisionFloat sut = new HalfPrecisionFloat(Short.MAX_VALUE);
        Assert.assertEquals(Short.MAX_VALUE, sut.getHalfPrecisionAsShort());
    }
}
Answered by Martijn Courteaux
I was interested in small positive floating point numbers, so I built this variant with a 12 bit mantissa, no sign bit, and a 4 bit exponent with bias 15, such that it can represent numbers between 0 and 1.00 (exclusive) quite well. It has 2 extra bits of resolution in the mantissa, but the same low exponent range.
public static float toFloat(int hbits) {
    int mant = hbits & 0x0fff;         // 12 bits mantissa
    int exp = (hbits & 0xf000) >>> 12; // 4 bits exponent
    if (exp == 0xf) {
        exp = 0xff;
    } else {
        if (exp != 0) {           // normal value
            exp += 127 - 15;
        } else {                  // subnormal value
            if (mant != 0) {      // not zero
                exp += 127 - 15;
                // make it normal
                exp++;
                do {
                    mant <<= 1;
                    exp--;
                } while ((mant & 0x1000) == 0);
                mant &= 0x0fff;
            }
        }
    }
    return Float.intBitsToFloat(exp << 23 | mant << 11);
}
public static int fromFloat(float fval) {
    int fbits = Float.floatToIntBits( fval );
    int val = ( fbits & 0x7fffffff ) + 0x400;  // rounded value
    if( val < 0x32000000 )                     // too small for subnormal or negative
        return 0;                              // becomes 0
    if( val >= 0x47800000 )                    // might be or become NaN/Inf
    {                                          // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                      // is or must become NaN/Inf
            if( val < 0x7f800000 )             // was value but too large
                return 0xf000;                 // make it +/-Inf
            return 0xf000 |                    // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 11; // keep NaN (and Inf) bits
        }
        return 0x7fff;                         // unrounded not quite Inf
    }
    if( val >= 0x38800000 )                    // remains normalized value
        return val - 0x38000000 >>> 11;        // exp - 127 + 15
    val = ( fbits & 0x7fffffff ) >>> 23;       // tmp exp for subnormal calc
    return ( ( fbits & 0x7f_ffff | 0x80_0000 ) // add subnormal bit
        + ( 0x800000 >>> val - 100 )           // round depending on cut off
        >>> 124 - val );                       // div by 2^(1-(exp-127+15)) and >> 11 | exp=0
}
Testing gives:
Smallest subnormal float : 0.0000000149
Largest subnormal float : 0.0000610203
Smallest normal float : 0.0000610352
Smallest normal float + ups: 0.0000610501
E=1, M=fff (max) : 0.0001220554
Largest normal float : 0.0078115463
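The first three of these boundary values follow directly from the format parameters (12 mantissa bits, exponent bias 15); a quick cross-check of mine, assuming nothing beyond the format definition:

```java
public class BoundaryCheck {
    public static void main(String[] args) {
        // smallest subnormal: one mantissa ulp, 2^-12 * 2^(1-15) = 2^-26
        System.out.printf("%.10f%n", Math.pow(2, -26));
        // largest subnormal: (1 - 2^-12) * 2^-14
        System.out.printf("%.10f%n", (1 - Math.pow(2, -12)) * Math.pow(2, -14));
        // smallest normal: 2^(1-15) = 2^-14
        System.out.printf("%.10f%n", Math.pow(2, -14));
    }
}
```

The printed values match the 0.0000000149 / 0.0000610203 / 0.0000610352 lines above.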
Normals:
0.9990000129 => 3f7fbe77 => eff8 => 0.9990234375 | error: 0.002%
0.8991000056 => 3f662b6b => ecc5 => 0.8990478516 | error: 0.006%
0.8091899753 => 3f4f2713 => e9e5 => 0.8092041016 | error: 0.002%
0.7282709479 => 3f3a6ff7 => e74e => 0.7282714844 | error: 0.000%
0.6554438472 => 3f27cb2b => e4f9 => 0.6553955078 | error: 0.007%
0.5898994207 => 3f1703a6 => e2e0 => 0.5898437500 | error: 0.009%
0.5309094787 => 3f07e9af => e0fd => 0.5308837891 | error: 0.005%
0.4778185189 => 3ef4a4a1 => de95 => 0.4778442383 | error: 0.005%
0.4300366640 => 3edc2dc4 => db86 => 0.4300537109 | error: 0.004%
0.3870329857 => 3ec62930 => d8c5 => 0.3870239258 | error: 0.002%
0.3483296633 => 3eb25844 => d64b => 0.3483276367 | error: 0.001%
0.3134966791 => 3ea082a3 => d410 => 0.3134765625 | error: 0.006%
0.2821469903 => 3e907592 => d20f => 0.2821655273 | error: 0.007%
0.2539322972 => 3e82036a => d040 => 0.2539062500 | error: 0.010%
0.2285390645 => 3e6a0625 => cd41 => 0.2285461426 | error: 0.003%
0.2056851536 => 3e529f21 => ca54 => 0.2056884766 | error: 0.002%
0.1851166338 => 3e3d8f37 => c7b2 => 0.1851196289 | error: 0.002%
0.1666049659 => 3e2a9a7e => c553 => 0.1665954590 | error: 0.006%
0.1499444693 => 3e198b0b => c331 => 0.1499328613 | error: 0.008%
0.1349500120 => 3e0a3056 => c146 => 0.1349487305 | error: 0.001%
0.1214550063 => 3df8bd67 => bf18 => 0.1214599609 | error: 0.004%
0.1093095019 => 3ddfdda9 => bbfc => 0.1093139648 | error: 0.004%
0.0983785465 => 3dc97ab1 => b92f => 0.0983734131 | error: 0.005%
0.0885406882 => 3db554d2 => b6ab => 0.0885467529 | error: 0.007%
0.0796866193 => 3da332bd => b466 => 0.0796813965 | error: 0.007%
0.0717179552 => 3d92e0dd => b25c => 0.0717163086 | error: 0.002%
0.0645461604 => 3d8430c7 => b086 => 0.0645446777 | error: 0.002%
0.0580915436 => 3d6df166 => adbe => 0.0580902100 | error: 0.002%
0.0522823893 => 3d56260f => aac5 => 0.0522842407 | error: 0.004%
0.0470541492 => 3d40bbda => a817 => 0.0470504761 | error: 0.008%
0.0423487313 => 3d2d75dd => a5af => 0.0423507690 | error: 0.005%
0.0381138586 => 3d1c1d47 => a384 => 0.0381164551 | error: 0.007%
0.0343024731 => 3d0c80c0 => a190 => 0.0343017578 | error: 0.002%
0.0308722258 => 3cfce7c0 => 9f9d => 0.0308723450 | error: 0.000%
0.0277850032 => 3ce39d60 => 9c74 => 0.0277862549 | error: 0.005%
0.0250065029 => 3cccda70 => 999b => 0.0250053406 | error: 0.005%
0.0225058515 => 3cb85e31 => 970c => 0.0225067139 | error: 0.004%
0.0202552658 => 3ca5ee5f => 94be => 0.0202560425 | error: 0.004%
0.0182297379 => 3c955688 => 92ab => 0.0182304382 | error: 0.004%
0.0164067633 => 3c86677a => 90cd => 0.0164070129 | error: 0.002%
0.0147660868 => 3c71ed75 => 8e3e => 0.0147666931 | error: 0.004%
0.0132894777 => 3c59bc1c => 8b38 => 0.0132904053 | error: 0.007%
0.0119605297 => 3c43f619 => 887f => 0.0119609833 | error: 0.004%
0.0107644768 => 3c305d7d => 860c => 0.0107650757 | error: 0.006%
0.0096880291 => 3c1eba8a => 83d7 => 0.0096874237 | error: 0.006%
0.0087192263 => 3c0edb16 => 81db => 0.0087184906 | error: 0.008%
0.0078473035 => 3c0091fa => 8012 => 0.0078468323 | error: 0.006%
0.0070625730 => 3be76d28 => 7cee => 0.0070629120 | error: 0.005%
0.0063563157 => 3bd048a4 => 7a09 => 0.0063562393 | error: 0.001%
0.0057206838 => 3bbb7493 => 776f => 0.0057210922 | error: 0.007%
0.0051486152 => 3ba8b5b7 => 7517 => 0.0051488876 | error: 0.005%
0.0046337536 => 3b97d6be => 72fb => 0.0046339035 | error: 0.003%
0.0041703782 => 3b88a7ab => 7115 => 0.0041704178 | error: 0.001%
0.0037533403 => 3b75fa9a => 6ebf => 0.0037531853 | error: 0.004%
0.0033780062 => 3b5d618a => 6bac => 0.0033779144 | error: 0.003%
0.0030402055 => 3b473e2f => 68e8 => 0.0030403137 | error: 0.004%
0.0027361847 => 3b335190 => 666a => 0.0027360916 | error: 0.003%
0.0024625661 => 3b216301 => 642c => 0.0024623871 | error: 0.007%
0.0022163095 => 3b113f81 => 6228 => 0.0022163391 | error: 0.001%
0.0019946785 => 3b02b927 => 6057 => 0.0019946098 | error: 0.003%
0.0017952106 => 3aeb4d46 => 5d6a => 0.0017952919 | error: 0.005%
0.0016156895 => 3ad3c58b => 5a79 => 0.0016157627 | error: 0.005%
0.0014541205 => 3abe9830 => 57d3 => 0.0014541149 | error: 0.000%
0.0013087085 => 3aab88f8 => 5571 => 0.0013086796 | error: 0.002%
0.0011778376 => 3a9a61ac => 534c => 0.0011777878 | error: 0.004%
0.0010600538 => 3a8af181 => 515e => 0.0010600090 | error: 0.004%
0.0009540484 => 3a7a191b => 4f43 => 0.0009540319 | error: 0.002%
0.0008586436 => 3a611698 => 4c23 => 0.0008586645 | error: 0.002%
0.0007727792 => 3a4a9455 => 4953 => 0.0007728338 | error: 0.007%
0.0006955012 => 3a36524c => 46ca => 0.0006954670 | error: 0.005%
0.0006259511 => 3a2416de => 4483 => 0.0006259680 | error: 0.003%
0.0005633560 => 3a13ae2e => 4276 => 0.0005633831 | error: 0.005%
0.0005070204 => 3a04e990 => 409d => 0.0005069971 | error: 0.005%
0.0004563183 => 39ef3e03 => 3de8 => 0.0004563332 | error: 0.003%
0.0004106865 => 39d75169 => 3aea => 0.0004106760 | error: 0.003%
0.0003696179 => 39c1c945 => 3839 => 0.0003696084 | error: 0.003%
0.0003326561 => 39ae6857 => 35cd => 0.0003326535 | error: 0.001%
0.0002993904 => 399cf781 => 339f => 0.0002993941 | error: 0.001%
0.0002694514 => 398d4527 => 31a9 => 0.0002694726 | error: 0.008%
0.0002425062 => 397e4946 => 2fc9 => 0.0002425015 | error: 0.002%
0.0002182556 => 3964db8b => 2c9b => 0.0002182424 | error: 0.006%
0.0001964300 => 394df8ca => 29bf => 0.0001964271 | error: 0.001%
0.0001767870 => 39395fe9 => 272c => 0.0001767874 | error: 0.000%
0.0001591083 => 3926d651 => 24db => 0.0001591146 | error: 0.004%
0.0001431975 => 39162749 => 22c5 => 0.0001432002 | error: 0.002%
0.0001288777 => 3907235b => 20e4 => 0.0001288652 | error: 0.010%
0.0001159900 => 38f33fa3 => 1e68 => 0.0001159906 | error: 0.001%
0.0001043910 => 38daec79 => 1b5e => 0.0001043975 | error: 0.006%
0.0000939519 => 38c50806 => 18a1 => 0.0000939518 | error: 0.000%
0.0000845567 => 38b15405 => 162b => 0.0000845641 | error: 0.009%
0.0000761010 => 389f986b => 13f3 => 0.0000761002 | error: 0.001%
0.0000684909 => 388fa2c6 => 11f4 => 0.0000684857 | error: 0.008%
0.0000616418 => 388145b2 => 1029 => 0.0000616461 | error: 0.007%
And for the subnormal tests:
0.0000554776 => 3868b0a6 => 0e8b => 0.0000554770 | error: 0.001%
0.0000499299 => 38516bc8 => 0d17 => 0.0000499338 | error: 0.008%
0.0000449369 => 383c7a9a => 0bc8 => 0.0000449419 | error: 0.011%
0.0000404432 => 3829a18a => 0a9a => 0.0000404418 | error: 0.004%
0.0000363989 => 3818aafc => 098b => 0.0000364035 | error: 0.013%
0.0000327590 => 380966af => 0896 => 0.0000327528 | error: 0.019%
0.0000294831 => 37f7526e => 07bb => 0.0000294894 | error: 0.021%
0.0000265348 => 37de96fc => 06f5 => 0.0000265390 | error: 0.016%
0.0000238813 => 37c854af => 0643 => 0.0000238866 | error: 0.022%
0.0000214932 => 37b44c37 => 05a2 => 0.0000214875 | error: 0.026%
0.0000193438 => 37a24498 => 0512 => 0.0000193417 | error: 0.011%
0.0000174095 => 37920a89 => 0490 => 0.0000174046 | error: 0.028%
0.0000156685 => 37836fe1 => 041b => 0.0000156611 | error: 0.047%
0.0000141017 => 376c962e => 03b2 => 0.0000140965 | error: 0.037%
0.0000126915 => 3754ed8f => 0354 => 0.0000126958 | error: 0.034%
0.0000114223 => 373fa29a => 02ff => 0.0000114292 | error: 0.060%
0.0000102801 => 372c78be => 02b2 => 0.0000102818 | error: 0.016%
0.0000092521 => 371b3978 => 026d => 0.0000092536 | error: 0.016%
0.0000083269 => 370bb3b9 => 022f => 0.0000083297 | error: 0.034%
0.0000074942 => 36fb76b3 => 01f7 => 0.0000074953 | error: 0.014%
0.0000067448 => 36e2513a => 01c5 => 0.0000067502 | error: 0.081%
0.0000060703 => 36cbaf81 => 0197 => 0.0000060648 | error: 0.091%
0.0000054633 => 36b75127 => 016f => 0.0000054687 | error: 0.100%
0.0000049169 => 36a4fc3c => 014a => 0.0000049174 | error: 0.009%
0.0000044253 => 36947c9c => 0129 => 0.0000044256 | error: 0.009%
0.0000039827 => 3685a359 => 010b => 0.0000039786 | error: 0.103%
0.0000035845 => 36708c6d => 00f1 => 0.0000035912 | error: 0.188%
0.0000032260 => 36587e62 => 00d8 => 0.0000032187 | error: 0.228%
0.0000029034 => 3642d825 => 00c3 => 0.0000029057 | error: 0.080%
0.0000026131 => 362f5c21 => 00af => 0.0000026077 | error: 0.205%
0.0000023518 => 361dd2ea => 009e => 0.0000023544 | error: 0.112%
0.0000021166 => 360e0a9f => 008e => 0.0000021160 | error: 0.029%
0.0000019049 => 35ffacb7 => 0080 => 0.0000019073 | error: 0.127%
0.0000017144 => 35e61b71 => 0073 => 0.0000017136 | error: 0.047%
0.0000015430 => 35cf18b2 => 0068 => 0.0000015497 | error: 0.436%
0.0000013887 => 35ba6306 => 005d => 0.0000013858 | error: 0.208%
0.0000012498 => 35a7bf85 => 0054 => 0.0000012517 | error: 0.150%
0.0000011248 => 3596f92b => 004b => 0.0000011176 | error: 0.645%
0.0000010124 => 3587e040 => 0044 => 0.0000010133 | error: 0.091%
0.0000009111 => 357493a6 => 003d => 0.0000009090 | error: 0.236%
0.0000008200 => 355c1e7b => 0037 => 0.0000008196 | error: 0.054%
0.0000007380 => 35461b6e => 0032 => 0.0000007451 | error: 0.955%
0.0000006642 => 35324be3 => 002d => 0.0000006706 | error: 0.955%
0.0000005978 => 3520777f => 0028 => 0.0000005960 | error: 0.291%
0.0000005380 => 35106b8c => 0024 => 0.0000005364 | error: 0.291%
0.0000004842 => 3501fa64 => 0020 => 0.0000004768 | error: 1.522%
0.0000004358 => 34e9f5e7 => 001d => 0.0000004321 | error: 0.838%
0.0000003922 => 34d29083 => 001a => 0.0000003874 | error: 1.218%
0.0000003530 => 34bd820f => 0018 => 0.0000003576 | error: 1.315%
0.0000003177 => 34aa8ea7 => 0015 => 0.0000003129 | error: 1.499%
0.0000002859 => 34998063 => 0013 => 0.0000002831 | error: 0.978%
0.0000002573 => 348a26bf => 0011 => 0.0000002533 | error: 1.557%
0.0000002316 => 3478ac24 => 0010 => 0.0000002384 | error: 2.947%
0.0000002084 => 345fce20 => 000e => 0.0000002086 | error: 0.087%
0.0000001876 => 34496cb6 => 000d => 0.0000001937 | error: 3.264%
0.0000001688 => 3435483d => 000b => 0.0000001639 | error: 2.914%
0.0000001519 => 3423276a => 000a => 0.0000001490 | error: 1.933%
0.0000001368 => 3412d6ac => 0009 => 0.0000001341 | error: 1.933%
0.0000001231 => 3404279b => 0008 => 0.0000001192 | error: 3.144%
0.0000001108 => 33ede0e3 => 0007 => 0.0000001043 | error: 5.834%
0.0000000997 => 33d61732 => 0007 => 0.0000001043 | error: 4.629%
0.0000000897 => 33c0ae79 => 0006 => 0.0000000894 | error: 0.354%
0.0000000808 => 33ad69d3 => 0005 => 0.0000000745 | error: 7.735%
0.0000000727 => 339c1271 => 0005 => 0.0000000745 | error: 2.517%
0.0000000654 => 338c76ff => 0004 => 0.0000000596 | error: 8.874%
0.0000000589 => 337cd631 => 0004 => 0.0000000596 | error: 1.251%
0.0000000530 => 33638d92 => 0004 => 0.0000000596 | error: 12.501%
0.0000000477 => 334ccc36 => 0003 => 0.0000000447 | error: 6.249%
0.0000000429 => 33385163 => 0003 => 0.0000000447 | error: 4.168%
0.0000000386 => 3325e2d9 => 0003 => 0.0000000447 | error: 15.742%
0.0000000348 => 33154c29 => 0002 => 0.0000000298 | error: 14.265%
0.0000000313 => 33065e25 => 0002 => 0.0000000298 | error: 4.739%
0.0000000282 => 32f1dca9 => 0002 => 0.0000000298 | error: 5.846%
0.0000000253 => 32d9acfe => 0002 => 0.0000000298 | error: 17.606%
0.0000000228 => 32c3e87e => 0002 => 0.0000000298 | error: 30.673%
0.0000000205 => 32b0513e => 0001 => 0.0000000149 | error: 27.404%
0.0000000185 => 329eaf84 => 0001 => 0.0000000149 | error: 19.337%
0.0000000166 => 328ed12a => 0001 => 0.0000000149 | error: 10.375%
0.0000000150 => 3280890c => 0001 => 0.0000000149 | error: 0.416%
0.0000000135 => 32675d15 => 0001 => 0.0000000149 | error: 10.648%
0.0000000121 => 32503a2c => 0001 => 0.0000000149 | error: 22.943%
0.0000000109 => 323b678e => 0001 => 0.0000000149 | error: 36.603%
0.0000000098 => 3228aa00 => 0001 => 0.0000000149 | error: 51.781%
0.0000000088 => 3217cc33 => 0001 => 0.0000000149 | error: 68.646%
0.0000000080 => 32089e2e => 0001 => 0.0000000149 | error: 87.384%
0.0000000072 => 31f5e986 => 0000 => 0.0000000000 | error: 100.000%
Answered by cpurdy
Before I saw the solution posted here, I had whipped up something simple:
public static float toFloat(int nHalf)
{
    int S = (nHalf >>> 15) & 0x1;
    int E = (nHalf >>> 10) & 0x1F;
    int T = (nHalf       ) & 0x3FF;

    if (E == 0) // E == 0 encodes +/-0 and subnormals: value is +/- T * 2^-24
        return (S == 0 ? 1f : -1f) * T * 0x1p-24f;

    E = E == 0x1F
        ? 0xFF          // it's 2^w-1; it's all 1's, so keep it all 1's for the 32-bit float
        : E - 15 + 127; // adjust the exponent from the 16-bit bias to the 32-bit bias

    // sign S is now bit 31
    // exp E is from bit 30 to bit 23
    // scale T by 13 binary digits (it grew from 10 to 23 bits)
    return Float.intBitsToFloat(S << 31 | E << 23 | T << 13);
}
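A quick standalone sanity check of the normal-value path of this simple decoder (harness and names are mine; E == 0 inputs, i.e. zero and subnormals, are deliberately out of scope here):

```java
public class SimpleHalfCheck {
    // normal-value and Inf/NaN path of the simple decoder:
    // re-bias the exponent and shift the mantissa up 13 bits
    static float toFloat(int nHalf) {
        int S = (nHalf >>> 15) & 0x1;
        int E = (nHalf >>> 10) & 0x1F;
        int T = nHalf & 0x3FF;
        E = E == 0x1F ? 0xFF : E - 15 + 127;
        return Float.intBitsToFloat(S << 31 | E << 23 | T << 13);
    }

    public static void main(String[] args) {
        System.out.println(toFloat(0x3C00)); // 1.0
        System.out.println(toFloat(0xC000)); // -2.0
        System.out.println(toFloat(0x7C00)); // Infinity
    }
}
```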
I do like the approach in the other posted solution, though. For reference:
// notes from the IEEE-754 specification:
// left to right bits of a binary floating point number:
//   size        bit ids      name description
//   ----------  ------------ ---- ---------------------------
//   1 bit       S                 sign
//   w bits      E[0]..E[w-1] E    biased exponent
//   t=p-1 bits  d[1]..d[p-1] T    trailing significand field
// The range of the encoding's biased exponent E shall include:
//   - every integer between 1 and 2^w - 2, inclusive, to encode normal numbers
//   - the reserved value 0 to encode +/-0 and subnormal numbers
//   - the reserved value 2^w - 1 to encode +/-infinity and NaN
// The representation r of the floating-point datum, and value v of the floating-point datum
// represented, are inferred from the constituent fields as follows:
// a) If E == 2^w - 1 and T != 0, then r is qNaN or sNaN and v is NaN regardless of S
// b) If E == 2^w - 1 and T == 0, then r = v = (-1)^S * (+infinity)
// c) If 1 <= E <= 2^w - 2, then r is (S, (E - bias), (1 + 2^(1-p) * T))
//    the value of the corresponding floating-point number is
//      v = (-1)^S * 2^(E - bias) * (1 + 2^(1-p) * T)
//    thus normal numbers have an implicit leading significand bit of 1
// d) If E == 0 and T != 0, then r is (S, emin, (0 + 2^(1-p) * T))
//    the value of the corresponding floating-point number is
//      v = (-1)^S * 2^emin * (0 + 2^(1-p) * T)
//    thus subnormal numbers have an implicit leading significand bit of 0
// e) If E == 0 and T == 0, then r is (S, emin, 0) and v = (-1)^S * (+0)
//
// parameter                                     bin16  bin32
// --------------------------------------------  -----  -----
// k, storage width in bits                         16     32
// p, precision in bits                             11     24
// emax, maximum exponent e                         15    127
// bias, E - e                                      15    127
// sign bit                                          1      1
// w, exponent field width in bits                   5      8
// t, trailing significand field width in bits      10     23