
Commit fc4a616

Add PrecisionFromfloat32() and IsQuietNaN()
PrecisionFromfloat32() can be inlined and runs in < 0.5 ns/op. It reports whether converting the specified float32 to float16 (IEEE 754 binary16) would be exact, inexact, underflow, or overflow. IsQuietNaN() reports whether the specified NaN has its quiet bit set. Closes: #3
1 parent ec86454 commit fc4a616

4 files changed, +216 -22 lines changed

README.md

Lines changed: 24 additions & 12 deletions
@@ -5,37 +5,38 @@
 [![Release](https://img.shields.io/github/release/cbor-go/float16.svg?style=flat-square)](https://github.com/cbor-go/float16/releases)
 [![License](http://img.shields.io/badge/license-mit-blue.svg?style=flat-square)](https://raw.githubusercontent.com/cbor-go/float16/master/LICENSE)
 
-`float16` package provides [IEEE 754 half-precision floating-point format](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) with IEEE 754 default rounding for conversions. IEEE 754-2008 refers to this 16-bit floating-point format as binary16.
+`float16` package provides [IEEE 754 half-precision floating-point format (binary16)](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) with IEEE 754 default rounding for conversions. IEEE 754-2008 refers to this 16-bit floating-point format as binary16.
 
 IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven") is considered the most accurate and statistically unbiased estimate of the true result.
 
 All possible 4+ billion floating-point conversions with this library are verified to be correct.
 
+This library uses the lowercase word "float16" to refer to IEEE 754 binary16 and the capitalized "Float16" for the exported Go data type that represents float16.
+
 ## Features
 Current features include:
 
 * float16 to float32 conversions use lossless conversion.
 * float32 to float16 conversions use IEEE 754-2008 "Round-to-Nearest RoundTiesToEven".
 * conversions use __zero allocs__ and are about __2.65 ns/op__ (in pure Go) on a desktop amd64.
 * unit tests provide 100% code coverage and check all possible 4+ billion conversions.
-* other functions include: IsFinite(), IsInf(), IsNaN(), IsNormal(), Signbit(), and String().
+* other functions include: IsInf(), IsNaN(), IsNormal(), PrecisionFromfloat32(), String(), etc.
 * all functions in this library use zero allocs except String().
 
 ## Status
-This library is used by [fxamacker/cbor](https://github.com/fxamacker/cbor) and is ready for production use on supported platforms.
+This library is used by [fxamacker/cbor](https://github.com/fxamacker/cbor) and is ready for production use on supported platforms. The version number < 1.0 indicates that more functions and options are planned but not yet published.
 
 Current status:
 
-* core API is done and breaking API changes are unlikely except Fromfloat32() to add options.
+* core API is done and breaking API changes are unlikely.
 * 100% of unit tests pass:
-  * short mode (`go test -short`) tests around 65763 conversions in 0.005s.
-  * normal mode (`go test`) tests all possible 4+ billion conversions in about 45s.
+  * short mode (`go test -short`) tests around 65765 conversions in 0.005s.
+  * normal mode (`go test`) tests all possible 4+ billion conversions in about 75s.
 * 100% code coverage with both short mode and normal mode.
 * tested on amd64 but it should work on all little-endian platforms supported by Go.
 
 Roadmap:
 
-* add a function to both convert and report precision issues in one call.
 * add functions for fast batch conversions.
 * speed up unit test when verifying all possible 4+ billion conversions.
 * test on additional platforms.
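The README names "Round-to-Nearest RoundTiesToEven" as the rounding rule without showing what it does to dropped bits. The following is a minimal sketch of that rule on a bare integer significand. `dropBitsTiesToEven` is a hypothetical helper written for illustration and is not the library's conversion code; it also ignores the carry into the exponent that a full float32 to float16 conversion has to handle.

```go
package main

import "fmt"

// dropBitsTiesToEven is a hypothetical helper (not part of the float16 package).
// It discards the n low-order bits of x (1 <= n < 32) and rounds the result to
// nearest, resolving exact ties toward the even (last bit zero) neighbour.
func dropBitsTiesToEven(x uint32, n uint) uint32 {
	half := uint32(1) << (n - 1) // weight of the highest dropped bit
	kept := x >> n               // bits that survive
	dropped := x & (half<<1 - 1) // bits that are discarded
	if dropped > half || (dropped == half && kept&1 == 1) {
		kept++ // round up: above half, or a tie with an odd kept value
	}
	return kept
}

func main() {
	// float32 has 23 significand bits and binary16 has 10, so 13 bits are dropped.
	for _, x := range []uint32{0x0fff, 0x1000, 0x1001, 0x3000} {
		fmt.Printf("%#06x -> %d\n", x, dropBitsTiesToEven(x, 13))
	}
}
```

The two tie cases (`0x1000` and `0x3000`) round in opposite directions, which is what keeps the rule statistically unbiased.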
@@ -48,9 +49,9 @@ Unit tests take a fraction of a second to check all 65536 expected values for fl
 ## Float32 to Float16 Conversion
 Conversions from float32 to float16 use IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven"). All 4294967296 possible float32 to float16 conversions (in pure Go) are confirmed to be correct.
 
-Unit tests in normal mode take about 35-55 seconds to check all 4+ billion expected values for float32 to float16 conversions.
+Unit tests in normal mode take about 60-90 seconds to check all 4+ billion expected values for float32 to float16 conversions as well as PrecisionFromfloat32() for each.
 
-Unit tests in short mode use a small subset (65763) of expected values and finish in under 1 second while still reaching 100% code coverage.
+Unit tests in short mode use a small subset (65765) of expected values and finish in under 0.01 second while still reaching 100% code coverage.
 
 ## Usage
 Install with `go get github.com/cbor-go/float16`.
@@ -61,6 +62,12 @@ pi16 := float16.Fromfloat32(pi)
 
 // Convert float16 to float32
 pi32 := pi16.Float32()
+
+// Only convert if there's no data loss (useful for CBOR encoders)
+// PrecisionFromfloat32() is inlinable, so the check costs less than a function call
+if float16.PrecisionFromfloat32(pi) == float16.PrecisionExact {
+	pi16 := float16.Fromfloat32(pi)
+}
 ```
 
 ## Float16 Type and API
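The "only convert if there's no data loss" pattern in the usage snippet is what lets an encoder pick the shorter form. Below is a minimal, self-contained sketch of that idea. It does not import the float16 package: `fitsHalfExactly` and `packSmallest` are hypothetical helpers that reimplement only the exact case for zero and normal binary16 values, and the 2-byte or 4-byte output is a bare big-endian payload, not a complete CBOR item.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// fitsHalfExactly is a hypothetical helper (not part of the float16 package).
// It reports whether f is +-0, or a finite value inside the normal binary16
// exponent range whose 13 low-order significand bits are all zero.
// Inf, NaN and subnormal results are deliberately treated as "does not fit".
func fitsHalfExactly(f float32) bool {
	u := math.Float32bits(f)
	if u&0x7fffffff == 0 {
		return true // +-0
	}
	exp := int32(u>>23&0xff) - 127 // unbiased float32 exponent
	return exp >= -14 && exp <= 15 && u&0x1fff == 0
}

// packSmallest is a hypothetical encoder step: it emits 2 bytes when the
// value survives the trip to binary16 unchanged, and 4 bytes otherwise.
func packSmallest(f float32) []byte {
	u := math.Float32bits(f)
	if !fitsHalfExactly(f) {
		out := make([]byte, 4)
		binary.BigEndian.PutUint32(out, u)
		return out
	}
	h := uint16(u>>16) & 0x8000 // sign bit
	if u&0x7fffffff != 0 {
		h |= uint16((u>>23&0xff)-112) << 10 // rebias exponent: -127 + 15
		h |= uint16(u>>13) & 0x03ff         // top 10 significand bits
	}
	out := make([]byte, 2)
	binary.BigEndian.PutUint16(out, h)
	return out
}

func main() {
	for _, f := range []float32{1.0, 1.5, -2.0, 0.1} {
		fmt.Printf("%g -> % x\n", f, packSmallest(f))
	}
}
```

With the real package, the same decision would presumably be made by comparing `PrecisionFromfloat32()` against `PrecisionExact` and then calling `Fromfloat32()`, as the README snippet does.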
@@ -77,10 +84,13 @@ Frombits(b16 uint16) Float16 // Float16 number corresponding to b16 (IEEE
 NaN() Float16                                // Float16 of IEEE 754 binary16 not-a-number
 Inf(sign int) Float16                        // Float16 of IEEE 754 binary16 infinity according to sign
 
+PrecisionFromfloat32(f32 float32) Precision  // quickly indicates exact, inexact, overflow, underflow
+                                             // (inline and < 1 ns/op)
 // Exported methods
 (f Float16) Float32() float32                // float32 number converted from f16 using lossless conversion
 (f Float16) Bits() uint16                    // the IEEE 754 binary16 representation of f
 (f Float16) IsNaN() bool                     // true if f is not-a-number (NaN)
+(f Float16) IsQuietNaN() bool                // true if f is a quiet not-a-number (NaN)
 (f Float16) IsInf(sign int) bool             // true if f is infinite based on sign (-1=NegInf, 0=any, 1=PosInf)
 (f Float16) IsFinite() bool                  // true if f is not infinite or NaN
 (f Float16) IsNormal() bool                  // true if f is not zero, infinite, subnormal, or NaN.
@@ -90,15 +100,17 @@ Inf(sign int) Float16 // Float16 of IEEE 754 binary16 infinity acc
 See [API](https://godoc.org/github.com/cbor-go/float16) at godoc.org for more info.
 
 ## Benchmarks
-Conversions (in pure Go) are around 2.65 ns/op for float16 to Float32 as well as Float32 to float16 on amd64.
+Conversions (in pure Go) are around 2.65 ns/op for float16 to float32 as well as float32 to float16 on amd64. Speeds can vary depending on the input value.
 
-Frombits is included as a canary to catch overoptimized benchmarks. Frombits should be faster than all other functions.
+Frombits is included as a canary to catch overoptimized benchmarks. It should be faster than all other functions except PrecisionFromfloat32.
 ```
 All functions have zero allocations except float16.String().
 
 FromFloat32pi-2          2.59ns ± 0%   // speed using Fromfloat32() to convert a float32 of math.Pi to Float16
 ToFloat32pi-2            2.69ns ± 0%   // speed using Float32() to convert a float16 of math.Pi to float32
-Frombits-2               0.36ns ± 8%   // speed using Frombits() to cast a uint16 to Float16
+Frombits-2               0.29ns ± 5%   // speed using Frombits() to cast a uint16 to Float16
+
+PrecisionFromFloat32-2   0.29ns ± 1%   // speed using PrecisionFromfloat32() to check for overflows, etc.
 ```
 
 ## System Requirements

float16.go

Lines changed: 66 additions & 3 deletions
@@ -13,6 +13,60 @@ import (
 // Float16 represents IEEE 754 half-precision floating-point numbers (binary16).
 type Float16 uint16
 
+// Precision indicates whether the conversion to Float16 is
+// exact, inexact, underflow, or overflow.
+type Precision int
+
+const (
+	PrecisionExact Precision = iota
+	PrecisionInexact
+	PrecisionUnderflow
+	PrecisionOverflow
+)
+
+// PrecisionFromfloat32 returns Precision without performing
+// the conversion. Conversions from both Infinity and NaN
+// values will always report PrecisionExact even if NaN payload
+// or NaN-Quiet-Bit is lost. This function is kept simple to
+// allow inlining and run < 0.5 ns/op.
+func PrecisionFromfloat32(f32 float32) Precision {
+	u32 := math.Float32bits(f32)
+
+	if u32 == 0 || u32 == 0x80000000 {
+		// +- zero will always be exact conversion
+		return PrecisionExact
+	}
+
+	const COEFMASK uint32 = 0x7fffff // 23 least significant bits
+	const EXPSHIFT uint32 = 23
+	const EXPBIAS uint32 = 127
+	const EXPMASK uint32 = uint32(0xff) << EXPSHIFT
+	const DROPMASK uint32 = COEFMASK >> 10 // 13 low-order bits that binary16 cannot hold
+
+	exp := int32(((u32 & EXPMASK) >> EXPSHIFT) - EXPBIAS)
+	coef := u32 & COEFMASK
+
+	if exp == 128 {
+		// +- infinity or NaN
+		// apps may want to do extra checks for NaN separately
+		return PrecisionExact
+	}
+	if exp < -14 {
+		// There are 2046 values out of 4+ billion that can round-trip back
+		// to original value with IEEE default rounding despite this underflow.
+		return PrecisionUnderflow
+	}
+	if exp > 15 {
+		return PrecisionOverflow
+	}
+	if (coef & DROPMASK) == uint32(0) {
+		// exponent is within half-precision range and none of the
+		// dropped significand bits are set, so no bits are lost
+		return PrecisionExact
+	}
+
+	return PrecisionInexact
+}
+
 // Frombits returns the float16 number corresponding to the IEEE 754 binary16
 // representation u16, with the sign bit of u16 and the result in the same bit
 // position. Frombits(Bits(x)) == x.
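To see how the exponent and significand tests in PrecisionFromfloat32 play out on concrete inputs, here is a self-contained sketch. `classify` is a hypothetical stand-in that returns strings instead of the package's Precision constants, so it runs without importing float16. `countExactSubnormals` illustrates one reading of the "2046 values" comment: the nonzero multiples of 2^-24 below 2^-14 are exactly the binary16 subnormals, 1023 per sign, and all of them fall under the `exp < -14` branch.

```go
package main

import (
	"fmt"
	"math"
)

// classify is a hypothetical stand-in for the checks in the diff above.
// It applies the same tests in the same order but returns a string.
func classify(f float32) string {
	u := math.Float32bits(f)
	if u&0x7fffffff == 0 {
		return "exact" // +-0
	}
	exp := int32(u>>23&0xff) - 127 // unbiased float32 exponent
	switch {
	case exp == 128:
		return "exact" // +-Inf or NaN
	case exp < -14:
		return "underflow" // below the smallest normal binary16 exponent
	case exp > 15:
		return "overflow" // above the largest binary16 exponent
	case u&0x1fff == 0:
		return "exact" // no set bits among the 13 that would be dropped
	}
	return "inexact"
}

// countExactSubnormals counts the float32 values m * 2^-24 (1 <= m <= 1023,
// both signs) that classify reports as underflow. Each of them is exactly
// representable as a binary16 subnormal.
func countExactSubnormals() int {
	n := 0
	for m := 1; m <= 1023; m++ {
		v := float32(math.Ldexp(float64(m), -24))
		if classify(v) == "underflow" && classify(-v) == "underflow" {
			n += 2
		}
	}
	return n
}

func main() {
	for _, f := range []float32{1.0, 0.1, 65504, 65536, 1e-8} {
		fmt.Printf("%g: %s\n", f, classify(f))
	}
	fmt.Println(countExactSubnormals()) // 2046
}
```

A caller that needs the smallest lossless encoding for those subnormal inputs would have to convert and compare the round trip rather than rely on the underflow result alone.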
@@ -26,9 +80,12 @@ func Fromfloat32(f32 float32) Float16 {
 	return Float16(f32bitsToF16bits(math.Float32bits(f32)))
 }
 
-// NaN returns a Float16 with Not-a-Number (NaN) value.
+// NaN returns a Float16 of IEEE 754 binary16 not-a-number (NaN).
+// Returned NaN value 0x7e01 has all exponent bits = 1 with the
+// first and last bits = 1 in the significand. This is consistent
+// with Go's 64-bit math.NaN(). Canonical CBOR in RFC 7049 uses 0x7e00.
 func NaN() Float16 {
-	return Float16(0x7c00 | 0x03ff)
+	return Float16(0x7e01)
 }
 
 // Inf returns a Float16 with an infinity value with the specified sign.
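The bit layout described in the new NaN comment can be checked by splitting the patterns into fields. This sketch compares the new value 0x7e01, the CBOR canonical 0x7e00, and the previous return value 0x7fff (0x7c00 | 0x03ff), then prints the bits of Go's float64 NaN for comparison. `nanFields` and `goNaNBits` are hypothetical helpers written for this example.

```go
package main

import (
	"fmt"
	"math"
)

// nanFields is a hypothetical helper that splits a binary16 bit pattern into
// its 1-bit sign, 5-bit exponent and 10-bit significand fields.
func nanFields(h uint16) (sign, exp, frac uint16) {
	return h >> 15, h >> 10 & 0x1f, h & 0x03ff
}

// goNaNBits returns the bit pattern of the float64 NaN produced by math.NaN().
func goNaNBits() uint64 {
	return math.Float64bits(math.NaN())
}

func main() {
	for _, h := range []uint16{0x7e01, 0x7e00, 0x7fff} {
		s, e, f := nanFields(h)
		fmt.Printf("0x%04x sign=%d exp=%05b frac=%010b\n", h, s, e, f)
	}
	// The float64 NaN also sets only the first and last significand bits.
	fmt.Printf("0x%016x\n", goNaNBits())
}
```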
@@ -54,11 +111,17 @@ func (f Float16) Bits() uint16 {
 	return uint16(f)
 }
 
-// IsNaN reports whether f is an IEEE 754 “not-a-number” value.
+// IsNaN reports whether f is an IEEE 754 binary16 “not-a-number” value.
 func (f Float16) IsNaN() bool {
 	return (f&0x7c00 == 0x7c00) && (f&0x03ff != 0)
 }
 
+// IsQuietNaN reports whether f is a quiet (non-signaling) IEEE 754 binary16
+// “not-a-number” value.
+func (f Float16) IsQuietNaN() bool {
+	return (f&0x7c00 == 0x7c00) && (f&0x03ff != 0) && (f&0x0200 != 0)
+}
+
 // IsInf reports whether f is an infinity (inf).
 // A sign > 0 reports whether f is positive inf.
 // A sign < 0 reports whether f is negative inf.
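The quiet-NaN test above checks three masks. A small exhaustive sketch over all 65536 bit patterns shows how the NaN space splits, and that the middle term (`f&0x03ff != 0`) is implied by the quiet bit being set. The functions below operate on plain `uint16` values and are hypothetical stand-ins, not the package's methods.

```go
package main

import "fmt"

// quietThreeTerm repeats the expression from the diff on a plain uint16.
func quietThreeTerm(h uint16) bool {
	return h&0x7c00 == 0x7c00 && h&0x03ff != 0 && h&0x0200 != 0
}

// quietTwoTerm drops the significand-nonzero test: if bit 9 (the quiet bit)
// is set, the significand cannot be zero.
func quietTwoTerm(h uint16) bool {
	return h&0x7c00 == 0x7c00 && h&0x0200 != 0
}

// countNaNs walks every binary16 bit pattern and tallies quiet and signaling
// NaNs. agree is false if the two quiet tests ever disagree.
func countNaNs() (quiet, signaling int, agree bool) {
	agree = true
	for i := 0; i <= 0xffff; i++ {
		h := uint16(i)
		if quietThreeTerm(h) != quietTwoTerm(h) {
			agree = false
		}
		isNaN := h&0x7c00 == 0x7c00 && h&0x03ff != 0
		switch {
		case isNaN && quietThreeTerm(h):
			quiet++
		case isNaN:
			signaling++
		}
	}
	return quiet, signaling, agree
}

func main() {
	q, s, ok := countNaNs()
	fmt.Println(q, s, ok) // 1024 1022 true
}
```

Keeping the redundant term in the package is harmless and makes the relationship to IsNaN explicit.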

float16_bench_test.go

Lines changed: 12 additions & 0 deletions
@@ -13,6 +13,7 @@ import (
 var resultF16 float16.Float16
 var resultF32 float32
 var resultStr string
+var pcn float16.Precision
 
 func BenchmarkFloat32pi(b *testing.B) {
 	result := float32(0)
@@ -65,6 +66,17 @@ func BenchmarkFromFloat32subnorm(b *testing.B) {
 	resultF16 = result
 }
 
+func BenchmarkPrecisionFromFloat32(b *testing.B) {
+	var result float16.Precision
+
+	//pi := float32(math.Pi)
+	for i := 0; i < b.N; i++ {
+		f32 := float32(0.00001) + float32(0.00001)
+		result = float16.PrecisionFromfloat32(f32)
+	}
+	pcn = result
+}
+
 func BenchmarkString(b *testing.B) {
 	result := "1.5"
 