Add PrecisionFromfloat32() and IsQuietNaN()

x448 · x448 · commit fc4a616a45e0 · 2020-01-05T11:13:05.000-06:00
PrecisionFromfloat32() can be inlined and runs in < 0.5 ns/op. It reports whether converting the specified float32 to float16 (IEEE 754 binary16) would be exact, inexact, underflow, or overflow. IsQuietNaN() reports whether the specified NaN has its quiet bit set. Closes: #3
diff --git a/README.md b/README.md
@@ -5,37 +5,38 @@
 [![Release](https://img.shields.io/github/release/cbor-go/float16.svg?style=flat-square)](https://github.com/cbor-go/float16/releases)
 [![License](http://img.shields.io/badge/license-mit-blue.svg?style=flat-square)](https://raw.githubusercontent.com/cbor-go/float16/master/LICENSE)
 
-`float16` package provides [IEEE 754 half-precision floating-point format](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) with IEEE 754 default rounding for conversions. IEEE 754-2008 refers to this 16-bit floating-point format as binary16.
+`float16` package provides [IEEE 754 half-precision floating-point format (binary16)](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) with IEEE 754 default rounding for conversions. IEEE 754-2008 refers to this 16-bit floating-point format as binary16.
 
 IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven") is considered the most accurate and statistically unbiased estimate of the true result.
 
 All possible 4+ billion floating-point conversions with this library are verified to be correct.
 
+This library uses the lowercase word "float16" to refer to IEEE 754 binary16, and the capitalized "Float16" for the exported Go data type that represents float16.
+
 ## Features
 Current features include:
 
 * float16 to float32 conversions use lossless conversion.
 * float32 to float16 conversions use IEEE 754-2008 "Round-to-Nearest RoundTiesToEven".
 * conversions use __zero allocs__ and are about __2.65 ns/op__ (in pure Go) on a desktop amd64.
 * unit tests provide 100% code coverage and check all possible 4+ billion conversions.
-* other functions include: IsFinite(), IsInf(), IsNaN(), IsNormal(), Signbit(), and String().
+* other functions include: IsInf(), IsNaN(), IsNormal(), PrecisionFromfloat32(), String(), etc.
 * all functions in this library use zero allocs except String().
 
 ## Status
-This library is used by [fxamacker/cbor](https://github.com/fxamacker/cbor) and is ready for production use on supported platforms.
+This library is used by [fxamacker/cbor](https://github.com/fxamacker/cbor) and is ready for production use on supported platforms. The version number < 1.0 indicates more functions and options are planned but not yet published.
 
 Current status:
 
-* core API is done and breaking API changes are unlikely except Fromfloat32() to add options.
+* core API is done and breaking API changes are unlikely.
 * 100% of unit tests pass:
-  * short mode (`go test -short`) tests around 65763 conversions in 0.005s.  
-  * normal mode (`go test`) tests all possible 4+ billion conversions in about 45s.  
+  * short mode (`go test -short`) tests around 65765 conversions in 0.005s.  
+  * normal mode (`go test`) tests all possible 4+ billion conversions in about 75s.  
 * 100% code coverage with both short mode and normal mode.  
 * tested on amd64 but it should work on all little-endian platforms supported by Go.
  
 Roadmap:
 
-* add a function to both convert and report precision issues in one call.
 * add functions for fast batch conversions.
 * speed up unit test when verifying all possible 4+ billion conversions.
 * test on additional platforms.
@@ -48,9 +49,9 @@ Unit tests take a fraction of a second to check all 65536 expected values for fl
 ## Float32 to Float16 Conversion
 Conversions from float32 to float16 use IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven").  All 4294967296 possible float32 to float16 conversions (in pure Go) are confirmed to be correct.  
 
-Unit tests in normal mode take about 35-55 seconds to check all 4+ billion expected values for float32 to float16 conversions.  
+Unit tests in normal mode take about 60-90 seconds to check all 4+ billion expected values for float32 to float16 conversions as well as PrecisionFromfloat32() for each.
 
-Unit tests in short mode use a small subset (65763) of expected values and finish in under 1 second while still reaching 100% code coverage.
+Unit tests in short mode use a small subset (65765) of expected values and finish in under 0.01 second while still reaching 100% code coverage.
 
 ## Usage
 Install with `go get github.com/cbor-go/float16`.
@@ -61,6 +62,12 @@ pi16 := float16.Fromfloat32(pi)
 
 // Convert float16 to float32
 pi32 := pi16.Float32()
+
+// Only convert if there's no data loss (useful for CBOR encoders)
+// PrecisionFromfloat32() can be inlined, so the check costs less than a function call
+if float16.PrecisionFromfloat32(pi) == float16.PrecisionExact {
+	pi16 = float16.Fromfloat32(pi)
+}
 ```
 
 ## Float16 Type and API
@@ -77,10 +84,13 @@ Frombits(b16 uint16) Float16        // Float16 number corresponding to b16 (IEEE
 NaN() Float16                       // Float16 of IEEE 754 binary16 not-a-number
 Inf(sign int) Float16               // Float16 of IEEE 754 binary16 infinity according to sign
 
+PrecisionFromfloat32(f32 float32) Precision  // quickly indicates exact, inexact, overflow, underflow
+                                             // (inline and < 1 ns/op)
 // Exported methods
 (f Float16) Float32() float32       // float32 number converted from f16 using lossless conversion
 (f Float16) Bits() uint16           // the IEEE 754 binary16 representation of f
 (f Float16) IsNaN() bool            // true if f is not-a-number (NaN)
+(f Float16) IsQuietNaN() bool       // true if f is a quiet not-a-number (NaN)
 (f Float16) IsInf(sign int) bool    // true if f is infinite based on sign (-1=NegInf, 0=any, 1=PosInf)
 (f Float16) IsFinite() bool         // true if f is not infinite or NaN
 (f Float16) IsNormal() bool         // true if f is not zero, infinite, subnormal, or NaN.
@@ -90,15 +100,17 @@ Inf(sign int) Float16               // Float16 of IEEE 754 binary16 infinity acc
 See [API](https://godoc.org/github.com/cbor-go/float16) at godoc.org for more info.
 
 ## Benchmarks
-Conversions (in pure Go) are around 2.65 ns/op for float16 to Float32 as well as Float32 to float16 on amd64.
+Conversions (in pure Go) are around 2.65 ns/op for float16 to Float32 as well as Float32 to float16 on amd64. Speeds can vary depending on the input value.
 
-Frombits is included as a canary to catch overoptimized benchmarks. Frombits should be faster than all other functions.
+Frombits is included as a canary to catch overoptimized benchmarks. It should be faster than all other functions except PrecisionFromfloat32.
 ```
 All functions have zero allocations except float16.String().
 
 FromFloat32pi-2  2.59ns ± 0%    // speed using Fromfloat32() to convert a float32 of math.Pi to Float16
 ToFloat32pi-2    2.69ns ± 0%    // speed using Float32() to convert a float16 of math.Pi to float32
-Frombits-2       0.36ns ± 8%    // speed using Frombits() to cast a uint16 to Float16
+Frombits-2       0.29ns ± 5%    // speed using Frombits() to cast a uint16 to Float16
+
+PrecisionFromFloat32-2  0.29ns ± 1%  // speed using PrecisionFromfloat32() to check for overflows, etc.
 ```
 
 ## System Requirements
diff --git a/float16.go b/float16.go
@@ -13,6 +13,60 @@ import (
 // Float16 represents IEEE 754 half-precision floating-point numbers (binary16).
 type Float16 uint16
 
+// Precision indicates whether the conversion to Float16 is
+// exact, inexact, underflow, or overflow.
+type Precision int
+
+const (
+	PrecisionExact Precision = iota
+	PrecisionInexact
+	PrecisionUnderflow
+	PrecisionOverflow
+)
+
+// PrecisionFromfloat32 returns Precision without performing
+// the conversion.  Conversions from both Infinity and NaN
+// values will always report PrecisionExact even if NaN payload
+// or NaN-Quiet-Bit is lost. This function is kept simple to
+// allow inlining and to run in < 0.5 ns/op.
+func PrecisionFromfloat32(f32 float32) Precision {
+	u32 := math.Float32bits(f32)
+
+	if u32 == 0 || u32 == 0x80000000 {
+		// +/- zero always converts exactly
+		return PrecisionExact
+	}
+
+	const COEFMASK uint32 = 0x7fffff // 23 least significant bits
+	const EXPSHIFT uint32 = 23
+	const EXPBIAS uint32 = 127
+	const EXPMASK uint32 = uint32(0xff) << EXPSHIFT
+	const DROPMASK uint32 = COEFMASK >> 10
+
+	exp := int32(((u32 & EXPMASK) >> EXPSHIFT) - EXPBIAS)
+	coef := u32 & COEFMASK
+
+	if exp == 128 {
+		// +- infinity or NaN
+		// apps may want to do extra checks for NaN separately
+		return PrecisionExact
+	}
+	if exp < -14 {
+		// There are 2046 values out of 4+ billion that can round-trip back
+		// to original value with IEEE default rounding despite this underflow.
+		return PrecisionUnderflow
+	}
+	if exp > 15 {
+		return PrecisionOverflow
+	}
+	if (coef & DROPMASK) == uint32(0) {
+		// exponent is within half-precision range and the 13 low
+		// significand bits that float16 drops are all zero, so nothing is lost
+		return PrecisionExact
+	}
+
+	return PrecisionInexact
+}
+
 // Frombits returns the float16 number corresponding to the IEEE 754 binary16
 // representation u16, with the sign bit of u16 and the result in the same bit
 // position. Frombits(Bits(x)) == x.
@@ -26,9 +80,12 @@ func Fromfloat32(f32 float32) Float16 {
 	return Float16(f32bitsToF16bits(math.Float32bits(f32)))
 }
 
-// NaN returns a Float16 with Not-a-Number (NaN) value.
+// NaN returns a Float16 of IEEE 754 binary16 not-a-number (NaN).
+// Returned NaN value 0x7e01 has all exponent bits = 1 with the
+// first and last bits = 1 in the significand. This is consistent
+// with Go's 64-bit math.NaN(). Canonical CBOR in RFC 7049 uses 0x7e00.
 func NaN() Float16 {
-	return Float16(0x7c00 | 0x03ff)
+	return Float16(0x7e01)
 }
 
 // Inf returns a Float16 with an infinity value with the specified sign.
@@ -54,11 +111,17 @@ func (f Float16) Bits() uint16 {
 	return uint16(f)
 }
 
-// IsNaN reports whether f is an IEEE 754 “not-a-number” value.
+// IsNaN reports whether f is an IEEE 754 binary16 “not-a-number” value.
 func (f Float16) IsNaN() bool {
 	return (f&0x7c00 == 0x7c00) && (f&0x03ff != 0)
 }
 
+// IsQuietNaN reports whether f is a quiet (non-signaling) IEEE 754 binary16
+// “not-a-number” value.
+func (f Float16) IsQuietNaN() bool {
+	return (f&0x7c00 == 0x7c00) && (f&0x03ff != 0) && (f&0x0200 != 0)
+}
+
 // IsInf reports whether f is an infinity (inf).
 // A sign > 0 reports whether f is positive inf.
 // A sign < 0 reports whether f is negative inf.
diff --git a/float16_bench_test.go b/float16_bench_test.go
@@ -13,6 +13,7 @@ import (
 var resultF16 float16.Float16
 var resultF32 float32
 var resultStr string
+var pcn float16.Precision
 
 func BenchmarkFloat32pi(b *testing.B) {
 	result := float32(0)
@@ -65,6 +66,17 @@ func BenchmarkFromFloat32subnorm(b *testing.B) {
 	resultF16 = result
 }
 
+func BenchmarkPrecisionFromFloat32(b *testing.B) {
+	var result float16.Precision
+
+	//pi := float32(math.Pi)
+	for i := 0; i < b.N; i++ {
+		f32 := float32(0.00001) + float32(0.00001)
+		result = float16.PrecisionFromfloat32(f32)
+	}
+	pcn = result
+}
+
 func BenchmarkString(b *testing.B) {
 	result := "1.5"
 
diff --git a/float16_test.go b/float16_test.go