 [![Go Reference](https://pkg.go.dev/badge/github.com/zerfoo/float16.svg)](https://pkg.go.dev/github.com/zerfoo/float16)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

-A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.
+IEEE 754-2008 half-precision (Float16) and BFloat16 arithmetic library for Go.
+
+Part of the [Zerfoo](https://github.com/zerfoo) ML ecosystem.

 ## Features

 - **Full IEEE 754-2008 compliance** for 16-bit floating-point arithmetic
-- **Complete special value support**: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
-- **Multiple rounding modes**: nearest-even, toward zero, toward ±∞, nearest-away
-- **Flexible conversion modes**: IEEE standard, strict error handling, fast approximations
-- **High-performance operations** with optional fast math optimizations
-- **Comprehensive test suite** with extensive edge case coverage
-- **Zero dependencies** - pure Go implementation
+- **BFloat16 support** — Google Brain format for ML training and inference
+- **Special value handling** — ±0, ±Inf, NaN (with payload), normalized and subnormal numbers
+- **Multiple rounding modes** — nearest-even, toward zero, toward ±Inf, nearest-away
+- **Vectorized operations** — batch add, multiply, and dot product
+- **Fast math mode** — optional lookup-table acceleration for performance-critical paths
+- **Zero dependencies** — pure Go, no CGo

 ## Installation

 ```bash
 go get github.com/zerfoo/float16
 ```

+Requires Go 1.26+.
+
 ## Quick Start

 ```go
@@ -32,285 +36,61 @@ import (
 )

 func main() {
-	// Create float16 values
 	a := float16.FromFloat32(3.14159)
-	b := float16.FromFloat64(2.71828)
-
-	// Basic arithmetic
+	b := float16.FromFloat32(2.71828)
+
 	sum := a.Add(b)
 	product := a.Mul(b)
-
-	// Convert back to other types
-	fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
-	fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
-
-	// Work with special values
-	inf := float16.Inf(1)   // positive infinity
-	nan := float16.NaN()    // quiet NaN
-	zero := float16.Zero()  // positive zero
-
-	fmt.Printf("Infinity: %v\n", inf)
-	fmt.Printf("NaN: %v\n", nan)
-	fmt.Printf("Zero: %v\n", zero)
-}
-```
-
-## Core Types and Constants

-### Float16 Type
+	fmt.Printf("Sum: %f\n", sum.ToFloat32())
+	fmt.Printf("Product: %f\n", product.ToFloat32())

-The `Float16` type represents a 16-bit IEEE 754 half-precision floating-point value:
-
-```go
-type Float16 uint16
-```
-
-### Special Values
-
-```go
-const (
-	PositiveZero     Float16 = 0x0000 // +0.0
-	NegativeZero     Float16 = 0x8000 // -0.0
-	PositiveInfinity Float16 = 0x7C00 // +∞
-	NegativeInfinity Float16 = 0xFC00 // -∞
-	MaxValue         Float16 = 0x7BFF // ~65504
-	MinValue         Float16 = 0xFBFF // ~-65504
-)
+	// Special values
+	inf := float16.Inf(1)
+	fmt.Printf("Inf: %v, IsInf: %v\n", inf, inf.IsInf(0))
+}
 ```

-## Conversion Functions
-
-### From Other Types
+## Conversion

 ```go
 // From float32/float64
-f16 := float16.FromFloat32(3.14159)
-f16 := float16.FromFloat64(2.71828)
+f16 := float16.FromFloat32(3.14)
+f16 := float16.FromFloat64(2.718)

 // From bit representation
 f16 := float16.FromBits(0x4200) // 3.0

-// From string
-f16, err := float16.ParseFloat("3.14159", 32)
-```
-
-### To Other Types
-
-```go
+// Back to native types
 f32 := f16.ToFloat32()
 f64 := f16.ToFloat64()
-bits := f16.Bits()
-str := f16.String()
-```
-
-## Arithmetic Operations
-
-```go
-a := float16.FromFloat32(5.0)
-b := float16.FromFloat32(3.0)
-
-// Basic arithmetic
-sum := a.Add(b)      // 8.0
-diff := a.Sub(b)     // 2.0
-product := a.Mul(b)  // 15.0
-quotient := a.Div(b) // 1.666...
-
-// Mathematical functions
-sqrt := a.Sqrt() // √5
-abs := a.Abs()   // |a|
-neg := a.Neg()   // -a
 ```
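The `FromBits(0x4200) // 3.0` comment in the block above follows directly from the binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits). The sketch below decodes a bit pattern using only the standard library. `halfBitsToFloat32` is a hypothetical helper written for illustration; it is not part of the `float16` package and says nothing about how the package implements conversion.

```go
package main

import (
	"fmt"
	"math"
)

// halfBitsToFloat32 decodes an IEEE 754 binary16 bit pattern.
// Hypothetical helper for illustration only; NaN payloads are not preserved.
func halfBitsToFloat32(h uint16) float32 {
	sign := uint32(h>>15) & 1
	exp := uint32(h>>10) & 0x1F
	frac := uint32(h) & 0x3FF

	var v float64
	switch {
	case exp == 0:
		// Zero or subnormal: frac * 2^-24.
		v = math.Ldexp(float64(frac), -24)
	case exp == 0x1F:
		// All-ones exponent: infinity if the fraction is zero, else NaN.
		if frac != 0 {
			return float32(math.NaN())
		}
		v = math.Inf(1)
	default:
		// Normal: (1 + frac/1024) * 2^(exp-15), written as (1024+frac) * 2^(exp-25).
		v = math.Ldexp(float64(1024+frac), int(exp)-25)
	}
	if sign == 1 {
		v = -v
	}
	return float32(v)
}

func main() {
	fmt.Println(halfBitsToFloat32(0x4200)) // 3
	fmt.Println(halfBitsToFloat32(0x7BFF)) // 65504
	fmt.Println(halfBitsToFloat32(0xFC00)) // -Inf
}
```

For `0x4200` the exponent field is 16 and the fraction field is 512, so the value is (1 + 512/1024) × 2¹ = 3.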

 ## Rounding Modes

-Configure rounding behavior for conversions:
-
 ```go
-import "github.com/zerfoo/float16"
-
-// Set global rounding mode
 config := float16.GetConfig()
 config.DefaultRoundingMode = float16.RoundTowardZero
 float16.Configure(config)

-// Available rounding modes:
-// - RoundNearestEven (default)
-// - RoundTowardZero
-// - RoundTowardPositive
-// - RoundTowardNegative
-// - RoundNearestAway
-```
-
-## Conversion Modes
-
-Control conversion behavior and error handling:
-
-```go
-config := float16.GetConfig()
-config.DefaultConversionMode = float16.ModeStrict
-float16.Configure(config)
-
-// Available modes:
-// - ModeIEEE: Standard IEEE 754 behavior
-// - ModeStrict: Returns errors for overflow/underflow
-// - ModeFast: Optimized for performance
-```
-
-## Special Value Handling
-
-```go
-f := float16.FromFloat32(math.Inf(1))
-
-// Check value types
-if f.IsInf(0) {
-	fmt.Println("Value is infinity")
-}
-if f.IsNaN() {
-	fmt.Println("Value is NaN")
-}
-if f.IsFinite() {
-	fmt.Println("Value is finite")
-}
-if f.IsNormal() {
-	fmt.Println("Value is normalized")
-}
-if f.IsSubnormal() {
-	fmt.Println("Value is subnormal")
-}
-
-// IEEE 754 classification
-class := f.Class()
-switch class {
-case float16.ClassPositiveInfinity:
-	fmt.Println("Positive infinity")
-case float16.ClassQuietNaN:
-	fmt.Println("Quiet NaN")
-// ... other classes
-}
-```
-
-## Performance Features
-
-### Fast Math Operations
-
-```go
-// Enable fast math for better performance (may sacrifice precision)
-config := float16.GetConfig()
-config.EnableFastMath = true
-float16.Configure(config)
-
-// Use fast operations
-result := float16.FastAdd(a, b)
-result := float16.FastMul(a, b)
-```
-
-### Vectorized Operations
-
-```go
-// Vectorized operations (optimized for SIMD when available)
-a := []float16.Float16{...}
-b := []float16.Float16{...}
-
-sum := float16.VectorAdd(a, b)
-product := float16.VectorMul(a, b)
-```
-
-## Error Handling
-
-```go
-// Strict mode returns errors for exceptional conditions
-config := float16.GetConfig()
-config.DefaultConversionMode = float16.ModeStrict
-float16.Configure(config)
-
-f16, err := float16.FromFloat32WithMode(1e10, float16.ModeStrict)
-if err != nil {
-	if float16Err, ok := err.(*float16.Float16Error); ok {
-		switch float16Err.Code {
-		case float16.ErrOverflow:
-			fmt.Println("Value too large for float16")
-		case float16.ErrUnderflow:
-			fmt.Println("Value too small for float16")
-		}
-	}
-}
-```
-
-## Utilities
-
-### Statistics for Slices
-
-```go
-values := []float16.Float16{
-	float16.FromFloat32(1.0),
-	float16.FromFloat32(2.0),
-	float16.FromFloat32(3.0),
-}
-
-stats := float16.ComputeSliceStats(values)
-fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)
-```
-
-### Debugging and Monitoring
-
-```go
-// Get memory usage
-usage := float16.GetMemoryUsage()
-fmt.Printf("Memory usage: %d bytes\n", usage)
-
-// Get debug information
-debug := float16.DebugInfo()
-fmt.Printf("Debug info: %+v\n", debug)
-```
-
-## Benchmarking
-
-The package includes built-in benchmarking utilities:
-
-```go
-ops := float16.GetBenchmarkOperations()
-for name, op := range ops {
-	// Benchmark operation
-	fmt.Printf("Benchmarking %s\n", name)
-}
+// RoundNearestEven (default), RoundTowardZero, RoundTowardPositive,
+// RoundTowardNegative, RoundNearestAway
 ```
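To see what the choice between `RoundNearestEven` and `RoundTowardZero` means at the bit level, here is a minimal stdlib-only sketch of a float32 to binary16 narrowing. `roundToHalfBits` is a hypothetical function written for this illustration, not the package's implementation, and it assumes a finite input whose result is a normal half-precision value (no overflow, subnormal, or NaN handling).

```go
package main

import (
	"fmt"
	"math"
)

// roundToHalfBits narrows f to binary16 bits.
// Hypothetical sketch: assumes f is finite and lands in the normal half range.
func roundToHalfBits(f float32, nearestEven bool) uint16 {
	bits := math.Float32bits(f)
	sign := uint16(bits>>16) & 0x8000
	exp := int32((bits>>23)&0xFF) - 127 + 15 // rebias the 8-bit exponent to 5 bits
	mant := bits & 0x7FFFFF

	// Dropping the low 13 fraction bits truncates the magnitude,
	// which is rounding toward zero.
	half := uint32(exp)<<10 | mant>>13

	if nearestEven {
		rem := mant & 0x1FFF // the 13 discarded bits; 0x1000 is exactly half a unit
		if rem > 0x1000 || (rem == 0x1000 && half&1 == 1) {
			half++ // a carry out of the fraction correctly bumps the exponent
		}
	}
	return sign | uint16(half)
}

func main() {
	// 1 + 3/2048 lies exactly halfway between 0x3C01 and 0x3C02.
	x := float32(1 + 3.0/2048)
	fmt.Printf("toward zero:  0x%04X\n", roundToHalfBits(x, false)) // 0x3C01
	fmt.Printf("nearest even: 0x%04X\n", roundToHalfBits(x, true))  // 0x3C02
}
```

On a tie, nearest-even picks the neighbor whose last fraction bit is zero, so `1 + 3/2048` rounds up to `0x3C02` while `1 + 1/2048` rounds down to `0x3C00`.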

 ## Range and Precision

-Float16 has the following characteristics:
-
-- **Range**: ±6.55×10⁴ (approximately ±65,504)
-- **Precision**: ~3-4 decimal digits
-- **Smallest positive normal**: ~6.10×10⁻⁵
-- **Smallest positive subnormal**: ~5.96×10⁻⁸
-- **Machine epsilon**: ~9.77×10⁻⁴
-
-## Use Cases
-
-Float16 is ideal for:
+| Property | Value |
+|----------|-------|
+| Range | ±65,504 |
+| Precision | ~3-4 decimal digits |
+| Smallest normal | ~6.10 × 10⁻⁵ |
+| Smallest subnormal | ~5.96 × 10⁻⁸ |
+| Machine epsilon | ~9.77 × 10⁻⁴ |

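The values in the Range and Precision table are not arbitrary; they follow from the binary16 parameters (10 fraction bits, exponent bias 15). The sketch below recomputes them with the standard library alone. The `halfLimits` type and `binary16Limits` function are hypothetical names chosen for this example and do not exist in the `float16` package.

```go
package main

import (
	"fmt"
	"math"
)

// halfLimits holds the characteristic values of IEEE 754 binary16.
// Hypothetical type for illustration only.
type halfLimits struct {
	Max          float64 // largest finite value
	MinNormal    float64 // smallest positive normal value
	MinSubnormal float64 // smallest positive subnormal value
	Epsilon      float64 // gap between 1.0 and the next representable value
	Digits       float64 // approximate decimal digits of precision
}

func binary16Limits() halfLimits {
	const (
		fracBits = 10
		bias     = 15
		maxExp   = 30 - bias // largest exponent of a finite value: 15
		minExp   = 1 - bias  // exponent of the smallest normal value: -14
	)
	eps := math.Ldexp(1, -fracBits) // 2^-10
	return halfLimits{
		Max:          (2 - eps) * math.Ldexp(1, maxExp), // (2 - 2^-10) * 2^15
		MinNormal:    math.Ldexp(1, minExp),             // 2^-14
		MinSubnormal: math.Ldexp(1, minExp-fracBits),    // 2^-24
		Epsilon:      eps,
		Digits:       (fracBits + 1) * math.Log10(2), // 11 significant bits
	}
}

func main() {
	l := binary16Limits()
	fmt.Printf("max:           %g\n", l.Max)
	fmt.Printf("min normal:    %g\n", l.MinNormal)
	fmt.Printf("min subnormal: %g\n", l.MinSubnormal)
	fmt.Printf("epsilon:       %g\n", l.Epsilon)
	fmt.Printf("digits:        %.2f\n", l.Digits)
}
```

The 11 significant bits (10 stored plus the implicit leading 1) work out to about 3.3 decimal digits, which is where the "3-4 decimal digits" figure comes from.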
-- **Machine Learning**: Reduced memory usage and faster training
-- **Graphics Programming**: Color values, texture coordinates
-- **Scientific Computing**: Large datasets where precision can be traded for memory
-- **Embedded Systems**: Memory-constrained environments
-- **Data Compression**: Storing floating-point data more efficiently
+## Used By

-## Performance Considerations
-
-- Conversions between float16 and float32/float64 have computational overhead
-- Native float16 arithmetic is generally faster than conversion-based approaches
-- Enable fast math mode for performance-critical applications where precision can be sacrificed
-- Use vectorized operations for bulk processing
-
-## Contributing
-
-We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
+- [ztensor](https://github.com/zerfoo/ztensor) — GPU-accelerated tensor library

 ## License

-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
-
-## References
-
-- [IEEE 754-2008 Standard](https://ieeexplore.ieee.org/document/4610935)
-- [Half-precision floating-point format](https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
+Apache 2.0