Commit 019e58c

docs: polish README — trim verbose API docs, add ecosystem links
1 parent a847fc7 commit 019e58c

1 file changed

+35
-255
lines changed

README.md

Lines changed: 35 additions & 255 deletions
@@ -3,24 +3,28 @@
33
[![Go Reference](https://pkg.go.dev/badge/github.com/zerfoo/float16.svg)](https://pkg.go.dev/github.com/zerfoo/float16)
44
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
55

6-
A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.
6+
IEEE 754-2008 half-precision (Float16) and BFloat16 arithmetic library for Go.
7+
8+
Part of the [Zerfoo](https://github.com/zerfoo) ML ecosystem.
79

810
## Features
911

1012
- **Full IEEE 754-2008 compliance** for 16-bit floating-point arithmetic
11-
- **Complete special value support**: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
12-
- **Multiple rounding modes**: nearest-even, toward zero, toward ±∞, nearest-away
13-
- **Flexible conversion modes**: IEEE standard, strict error handling, fast approximations
14-
- **High-performance operations** with optional fast math optimizations
15-
- **Comprehensive test suite** with extensive edge case coverage
16-
- **Zero dependencies** - pure Go implementation
13+
- **BFloat16 support** — Google Brain format for ML training and inference
14+
- **Special value handling** — ±0, ±Inf, NaN (with payload), normalized and subnormal numbers
15+
- **Multiple rounding modes** — nearest-even, toward zero, toward ±Inf, nearest-away
16+
- **Vectorized operations** — batch add, multiply, and dot product
17+
- **Fast math mode** — optional lookup-table acceleration for performance-critical paths
18+
- **Zero dependencies** - pure Go, no CGo
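As background for the BFloat16 bullet above: a bfloat16 value has the same sign and exponent layout as a float32 but keeps only the top 7 fraction bits, so it can be illustrated as the upper 16 bits of a float32. The sketch below uses only the standard library; `toBFloat16Bits` and `fromBFloat16Bits` are hypothetical helper names, not functions from this package, and the conversion truncates rather than applying any of the rounding modes listed above.

```go
package main

import (
	"fmt"
	"math"
)

// toBFloat16Bits keeps the upper 16 bits of a float32 bit pattern.
// Hypothetical helper for illustration only; it truncates (no rounding).
func toBFloat16Bits(f float32) uint16 {
	return uint16(math.Float32bits(f) >> 16)
}

// fromBFloat16Bits widens a bfloat16 bit pattern back to float32 by
// filling the discarded low 16 fraction bits with zeros.
func fromBFloat16Bits(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	b := toBFloat16Bits(3.14159)
	// Prints: bits: 0x4049, value: 3.140625
	fmt.Printf("bits: 0x%04X, value: %v\n", b, fromBFloat16Bits(b))
}
```

The round trip loses everything below the 7th fraction bit, which is why 3.14159 comes back as 3.140625.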
1719

1820
## Installation
1921

2022
```bash
2123
go get github.com/zerfoo/float16
2224
```
2325

26+
Requires Go 1.26+.
27+
2428
## Quick Start
2529

2630
```go
@@ -32,285 +36,61 @@ import (
3236
)
3337

3438
func main() {
35-
// Create float16 values
3639
a := float16.FromFloat32(3.14159)
37-
b := float16.FromFloat64(2.71828)
38-
39-
// Basic arithmetic
40+
b := float16.FromFloat32(2.71828)
41+
4042
sum := a.Add(b)
4143
product := a.Mul(b)
42-
43-
// Convert back to other types
44-
fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
45-
fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
46-
47-
// Work with special values
48-
inf := float16.Inf(1) // positive infinity
49-
nan := float16.NaN() // quiet NaN
50-
zero := float16.Zero() // positive zero
51-
52-
fmt.Printf("Infinity: %v\n", inf)
53-
fmt.Printf("NaN: %v\n", nan)
54-
fmt.Printf("Zero: %v\n", zero)
55-
}
56-
```
57-
58-
## Core Types and Constants
5944

60-
### Float16 Type
45+
fmt.Printf("Sum: %f\n", sum.ToFloat32())
46+
fmt.Printf("Product: %f\n", product.ToFloat32())
6147

62-
The `Float16` type represents a 16-bit IEEE 754 half-precision floating-point value:
63-
64-
```go
65-
type Float16 uint16
66-
```
67-
68-
### Special Values
69-
70-
```go
71-
const (
72-
PositiveZero Float16 = 0x0000 // +0.0
73-
NegativeZero Float16 = 0x8000 // -0.0
74-
PositiveInfinity Float16 = 0x7C00 // +∞
75-
NegativeInfinity Float16 = 0xFC00 // -∞
76-
MaxValue Float16 = 0x7BFF // ~65504
77-
MinValue Float16 = 0xFBFF // ~-65504
78-
)
48+
// Special values
49+
inf := float16.Inf(1)
50+
fmt.Printf("Inf: %v, IsInf: %v\n", inf, inf.IsInf(0))
51+
}
7952
```
8053

81-
## Conversion Functions
82-
83-
### From Other Types
54+
## Conversion
8455

8556
```go
8657
// From float32/float64
87-
f16 := float16.FromFloat32(3.14159)
88-
f16 := float16.FromFloat64(2.71828)
58+
f16 := float16.FromFloat32(3.14)
59+
f16 = float16.FromFloat64(2.718)
8960

9061
// From bit representation
9162
f16 = float16.FromBits(0x4200) // 3.0
9263

93-
// From string
94-
f16, err := float16.ParseFloat("3.14159", 32)
95-
```
96-
97-
### To Other Types
98-
99-
```go
64+
// Back to native types
10065
f32 := f16.ToFloat32()
10166
f64 := f16.ToFloat64()
102-
bits := f16.Bits()
103-
str := f16.String()
104-
```
105-
106-
## Arithmetic Operations
107-
108-
```go
109-
a := float16.FromFloat32(5.0)
110-
b := float16.FromFloat32(3.0)
111-
112-
// Basic arithmetic
113-
sum := a.Add(b) // 8.0
114-
diff := a.Sub(b) // 2.0
115-
product := a.Mul(b) // 15.0
116-
quotient := a.Div(b) // 1.666...
117-
118-
// Mathematical functions
119-
sqrt := a.Sqrt() // √5
120-
abs := a.Abs() // |a|
121-
neg := a.Neg() // -a
12267
```
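To see why the bit pattern `0x4200` in the conversion snippet corresponds to 3.0, it can help to decode the binary16 layout by hand: 1 sign bit, 5 exponent bits with a bias of 15, and 10 fraction bits. The following is a minimal standard-library sketch of that decoding; `decodeHalf` is a hypothetical name and is not claimed to be how the package implements `ToFloat64`.

```go
package main

import (
	"fmt"
	"math"
)

// decodeHalf interprets a binary16 bit pattern as a float64.
// Hypothetical helper for illustration; not part of the float16 package.
func decodeHalf(bits uint16) float64 {
	sign := 1.0
	if bits&0x8000 != 0 {
		sign = -1.0
	}
	exp := int(bits>>10) & 0x1F     // 5-bit biased exponent
	frac := float64(bits & 0x03FF)  // 10-bit fraction field

	switch exp {
	case 0:
		// Zero or subnormal: no implicit leading 1, fixed exponent of -14.
		return sign * math.Ldexp(frac, -24)
	case 0x1F:
		// All-ones exponent: infinity if the fraction is zero, NaN otherwise.
		if frac == 0 {
			return sign * math.Inf(1)
		}
		return math.NaN()
	default:
		// Normal: (1 + frac/1024) * 2^(exp-15) == (1024 + frac) * 2^(exp-25).
		return sign * math.Ldexp(1024+frac, exp-25)
	}
}

func main() {
	fmt.Println(decodeHalf(0x4200)) // 3
	fmt.Println(decodeHalf(0x7BFF)) // 65504
}
```

For `0x4200` the exponent field is 16 and the fraction field is 512, giving (1 + 512/1024) × 2¹ = 3.0.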
12368

12469
## Rounding Modes
12570

126-
Configure rounding behavior for conversions:
127-
12871
```go
129-
import "github.com/zerfoo/float16"
130-
131-
// Set global rounding mode
13272
config := float16.GetConfig()
13373
config.DefaultRoundingMode = float16.RoundTowardZero
13474
float16.Configure(config)
13575

136-
// Available rounding modes:
137-
// - RoundNearestEven (default)
138-
// - RoundTowardZero
139-
// - RoundTowardPositive
140-
// - RoundTowardNegative
141-
// - RoundNearestAway
142-
```
143-
144-
## Conversion Modes
145-
146-
Control conversion behavior and error handling:
147-
148-
```go
149-
config := float16.GetConfig()
150-
config.DefaultConversionMode = float16.ModeStrict
151-
float16.Configure(config)
152-
153-
// Available modes:
154-
// - ModeIEEE: Standard IEEE 754 behavior
155-
// - ModeStrict: Returns errors for overflow/underflow
156-
// - ModeFast: Optimized for performance
157-
```
158-
159-
## Special Value Handling
160-
161-
```go
162-
f := float16.FromFloat32(math.Inf(1))
163-
164-
// Check value types
165-
if f.IsInf(0) {
166-
fmt.Println("Value is infinity")
167-
}
168-
if f.IsNaN() {
169-
fmt.Println("Value is NaN")
170-
}
171-
if f.IsFinite() {
172-
fmt.Println("Value is finite")
173-
}
174-
if f.IsNormal() {
175-
fmt.Println("Value is normalized")
176-
}
177-
if f.IsSubnormal() {
178-
fmt.Println("Value is subnormal")
179-
}
180-
181-
// IEEE 754 classification
182-
class := f.Class()
183-
switch class {
184-
case float16.ClassPositiveInfinity:
185-
fmt.Println("Positive infinity")
186-
case float16.ClassQuietNaN:
187-
fmt.Println("Quiet NaN")
188-
// ... other classes
189-
}
190-
```
191-
192-
## Performance Features
193-
194-
### Fast Math Operations
195-
196-
```go
197-
// Enable fast math for better performance (may sacrifice precision)
198-
config := float16.GetConfig()
199-
config.EnableFastMath = true
200-
float16.Configure(config)
201-
202-
// Use fast operations
203-
result := float16.FastAdd(a, b)
204-
result := float16.FastMul(a, b)
205-
```
206-
207-
### Vectorized Operations
208-
209-
```go
210-
// Vectorized operations (optimized for SIMD when available)
211-
a := []float16.Float16{...}
212-
b := []float16.Float16{...}
213-
214-
sum := float16.VectorAdd(a, b)
215-
product := float16.VectorMul(a, b)
216-
```
217-
218-
## Error Handling
219-
220-
```go
221-
// Strict mode returns errors for exceptional conditions
222-
config := float16.GetConfig()
223-
config.DefaultConversionMode = float16.ModeStrict
224-
float16.Configure(config)
225-
226-
f16, err := float16.FromFloat32WithMode(1e10, float16.ModeStrict)
227-
if err != nil {
228-
if float16Err, ok := err.(*float16.Float16Error); ok {
229-
switch float16Err.Code {
230-
case float16.ErrOverflow:
231-
fmt.Println("Value too large for float16")
232-
case float16.ErrUnderflow:
233-
fmt.Println("Value too small for float16")
234-
}
235-
}
236-
}
237-
```
238-
239-
## Utilities
240-
241-
### Statistics for Slices
242-
243-
```go
244-
values := []float16.Float16{
245-
float16.FromFloat32(1.0),
246-
float16.FromFloat32(2.0),
247-
float16.FromFloat32(3.0),
248-
}
249-
250-
stats := float16.ComputeSliceStats(values)
251-
fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)
252-
```
253-
254-
### Debugging and Monitoring
255-
256-
```go
257-
// Get memory usage
258-
usage := float16.GetMemoryUsage()
259-
fmt.Printf("Memory usage: %d bytes\n", usage)
260-
261-
// Get debug information
262-
debug := float16.DebugInfo()
263-
fmt.Printf("Debug info: %+v\n", debug)
264-
```
265-
266-
## Benchmarking
267-
268-
The package includes built-in benchmarking utilities:
269-
270-
```go
271-
ops := float16.GetBenchmarkOperations()
272-
for name, op := range ops {
273-
// Benchmark operation
274-
fmt.Printf("Benchmarking %s\n", name)
275-
}
76+
// RoundNearestEven (default), RoundTowardZero, RoundTowardPositive,
77+
// RoundTowardNegative, RoundNearestAway
27678
```
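The five mode names differ only in how a value that falls between two representable half-precision numbers is resolved. The sketch below illustrates this with the standard library alone, rounding a float64 to an 11-bit significand. `roundToHalfGrid` and the `modes` map are hypothetical, the mapping from mode names to `math` functions is an interpretation of the names rather than confirmed library behavior, and overflow and subnormal handling are deliberately left out.

```go
package main

import (
	"fmt"
	"math"
)

// modes maps the README's rounding-mode names to standard-library rounding
// functions. This mapping is an assumption made for illustration.
var modes = map[string]func(float64) float64{
	"RoundNearestEven":    math.RoundToEven,
	"RoundNearestAway":    math.Round, // ties away from zero
	"RoundTowardZero":     math.Trunc,
	"RoundTowardPositive": math.Ceil,
	"RoundTowardNegative": math.Floor,
}

// roundToHalfGrid rounds x to 11 significant bits (binary16's precision,
// counting the implicit leading bit) using the given rounding function.
// It is only meaningful for values inside binary16's normal range.
func roundToHalfGrid(x float64, round func(float64) float64) float64 {
	frac, exp := math.Frexp(x)     // x = frac * 2^exp with 0.5 <= |frac| < 1
	scaled := math.Ldexp(frac, 11) // |scaled| lies in [1024, 2048)
	return math.Ldexp(round(scaled), exp-11)
}

func main() {
	// Exactly halfway between 1 and the next half-precision value, 1 + 2^-10.
	x := 1 + math.Ldexp(1, -11)
	fmt.Println(roundToHalfGrid(x, modes["RoundNearestEven"])) // 1
	fmt.Println(roundToHalfGrid(x, modes["RoundNearestAway"])) // 1.0009765625
	fmt.Println(roundToHalfGrid(x, modes["RoundTowardZero"]))  // 1
}
```

A tie is the only case where nearest-even and nearest-away disagree; the directed modes differ for any value that is not exactly representable.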
27779

27880
## Range and Precision
27981

280-
Float16 has the following characteristics:
281-
282-
- **Range**: ±6.55×10⁴ (approximately ±65,504)
283-
- **Precision**: ~3-4 decimal digits
284-
- **Smallest positive normal**: ~6.10×10⁻⁵
285-
- **Smallest positive subnormal**: ~5.96×10⁻⁸
286-
- **Machine epsilon**: ~9.77×10⁻⁴
287-
288-
## Use Cases
289-
290-
Float16 is ideal for:
82+
| Property | Value |
83+
|----------|-------|
84+
| Range | ±65,504 |
85+
| Precision | ~3-4 decimal digits |
86+
| Smallest normal | ~6.10 × 10⁻⁵ |
87+
| Smallest subnormal | ~5.96 × 10⁻⁸ |
88+
| Machine epsilon | ~9.77 × 10⁻⁴ |
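The values in the table follow directly from the binary16 format parameters (10 fraction bits, normal exponents from -14 to 15). As a quick check, the sketch below derives them with the standard library; `halfLimits` and the constant names are hypothetical and do not come from the package.

```go
package main

import (
	"fmt"
	"math"
)

const (
	fracBits = 10  // stored fraction bits in binary16
	maxExp   = 15  // largest unbiased exponent of a finite value
	minExp   = -14 // smallest unbiased exponent of a normal value
)

// halfLimits derives the binary16 limits from the format parameters.
// Hypothetical helper for illustration; not part of the float16 package.
func halfLimits() (maxFinite, minNormal, minSubnormal, epsilon float64) {
	maxFinite = (2 - math.Ldexp(1, -fracBits)) * math.Ldexp(1, maxExp) // (2 - 2^-10) * 2^15
	minNormal = math.Ldexp(1, minExp)                                  // 2^-14
	minSubnormal = math.Ldexp(1, minExp-fracBits)                      // 2^-24
	epsilon = math.Ldexp(1, -fracBits)                                 // 2^-10
	return
}

func main() {
	maxFinite, minNormal, minSubnormal, epsilon := halfLimits()
	fmt.Printf("max finite:         %g\n", maxFinite)
	fmt.Printf("smallest normal:    %.3g\n", minNormal)
	fmt.Printf("smallest subnormal: %.3g\n", minSubnormal)
	fmt.Printf("machine epsilon:    %.3g\n", epsilon)
}
```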
29189

292-
- **Machine Learning**: Reduced memory usage and faster training
293-
- **Graphics Programming**: Color values, texture coordinates
294-
- **Scientific Computing**: Large datasets where precision can be traded for memory
295-
- **Embedded Systems**: Memory-constrained environments
296-
- **Data Compression**: Storing floating-point data more efficiently
90+
## Used By
29791

298-
## Performance Considerations
299-
300-
- Conversions between float16 and float32/float64 have computational overhead
301-
- Native float16 arithmetic is generally faster than conversion-based approaches
302-
- Enable fast math mode for performance-critical applications where precision can be sacrificed
303-
- Use vectorized operations for bulk processing
304-
305-
## Contributing
306-
307-
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
92+
- [ztensor](https://github.com/zerfoo/ztensor) — GPU-accelerated tensor library
30893

30994
## License
31095

311-
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
312-
313-
## References
314-
315-
- [IEEE 754-2008 Standard](https://ieeexplore.ieee.org/document/4610935)
316-
- [Half-precision floating-point format](https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
96+
Apache 2.0
