Commit 026ced0

docs: polish README — trim verbose API docs, add ecosystem links
1 parent d2a511d commit 026ced0

1 file changed

Lines changed: 24 additions & 138 deletions

README.md

@@ -3,36 +3,26 @@
 [![Go Reference](https://pkg.go.dev/badge/github.com/zerfoo/float8.svg)](https://pkg.go.dev/github.com/zerfoo/float8)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 
-A high-performance Go library implementing IEEE 754 FP8 E4M3FN format for 8-bit floating-point arithmetic, commonly used in machine learning applications for reduced-precision computations.
+FP8 E4M3FN arithmetic library for Go, commonly used in quantized ML inference.
 
-## Features
-
-- **IEEE 754 FP8 E4M3FN Format**: Complete implementation of the 8-bit floating-point format
-- **High Performance**: Optimized arithmetic operations with optional fast lookup tables
-- **Comprehensive API**: Full support for conversion, arithmetic, and mathematical operations
-- **Machine Learning Ready**: Designed for ML workloads requiring reduced precision
-- **Zero Dependencies**: Pure Go implementation with no external dependencies
-
-## Format Specification
-
-The Float8 type uses the E4M3FN variant of IEEE 754 FP8:
+Part of the [Zerfoo](https://github.com/zerfoo) ML ecosystem.
 
-- **1 bit**: Sign (0 = positive, 1 = negative)
-- **4 bits**: Exponent (biased by 7, range [-6, 7])
-- **3 bits**: Mantissa (3 explicit bits, 1 implicit leading bit for normal numbers)
-
-### Special Values
+## Features
 
-- **Zero**: Exponent=0000, Mantissa=000 (both positive and negative)
-- **NaN**: Exponent=1111, Mantissa=111
-- **No Infinities**: The E4M3FN variant does not support infinity values
+- **IEEE 754 FP8 E4M3FN format** — 1 sign, 4 exponent, 3 mantissa bits
+- **Fast lookup tables** — optional pre-computed tables for arithmetic and conversion
+- **Full arithmetic** — add, subtract, multiply, divide, sqrt, abs, neg
+- **No infinities** — the E4M3FN variant uses the infinity encoding for additional finite values
+- **Zero dependencies** — pure Go, no CGo
 
 ## Installation
 
 ```bash
 go get github.com/zerfoo/float8
 ```
 
+Requires Go 1.26+.
+
 ## Quick Start
 
 ```go
@@ -44,144 +34,40 @@ import (
 )
 
 func main() {
-	// Initialize the package (optional, done automatically)
-	float8.Initialize()
-
-	// Create Float8 values from float32
 	a := float8.FromFloat32(3.14)
 	b := float8.FromFloat32(2.71)
-
-	// Perform arithmetic operations
+
 	sum := a.Add(b)
 	product := a.Mul(b)
-
-	// Convert back to float32
+
 	fmt.Printf("a = %f\n", a.ToFloat32())
-	fmt.Printf("b = %f\n", b.ToFloat32())
 	fmt.Printf("a + b = %f\n", sum.ToFloat32())
 	fmt.Printf("a * b = %f\n", product.ToFloat32())
 }
 ```
 
-## Configuration
-
-The library supports various configuration options for performance optimization:
-
-```go
-// Configure with custom settings
-config := &float8.Config{
-	EnableFastArithmetic: true, // Enable lookup tables for faster arithmetic
-	EnableFastConversion: true, // Enable lookup tables for faster conversion
-	DefaultMode: float8.ModeDefault,
-	ArithmeticMode: float8.ArithmeticAuto,
-}
-
-float8.Configure(config)
-```
-
-## API Reference
-
-### Core Types
+## Format
 
-- `Float8`: The main 8-bit floating-point type
-- `Config`: Configuration options for the package
+| Field | Bits | Description |
+|-------|------|-------------|
+| Sign | 1 | 0 = positive, 1 = negative |
+| Exponent | 4 | Biased by 7, range [-6, 7] |
+| Mantissa | 3 | 3 explicit + 1 implicit leading bit |
 
-### Conversion Functions
-
-```go
-// From other numeric types
-func FromFloat32(f float32) Float8
-func FromFloat64(f float64) Float8
-func FromInt(i int) Float8
-
-// To other numeric types
-func (f Float8) ToFloat32() float32
-func (f Float8) ToFloat64() float64
-func (f Float8) ToInt() int
-```
+Special values: ±0 (exp=0, mant=0), NaN (exp=1111, mant=111). No infinities.
 
-### Arithmetic Operations
-
-```go
-func (f Float8) Add(other Float8) Float8
-func (f Float8) Sub(other Float8) Float8
-func (f Float8) Mul(other Float8) Float8
-func (f Float8) Div(other Float8) Float8
-```
-
-### Mathematical Functions
-
-```go
-func (f Float8) Abs() Float8
-func (f Float8) Neg() Float8
-func (f Float8) Sqrt() Float8
-// ... and more
-```
-
-### Utility Functions
-
-```go
-func (f Float8) IsZero() bool
-func (f Float8) IsNaN() bool
-func (f Float8) IsInf() bool
-func (f Float8) String() string
-```
-
-## Performance
-
-The library offers two performance modes:
-
-1. **Standard Mode**: Compact implementation with minimal memory usage
-2. **Fast Mode**: Uses pre-computed lookup tables for faster operations at the cost of memory
-
-Enable fast mode for performance-critical applications:
+## Performance Modes
 
 ```go
+// Enable lookup tables for faster arithmetic (trades memory for speed)
 float8.EnableFastArithmetic()
 float8.EnableFastConversion()
 ```
 
-## Testing
-
-Run the comprehensive test suite:
+## Used By
 
-```bash
-# Run all tests
-go test ./...
-
-# Run tests with coverage
-go test -cover ./...
-
-# Generate coverage report
-go test -coverprofile=coverage.out ./...
-go tool cover -html=coverage.out
-```
-
-## Benchmarks
-
-Run performance benchmarks:
-
-```bash
-go test -bench=. -benchmem ./...
-```
-
-## Use Cases
-
-- **Machine Learning**: Reduced precision training and inference
-- **Neural Networks**: Memory-efficient model parameters
-- **Scientific Computing**: Applications requiring controlled precision
-- **Embedded Systems**: Resource-constrained environments
-
-## Contributing
-
-We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
+- [ztensor](https://github.com/zerfoo/ztensor) — GPU-accelerated tensor library
 
 ## License
 
-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
-
-## Acknowledgments
-
-- IEEE 754 standard for floating-point arithmetic
-- The machine learning community for driving FP8 adoption
-- Contributors and maintainers of this project
+Apache 2.0
