Commit 9af9e96

committed
docs: add design.md for float8 v1.0.0
1 parent 026ced0 commit 9af9e96

1 file changed: +127 additions, 0 deletions

docs/design.md
# float8 Design Document

This document describes the design of `github.com/zerfoo/float8`, a pure-Go implementation of the FP8 E4M3FN format (an IEEE 754-style 8-bit floating-point type) for quantized ML inference.
## 1. FP8 E4M3FN Bit Layout

The E4M3FN format packs a floating-point number into a single byte:

```
 Bit 7    Bits 6-3       Bits 2-0
┌──────┬──────────────┬────────────┐
│ Sign │   Exponent   │  Mantissa  │
│ (1)  │     (4)      │    (3)     │
└──────┴──────────────┴────────────┘
```

| Field | Width | Mask | Description |
|-------|-------|------|-------------|
| Sign | 1 bit | `0x80` | 0 = positive, 1 = negative |
| Exponent | 4 bits | `0x78` | Biased unsigned integer (bias = 7) |
| Mantissa | 3 bits | `0x07` | Explicit significand bits; normal numbers have an implicit leading 1 |

The exponent bias is 7 (`2^(4-1) - 1`), giving an unbiased exponent range of [-6, +8] for stored values 1-15. Stored exponent 0 indicates a subnormal (no implicit leading 1).

**Representable range:** The largest finite value (bit pattern `0x7E` = `0.1111.110`) is 448.0. The smallest positive normal is `0x08` (1.0 x 2^-6 = 0.015625). The smallest positive subnormal is `0x01` (0.001 x 2^-6 = 0.001953125).

**Precision:** With 3 explicit mantissa bits (4 effective bits for normals), the relative spacing between adjacent values is 2^-3 = 12.5%, for a worst-case relative rounding error of about 6.25%. This is adequate for storing quantized weights and activations but not for accumulation; ML frameworks accumulate in float16 or float32.
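The masks in the table above can be applied directly. A minimal sketch of field extraction (the helper name `fields` is illustrative, not part of the library's API):

```go
package main

import "fmt"

// Masks for the E4M3FN byte layout described above.
const (
	signMask = 0x80 // bit 7
	expMask  = 0x78 // bits 6-3
	manMask  = 0x07 // bits 2-0
)

// fields splits an FP8 bit pattern into its sign bit, biased
// exponent, and mantissa.
func fields(b uint8) (sign, exp, man uint8) {
	return (b & signMask) >> 7, (b & expMask) >> 3, b & manMask
}

func main() {
	// 0x7E is the largest finite value: sign 0, exponent 1111, mantissa 110.
	s, e, m := fields(0x7E)
	fmt.Println(s, e, m) // 0 15 6
}
```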
## 2. Lookup Table Strategy

Because the format has only 256 possible bit patterns, exhaustive precomputation is practical:

### Conversion Table

A single 256-entry `[]float32` table maps every `Float8` bit pattern to its exact float32 equivalent. Indexed by `uint8(f)`, a lookup replaces the branch-heavy algorithmic decode path with a single array access. Memory cost: 256 x 4 = **1 KiB**.
### Arithmetic Tables

Each binary operation (add, subtract, multiply, divide) uses a 65,536-entry `[]Float8` table indexed by `uint16(a)<<8 | uint16(b)`. Every (a, b) pair is precomputed once from the algorithmic implementation. Memory cost per table: 65,536 x 1 = **64 KiB** (256 KiB total for all four operations).
### Lazy Initialization

Tables are not allocated at package init. Callers opt in via `EnableFastConversion()` and `EnableFastArithmetic()`, which populate the tables on first call. This keeps the default memory footprint at zero for programs that only need occasional FP8 conversions. Tables can be released with the corresponding `Disable` functions.
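One way the enable/disable lifecycle could be implemented (a sketch of the pattern under a mutex, not the library's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu        sync.Mutex
	convTable []float32 // nil until EnableFastConversion is called
)

// EnableFastConversion populates the 256-entry table on demand;
// repeated calls are no-ops.
func EnableFastConversion() {
	mu.Lock()
	defer mu.Unlock()
	if convTable != nil {
		return
	}
	convTable = make([]float32, 256)
	// ... fill each entry from the algorithmic decoder ...
}

// DisableFastConversion releases the table for garbage collection,
// restoring the zero default footprint.
func DisableFastConversion() {
	mu.Lock()
	defer mu.Unlock()
	convTable = nil
}

func main() {
	fmt.Println(convTable == nil) // true: zero footprint by default
	EnableFastConversion()
	fmt.Println(len(convTable)) // 256
	DisableFastConversion()
	fmt.Println(convTable == nil) // true: released
}
```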
### Mode Selection

Three arithmetic modes control dispatch:

| Mode | Behavior |
|------|----------|
| `ArithmeticAuto` (default) | Use table if loaded, otherwise algorithmic |
| `ArithmeticLookup` | Force table path (panics if tables not loaded) |
| `ArithmeticAlgorithmic` | Force algorithmic path regardless of table state |
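The dispatch described in the table reduces to a three-way switch; a sketch with a stubbed algorithmic path (function and variable names are illustrative):

```go
package main

import "fmt"

type Float8 uint8
type ArithmeticMode int

const (
	ArithmeticAuto ArithmeticMode = iota
	ArithmeticLookup
	ArithmeticAlgorithmic
)

var addTable []Float8 // populated by EnableFastArithmetic in the real library

func addAlgorithmic(a, b Float8) Float8 { return a } // stub for the sketch

// add dispatches per the mode table above.
func add(mode ArithmeticMode, a, b Float8) Float8 {
	switch mode {
	case ArithmeticLookup:
		if addTable == nil {
			panic("float8: lookup mode requires EnableFastArithmetic")
		}
		return addTable[uint16(a)<<8|uint16(b)]
	case ArithmeticAlgorithmic:
		return addAlgorithmic(a, b)
	default: // ArithmeticAuto: table if loaded, else algorithmic
		if addTable != nil {
			return addTable[uint16(a)<<8|uint16(b)]
		}
		return addAlgorithmic(a, b)
	}
}

func main() {
	// No table loaded, so Auto falls back to the algorithmic path.
	fmt.Println(add(ArithmeticAuto, 0x08, 0x10) == addAlgorithmic(0x08, 0x10)) // true
}
```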
## 3. Arithmetic Operations

All arithmetic follows a **convert-up, compute, convert-down** pattern:

1. Convert both `Float8` operands to `float32` (exact, since FP8 is a subset of float32).
2. Perform the operation in float32 precision.
3. Convert the float32 result back to `Float8` with round-to-nearest-even.

This strategy inherits float32 IEEE 754 semantics and avoids implementing carry propagation, alignment shifting, or normalization in 8-bit arithmetic.
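The three steps above can be sketched end-to-end. To stay self-contained, this sketch finds the nearest FP8 pattern by brute force over all 256 values rather than rounding algorithmically; the library's actual encoder implements round-to-nearest-even directly:

```go
package main

import (
	"fmt"
	"math"
)

// decode: algorithmic FP8 E4M3FN -> float32 (always exact).
func decode(b uint8) float32 {
	sign := float32(1)
	if b&0x80 != 0 {
		sign = -1
	}
	exp := int((b >> 3) & 0x0F)
	man := float32(b & 0x07)
	switch {
	case exp == 15 && b&0x07 == 0x07: // NaN
		return float32(math.NaN())
	case exp == 0: // subnormal
		return sign * (man / 8) * 0.015625 // fraction * 2^-6
	default: // normal, implicit leading 1
		return sign * (1 + man/8) * float32(math.Pow(2, float64(exp-7)))
	}
}

// encode finds the nearest FP8 pattern by exhaustive search -- a
// sketch shortcut that naturally saturates to +/-448 on overflow.
func encode(f float32) uint8 {
	best, bestErr := uint8(0), math.Inf(1)
	for i := 0; i < 256; i++ {
		d := float64(decode(uint8(i)))
		if math.IsNaN(d) {
			continue
		}
		if err := math.Abs(d - float64(f)); err < bestErr {
			best, bestErr = uint8(i), err
		}
	}
	return best
}

// Add demonstrates convert-up, compute, convert-down.
func Add(a, b uint8) uint8 {
	return encode(decode(a) + decode(b))
}

func main() {
	one := encode(1.0)
	fmt.Println(decode(Add(one, one))) // 2
}
```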
**Operations provided:** `Add`, `Sub`, `Mul`, `Div`, `Sqrt`, `Pow`, `Exp`, `Log`, `Sin`, `Cos`, `Tan`, `Floor`, `Ceil`, `Round`, `Trunc`, `Fmod`, `Abs`, `Neg`, `Min`, `Max`, `Clamp`, `Lerp`, `CopySign`.

**Comparison operations:** `Equal`, `Less`, `Greater`, `LessEqual`, `GreaterEqual` handle NaN (unordered), signed zeros (+0 == -0), and infinities per IEEE 754 rules.

**Batch operations:** `AddSlice`, `MulSlice`, `ScaleSlice`, `SumSlice` operate element-wise on `[]Float8` slices. `ToSlice8` and `ToSlice32` handle bulk conversion between `[]float32` and `[]Float8`.
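The bulk-dequantization shape is one table lookup per element; a sketch mirroring `ToSlice32` (the table here has only one hand-filled entry so the example stays self-contained):

```go
package main

import "fmt"

type Float8 uint8

// convTable stands in for the 256-entry conversion table from section 2.
var convTable [256]float32

// toSlice32 bulk-dequantizes with one table lookup per element
// (a sketch, not the library's code).
func toSlice32(src []Float8) []float32 {
	dst := make([]float32, len(src))
	for i, v := range src {
		dst[i] = convTable[v]
	}
	return dst
}

func main() {
	convTable[0x38] = 1.0 // 0x38 encodes 1.0 in E4M3FN
	fmt.Println(toSlice32([]Float8{0x38, 0x38, 0x00})) // [1 1 0]
}
```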
## 4. Conversion To/From float32

### float32 to Float8 (`ToFloat8`)

1. Handle special cases first: signed zeros, infinities, NaN.
2. Extract sign, exponent, and mantissa from the float32 IEEE 754 bits.
3. Re-bias the exponent: `exp8 = exp32 - 127 + 7`.
4. Check for overflow (exp8 > 15 -> clamp to infinity) and underflow (exp8 < -7 -> clamp to zero).
5. Truncate the 23-bit mantissa to 3 bits, applying round-to-nearest-even: round up when the dropped bits exceed half an ulp; on an exact tie, round toward the value whose low kept mantissa bit is zero. Handle mantissa carry into the exponent.
6. Pack sign (1 bit), exponent (4 bits), and mantissa (3 bits) into a `uint8`.
Three conversion modes control edge-case behavior:

| Mode | Overflow | Underflow | NaN |
|------|----------|-----------|-----|
| `ModeDefault` | Saturate to infinity | Saturate to zero | Convert to `0x7F` |
| `ModeStrict` | Return error | Return error | Return error |
| `ModeFast` | Use lookup table | Use lookup table | Use lookup table |

### Float8 to float32 (`ToFloat32`)

The conversion is always exact (no rounding) because every FP8 value is representable in float32. The algorithmic path extracts sign, exponent, and mantissa, re-biases the exponent (`exp32 = exp8 - 7 + 127`), shifts the 3-bit mantissa to float32 position (left-shift by 20), and assembles the 32-bit IEEE 754 pattern. With the conversion table enabled, this reduces to a single array lookup.
## 5. No-Infinities Design Rationale

The E4M3FN format (the "FN" stands for "Finite, NaN") intentionally eliminates infinity encodings to maximize the finite representable range:

- In standard IEEE 754, the all-ones exponent (`1111`) with a zero mantissa encodes infinity. E4M3FN repurposes this encoding as a normal finite value, extending the maximum magnitude from 240 to **448**.
- Only the all-ones exponent with all-ones mantissa (`0x7F`, `0xFF`) is reserved for NaN. This gives exactly two NaN encodings (positive and negative) instead of the usual 14 quiet/signaling NaN patterns.
- ML inference rarely produces or consumes infinities. Overflows during quantized GEMM/GEMV saturate to the maximum representable value rather than propagating infinity, which is more numerically stable for downstream operations like softmax and layer normalization.

**Note:** The current implementation defines `PositiveInfinity` and `NegativeInfinity` constants for API compatibility with IEEE 754 conventions (e.g., overflow from float32 conversion maps to these bit patterns), but in E4M3FN semantics these are finite values equal to +/-448.
## 6. Use in ML Inference

FP8 E4M3FN is the standard quantization format for weights and activations in transformer inference:

### Quantized Storage

Model weights stored in GGUF files use FP8 to reduce memory bandwidth by 4x compared to float32. The `ToSlice8`/`ToSlice32` batch conversion functions support bulk quantization and dequantization with negative-zero preservation.

### GEMM/GEMV Kernels

In the Zerfoo ecosystem, `ztensor` imports `float8` to implement quantized matrix-multiply kernels. Weights are stored as `[]Float8` and dequantized to float16 or float32 in register before the fused multiply-accumulate. Accumulation always occurs in higher precision to avoid catastrophic rounding error.

### Where FP8 Fits in the Precision Hierarchy

| Type | Bits | Use Case |
|------|------|----------|
| float32 | 32 | Accumulation, loss computation, optimizer state |
| float16 / bfloat16 | 16 | Activations, KV cache, intermediate results |
| **float8 (E4M3FN)** | **8** | **Weight storage, activation quantization** |
| int4 (Q4_K_M) | 4 | Aggressive weight quantization (GGUF) |

### Scope

This library covers E4M3FN only. The E5M2 variant (5 exponent bits, 2 mantissa bits, with infinities) is planned for a future release (see T46.4.8) and targets gradient storage in mixed-precision training, where the wider dynamic range matters more than precision.
