| 1 | +# float8 Design Document |
| 2 | + |
| 3 | +This document describes the design of `github.com/zerfoo/float8`, a pure-Go implementation of the FP8 E4M3FN format, an IEEE 754-style 8-bit floating-point encoding used for quantized ML inference. |
| 4 | + |
| 5 | +## 1. FP8 E4M3FN Bit Layout |
| 6 | + |
| 7 | +The E4M3FN format packs a floating-point number into a single byte: |
| 8 | + |
| 9 | +``` |
| 10 | + Bit 7 Bits 6-3 Bits 2-0 |
| 11 | + ┌──────┬──────────────┬────────────┐ |
| 12 | + │ Sign │ Exponent │ Mantissa │ |
| 13 | + │ (1) │ (4) │ (3) │ |
| 14 | + └──────┴──────────────┴────────────┘ |
| 15 | +``` |
| 16 | + |
| 17 | +| Field | Width | Mask | Description | |
| 18 | +|-------|-------|------|-------------| |
| 19 | +| Sign | 1 bit | `0x80` | 0 = positive, 1 = negative | |
| 20 | +| Exponent | 4 bits | `0x78` | Biased unsigned integer (bias = 7) | |
| 21 | +| Mantissa | 3 bits | `0x07` | Explicit significand bits; normal numbers have an implicit leading 1 | |
| 22 | + |
| 23 | +The exponent bias is 7 (`2^(4-1) - 1`), giving an unbiased exponent range of [-6, +8] for stored values 1-15. Stored exponent 0 indicates a subnormal (no implicit leading 1). |
| 24 | + |
| 25 | +**Representable range:** The largest finite value (bit pattern `0x7E` = `0.1111.110`) is 1.75 x 2^8 = 448.0. The smallest positive normal is `0x08` (1.0 x 2^-6 = 0.015625). The smallest positive subnormal is `0x01` (binary 0.001 x 2^-6 = 2^-9 = 0.001953125). |
| 26 | + |
| 27 | +**Precision:** With 3 explicit mantissa bits (4 effective bits for normals), adjacent normal values are spaced up to 2^-3 = 12.5% apart, so round-to-nearest introduces a relative error of at most 2^-4 = 6.25%. This is adequate for storing quantized weights and activations but not for accumulation; ML frameworks accumulate in float16 or float32. |
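
The layout and range figures above can be checked with a small decoder. This is a minimal sketch: the function name `decode`, the constant names, and the use of a raw `uint8` in place of the library's `Float8` type are assumptions made for illustration, not the package's API.

```go
package main

import (
	"fmt"
	"math"
)

// Field masks and bias taken from the layout table.
const (
	signMask = 0x80
	expMask  = 0x78
	mantMask = 0x07
	expBias  = 7
)

// decode interprets b as an E4M3FN value.
// Hypothetical helper for illustration; not the library's API.
func decode(b uint8) float64 {
	sign := 1.0
	if b&signMask != 0 {
		sign = -1.0
	}
	exp := int(b&expMask) >> 3
	mant := int(b & mantMask)

	switch {
	case exp == 15 && mant == 7:
		return math.NaN() // 0x7F and 0xFF
	case exp == 0:
		// Zero or subnormal: no implicit 1, fixed exponent 1 - bias = -6.
		return sign * math.Ldexp(float64(mant), -6-3)
	default:
		// Normal: implicit leading 1, so the significand is (8 + mant) / 8.
		return sign * math.Ldexp(float64(8+mant), exp-expBias-3)
	}
}

func main() {
	fmt.Println(decode(0x7E)) // largest finite value
	fmt.Println(decode(0x08)) // smallest positive normal
	fmt.Println(decode(0x01)) // smallest positive subnormal
}
```
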
| 28 | + |
| 29 | +## 2. Lookup Table Strategy |
| 30 | + |
| 31 | +Because the format has only 256 possible bit patterns, exhaustive precomputation is practical: |
| 32 | + |
| 33 | +### Conversion Table |
| 34 | + |
| 35 | +A single 256-entry `[]float32` table maps every `Float8` bit pattern to its exact float32 equivalent. Indexed by `uint8(f)`, a lookup replaces the branch-heavy algorithmic decode path with a single array access. Memory cost: 256 x 4 = **1 KiB**. |
| 36 | + |
| 37 | +### Arithmetic Tables |
| 38 | + |
| 39 | +Each binary operation (add, subtract, multiply, divide) uses a 65,536-entry `[]Float8` table indexed by `uint16(a)<<8 | uint16(b)`. Every (a, b) pair is precomputed once from the algorithmic implementation. Memory cost per table: 65,536 x 1 = **64 KiB** (256 KiB total for all four operations). |
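
The indexing scheme can be sketched as follows. The names `binaryOp`, `index`, and `buildTable` are hypothetical, operands are raw `uint8` bit patterns rather than the library's `Float8` type, and `copySign` is used only because it can be computed directly on the bits; a real table would wrap the algorithmic `Add`, `Mul`, and so on.

```go
package main

import "fmt"

// binaryOp stands in for an algorithmic FP8 operation on raw bit patterns.
type binaryOp func(a, b uint8) uint8

// index packs two 8-bit operands into one 16-bit table index.
func index(a, b uint8) uint16 {
	return uint16(a)<<8 | uint16(b)
}

// buildTable precomputes op for all 65,536 operand pairs (64 KiB).
func buildTable(op binaryOp) []uint8 {
	table := make([]uint8, 1<<16)
	for a := 0; a < 256; a++ {
		for b := 0; b < 256; b++ {
			table[index(uint8(a), uint8(b))] = op(uint8(a), uint8(b))
		}
	}
	return table
}

// copySign returns the magnitude of a with the sign bit of b.
func copySign(a, b uint8) uint8 {
	return a&0x7F | b&0x80
}

func main() {
	table := buildTable(copySign)
	fmt.Println(len(table))                     // number of entries
	fmt.Printf("0x%02X\n", table[index(0x38, 0x80)]) // +1.0 with the sign of -0
}
```

After construction, each operation is a single slice access with no branching on the operand values.
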
| 40 | + |
| 41 | +### Lazy Initialization |
| 42 | + |
| 43 | +Tables are not allocated at package init. Callers opt in via `EnableFastConversion()` and `EnableFastArithmetic()`, which populate the tables on first call. This keeps the default memory footprint at zero for programs that only need occasional FP8 conversions. Tables can be released with the corresponding `Disable` functions. |
| 44 | + |
| 45 | +### Mode Selection |
| 46 | + |
| 47 | +Three arithmetic modes control dispatch: |
| 48 | + |
| 49 | +| Mode | Behavior | |
| 50 | +|------|----------| |
| 51 | +| `ArithmeticAuto` (default) | Use table if loaded, otherwise algorithmic | |
| 52 | +| `ArithmeticLookup` | Force table path (panics if tables not loaded) | |
| 53 | +| `ArithmeticAlgorithmic` | Force algorithmic path regardless of table state | |
| 54 | + |
| 55 | +## 3. Arithmetic Operations |
| 56 | + |
| 57 | +All arithmetic follows a **convert-up, compute, convert-down** pattern: |
| 58 | + |
| 59 | +1. Convert both `Float8` operands to `float32` (exact, since FP8 is a subset of float32). |
| 60 | +2. Perform the operation in float32 precision. |
| 61 | +3. Convert the float32 result back to `Float8` with round-to-nearest-even. |
| 62 | + |
| 63 | +This strategy inherits float32 IEEE 754 semantics and avoids implementing carry propagation, alignment shifting, or normalization in 8-bit arithmetic. |
| 64 | + |
| 65 | +**Operations provided:** `Add`, `Sub`, `Mul`, `Div`, `Sqrt`, `Pow`, `Exp`, `Log`, `Sin`, `Cos`, `Tan`, `Floor`, `Ceil`, `Round`, `Trunc`, `Fmod`, `Abs`, `Neg`, `Min`, `Max`, `Clamp`, `Lerp`, `CopySign`. |
| 66 | + |
| 67 | +**Comparison operations:** `Equal`, `Less`, `Greater`, `LessEqual`, `GreaterEqual` handle NaN (unordered), signed zeros (+0 == -0), and infinities per IEEE 754 rules. |
| 68 | + |
| 69 | +**Batch operations:** `AddSlice`, `MulSlice`, `ScaleSlice`, `SumSlice` operate element-wise on `[]Float8` slices. `ToSlice8` and `ToSlice32` handle bulk conversion between `[]float32` and `[]Float8`. |
| 70 | + |
| 71 | +## 4. Conversion To/From float32 |
| 72 | + |
| 73 | +### float32 to Float8 (`ToFloat8`) |
| 74 | + |
| 75 | +1. Handle special cases first: signed zeros, infinities, NaN. |
| 76 | +2. Extract sign, exponent, and mantissa from the float32 IEEE 754 bits. |
| 77 | +3. Re-bias the exponent: `exp8 = exp32 - 127 + 7`. |
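| 78 | +4. Check for overflow (exp8 > 15 -> clamp to the `PositiveInfinity`/`NegativeInfinity` bit pattern; see Section 5 for how E4M3FN interprets it) and underflow (exp8 < -7 -> clamp to zero). |
| 79 | +5. Truncate the 23-bit mantissa to 3 bits, applying round-to-nearest-even: round up when the 20 discarded bits exceed half a unit in the last place, and on an exact tie round toward the value whose lowest kept mantissa bit is 0. Handle mantissa carry into the exponent. |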
| 80 | +6. Pack sign (1 bit), exponent (4 bits), and mantissa (3 bits) into a `uint8`. |
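
A compact way to see round-to-nearest-even and the mantissa carry in action is the sketch below. It is not the bit-level procedure listed above: it uses an arithmetic formulation (scale, then `math.RoundToEven`) that produces the same rounding. The name `encode`, the raw `uint8` return type, and the choice to saturate overflow and infinities to +/-448 (`0x7E`/`0xFE`) are assumptions of this example, not a statement of the library's behavior.

```go
package main

import (
	"fmt"
	"math"
)

// encode rounds f to the nearest E4M3FN bit pattern, ties to even.
// Sketch assumptions: overflow and infinities saturate to +/-448, and NaN
// maps to 0x7F combined with the input's sign bit.
func encode(f float32) uint8 {
	sign := uint8(math.Float32bits(f)>>24) & 0x80
	a := math.Abs(float64(f)) // float32 -> float64 is exact
	switch {
	case a != a:
		return sign | 0x7F // NaN
	case a >= 448:
		return sign | 0x7E // saturate to the largest finite value
	case a == 0:
		return sign // signed zero
	}
	// Unbiased exponent, clamped so all subnormals share exponent -6.
	_, exp := math.Frexp(a) // a = frac * 2^exp with frac in [0.5, 1)
	e := exp - 1
	if e < -6 {
		e = -6
	}
	// q counts units of 2^(e-3): 0..8 in the subnormal range, 8..16 otherwise.
	// The scaling is exact, so RoundToEven performs the only rounding step.
	q := int(math.RoundToEven(math.Ldexp(a, 3-e)))
	// Adding q to the shifted exponent lets a carry out of the mantissa
	// (q == 16, or q == 8 for subnormals) roll into the exponent field.
	return sign | uint8((e+6)<<3+q)
}

func main() {
	fmt.Printf("0x%02X\n", encode(1.0))    // 1.0 is exactly representable
	fmt.Printf("0x%02X\n", encode(1.0625)) // tie between 1.0 and 1.125
	fmt.Printf("0x%02X\n", encode(1.1875)) // tie between 1.125 and 1.25
	fmt.Printf("0x%02X\n", encode(1000))   // above the finite range
}
```

The two tie cases resolve in opposite directions (1.0625 rounds down to `0x38`, 1.1875 rounds up to `0x3A`), which is what distinguishes ties-to-even from always rounding half up.
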
| 81 | + |
| 82 | +Three conversion modes control edge-case behavior: |
| 83 | + |
| 84 | +| Mode | Overflow | Underflow | NaN | |
| 85 | +|------|----------|-----------|-----| |
| 86 | +| `ModeDefault` | Saturate to infinity | Saturate to zero | Convert to `0x7F` | |
| 87 | +| `ModeStrict` | Return error | Return error | Return error | |
| 88 | +| `ModeFast` | Use lookup table | Use lookup table | Use lookup table | |
| 89 | + |
| 90 | +### Float8 to float32 (`ToFloat32`) |
| 91 | + |
| 92 | +The conversion is always exact (no rounding) because every FP8 value is representable in float32. For normal values, the algorithmic path extracts sign, exponent, and mantissa, re-biases the exponent (`exp32 = exp8 - 7 + 127`), shifts the 3-bit mantissa to float32 position (left-shift by 20), and assembles the 32-bit IEEE 754 pattern. Subnormal inputs (stored exponent 0) have no implicit leading 1, so they cannot be re-biased directly and need separate handling. With the conversion table enabled, the whole conversion reduces to a single array lookup. |
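
The bit assembly described above might look like the following sketch. The function name `toFloat32` and the raw `uint8` parameter are illustrative assumptions, and the subnormal and NaN branches show one reasonable way to handle those cases rather than the library's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// toFloat32 widens an E4M3FN bit pattern to float32 by assembling the
// IEEE 754 binary32 fields directly. Hypothetical helper for illustration.
func toFloat32(b uint8) float32 {
	sign := uint32(b&0x80) << 24 // move bit 7 to bit 31
	exp8 := uint32(b&0x78) >> 3
	mant := uint32(b & 0x07)

	switch {
	case exp8 == 15 && mant == 7:
		return math.Float32frombits(sign | 0x7FC00000) // quiet NaN
	case exp8 == 0:
		// Zero or subnormal: the value is mant * 2^-9, exact in float32.
		v := float32(mant) * (1.0 / 512)
		return math.Float32frombits(sign | math.Float32bits(v))
	default:
		exp32 := exp8 - 7 + 127 // re-bias from 7 to 127
		return math.Float32frombits(sign | exp32<<23 | mant<<20)
	}
}

func main() {
	fmt.Println(toFloat32(0x7E)) // largest finite value
	fmt.Println(toFloat32(0x38)) // 1.0
	fmt.Println(toFloat32(0x01)) // smallest positive subnormal
}
```
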
| 93 | + |
| 94 | +## 5. No-Infinities Design Rationale |
| 95 | + |
| 96 | +The E4M3FN format (the "FN" stands for "Finite, NaN") intentionally eliminates infinity encodings to maximize the finite representable range: |
| 97 | + |
| 98 | +- In standard IEEE 754, the all-ones exponent (`1111`) with a zero mantissa encodes infinity, and the same exponent with any nonzero mantissa encodes NaN. E4M3FN repurposes all of these encodings except mantissa `111` as normal finite values, extending the maximum magnitude from 240 to **448**. |
| 99 | +- Only the all-ones exponent with all-ones mantissa (`0x7F`, `0xFF`) is reserved for NaN. This gives exactly two NaN encodings (positive and negative) instead of the usual 14 quiet/signaling NaN patterns. |
| 100 | +- ML inference rarely produces or consumes infinities. Overflows during quantized GEMM/GEMV saturate to the maximum representable value rather than propagating infinity, which is more numerically stable for downstream operations like softmax and layer normalization. |
| 101 | + |
| 102 | +**Note:** The current implementation defines `PositiveInfinity` and `NegativeInfinity` constants for API compatibility with IEEE 754 conventions (e.g., overflow from float32 conversion maps to these bit patterns), but in E4M3FN semantics these are finite values equal to +/-448. |
| 103 | + |
| 104 | +## 6. Use in ML Inference |
| 105 | + |
| 106 | +FP8 E4M3FN is a widely used quantization format for weights and activations in transformer inference: |
| 107 | + |
| 108 | +### Quantized Storage |
| 109 | + |
| 110 | +Model weights stored in GGUF files use FP8 to reduce memory bandwidth by 4x compared to float32. The `ToSlice8`/`ToSlice32` batch conversion functions support bulk quantization and dequantization with negative-zero preservation. |
| 111 | + |
| 112 | +### GEMM/GEMV Kernels |
| 113 | + |
| 114 | +In the Zerfoo ecosystem, `ztensor` imports `float8` to implement quantized matrix-multiply kernels. Weights are stored as `[]Float8` and dequantized to float16 or float32 in registers before the fused multiply-accumulate. Accumulation always occurs in higher precision to avoid catastrophic rounding error. |
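
As a rough illustration of dequantize-then-accumulate, the sketch below computes one dot product (the inner loop of a GEMV) with FP8 weights and a float32 accumulator. It is not `ztensor`'s kernel: the names `dequant` and `dot` are hypothetical, weights are raw `uint8` bit patterns instead of `Float8`, and dequantization uses a locally built 256-entry table.

```go
package main

import (
	"fmt"
	"math"
)

// dequant maps every E4M3FN bit pattern to float32 (NaN for 0x7F and 0xFF).
var dequant = func() (t [256]float32) {
	for i := range t {
		exp, mant := (i>>3)&0x0F, i&0x07
		v := math.Ldexp(float64(8+mant), exp-10) // normal: (1 + mant/8) * 2^(exp-7)
		if exp == 0 {
			v = math.Ldexp(float64(mant), -9) // zero or subnormal
		}
		if exp == 15 && mant == 7 {
			v = math.NaN()
		}
		if i&0x80 != 0 {
			v = -v
		}
		t[i] = float32(v)
	}
	return
}()

// dot computes sum(w[i] * x[i]) with FP8 weights and a float32 accumulator.
// It assumes len(x) >= len(w).
func dot(w []uint8, x []float32) float32 {
	var acc float32
	for i, b := range w {
		acc += dequant[b] * x[i] // dequantize, then accumulate in float32
	}
	return acc
}

func main() {
	w := []uint8{0x38, 0x40, 0xB8} // 1.0, 2.0, -1.0
	x := []float32{1, 2, 3}
	fmt.Println(dot(w, x))
}
```

Only the stored weights are 8-bit; every product and partial sum stays in float32, so FP8 rounding error enters once per weight rather than once per accumulation step.
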
| 115 | + |
| 116 | +### Where FP8 Fits in the Precision Hierarchy |
| 117 | + |
| 118 | +| Type | Bits | Use Case | |
| 119 | +|------|------|----------| |
| 120 | +| float32 | 32 | Accumulation, loss computation, optimizer state | |
| 121 | +| float16 / bfloat16 | 16 | Activations, KV cache, intermediate results | |
| 122 | +| **float8 (E4M3FN)** | **8** | **Weight storage, activation quantization** | |
| 123 | +| int4 (Q4_K_M) | 4 | Aggressive weight quantization (GGUF) | |
| 124 | + |
| 125 | +### Scope |
| 126 | + |
| 127 | +This library covers E4M3FN only. The E5M2 variant (5 exponent bits, 2 mantissa bits, with infinities) is planned for a future release (see T46.4.8) and targets gradient storage in mixed-precision training, where the wider dynamic range matters more than precision. |