-A high-performance Go library implementing IEEE 754 FP8 E4M3FN format for 8-bit floating-point arithmetic, commonly used in machine learning applications for reduced-precision computations.
+FP8 E4M3FN arithmetic library for Go, commonly used in quantized ML inference.
 
-## Features
-
-- **IEEE 754 FP8 E4M3FN Format**: Complete implementation of the 8-bit floating-point format
-- **High Performance**: Optimized arithmetic operations with optional fast lookup tables
-- **Comprehensive API**: Full support for conversion, arithmetic, and mathematical operations
-- **Machine Learning Ready**: Designed for ML workloads requiring reduced precision
-- **Zero Dependencies**: Pure Go implementation with no external dependencies
-
-## Format Specification
-
-The Float8 type uses the E4M3FN variant of IEEE 754 FP8:
+Part of the [Zerfoo](https://github.com/zerfoo) ML ecosystem.
 
-- **1 bit**: Sign (0 = positive, 1 = negative)
-- **4 bits**: Exponent (biased by 7, range [-6, 7])
-- **3 bits**: Mantissa (3 explicit bits, 1 implicit leading bit for normal numbers)
-
-### Special Values
+## Features
 
-- **Zero**: Exponent=0000, Mantissa=000 (both positive and negative)
-- **NaN**: Exponent=1111, Mantissa=111
-- **No Infinities**: The E4M3FN variant does not support infinity values