This document tracks SIMD optimization progress for mozjpeg-oxide.
Benchmark setup: 2048x2048 image, 30 iterations, with AVX2 (`RUSTFLAGS="-C target-feature=+avx2"`):
| Mode | Rust | C mozjpeg | Ratio | Notes |
|---|---|---|---|---|
| Baseline | 40.04 ms | 8.46 ms | 4.74x slower | Entropy encoding bottleneck |
| Trellis | 162.88 ms | 181.07 ms | 0.90x (10% faster!) | Rust wins with trellis |
512x512 image (less accurate due to system noise):
| Mode | Rust | C mozjpeg | Ratio | Notes |
|---|---|---|---|---|
| Baseline | 1.73 ms | 0.45 ms | 3.9x slower | DCT + color optimized |
| Trellis | 11.13 ms | 11.72 ms | 0.95x (faster!) | Rust slightly ahead |
Key insight: Larger images give more accurate measurements. The 4.74x gap reflects the true baseline performance; entropy encoding dominates at scale.
Baseline mode profile by stage:
| Stage | Time (µs) | % of Total | Priority |
|---|---|---|---|
| Entropy encoding | 1498 | 58.8% | HIGH |
| Color conversion | 353 | 13.8% | DONE (i32x8) |
| Quantization | 317 | 12.4% | LOW |
| Forward DCT | 222 | 8.7% | DONE (AVX2) |
| Downsampling | 128 | 5.0% | LOW |
| MCU expansion | 31 | 1.2% | - |
Key Finding: Entropy encoding is the main remaining bottleneck at 59% of baseline time. DCT and color conversion have been optimized.
Trellis mode profile by stage:
| Stage | Time (µs) | % of Total |
|---|---|---|
| Trellis quantization | 6621 | 77.2% |
| Prep (color+down) | 1046 | 12.2% |
| Entropy encoding | 656 | 7.7% |
| Forward DCT | 251 | 2.9% |
Key Finding: Trellis dominates as expected. No optimization needed (already at parity).
Based on profiling, the optimization priority order for baseline mode follows the table above; notes on each stage below.
Entropy encoding (HIGH). Current issues:
- `put_bits` called for every code and value (many function calls)
- `BitWriter` writes through the `Write` trait (potential virtual dispatch)
- `emit_byte_stuffed` writes 1 byte at a time when 0xFF is detected
- `flush_buffer` checks for 0xFF in every 8-byte chunk
Potential optimizations:
- Buffer multiple codes before writing (reduce function calls)
- Direct `Vec` access instead of the `Write` trait
- SIMD-accelerated 0xFF byte detection and stuffing
- Inline `put_bits` more aggressively
Reference: libjpeg-turbo uses optimized assembly for entropy encoding.
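A minimal sketch of the buffering direction (hypothetical type and field names, not the crate's current `BitWriter`): accumulate codes in a `u64` and drain whole bytes at once, moving the 0xFF check to the byte level.

```rust
// Hypothetical buffered bit writer: the low `nbits` bits of `bits` are the
// pending output, with the most recently appended code at the bottom.
struct BufferedBitWriter {
    out: Vec<u8>,
    bits: u64,
    nbits: u32,
}

impl BufferedBitWriter {
    #[inline(always)]
    fn put_bits(&mut self, code: u32, size: u32) {
        debug_assert!(size <= 24, "a Huffman code or magnitude field fits in 24 bits");
        self.bits = (self.bits << size) | u64::from(code);
        self.nbits += size;
        // Drain complete bytes; `as u8` truncation discards stale high bits.
        while self.nbits >= 8 {
            self.nbits -= 8;
            let byte = (self.bits >> self.nbits) as u8;
            self.out.push(byte);
            if byte == 0xFF {
                self.out.push(0x00); // JPEG byte stuffing after 0xFF
            }
        }
    }
}
```

This amortizes per-symbol call overhead and replaces per-bit 0xFF handling with a single comparison per output byte.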
Color conversion (DONE): updated to use `i32x8` (8 pixels at a time) for AVX2 width.
Further optimization would require AVX2 intrinsics for RGB deinterleaving,
but gains would be limited since gather/scatter is the bottleneck.
Quantization (LOW priority): simple division and rounding; already fast.
Forward DCT (DONE): AVX2 intrinsics implemented, 60 ns per block (details below).
Forward DCT (`wide`-based SIMD) - Status: DONE (Dec 2024)
Three implementations are available:
- `forward_dct_8x8` - Scalar reference (kept for correctness testing)
- `forward_dct_8x8_simd` - Gather-based SIMD with `i32x4`
- `forward_dct_8x8_transpose` - Transpose-based SIMD with `i32x8` (production)
The transpose-based approach avoids expensive gather/scatter by:
- Loading rows contiguously (no gather)
- Transposing to enable row-parallel processing
- Using 8-wide SIMD (`i32x8`) for full row processing
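For context, the 2D DCT factors into 1D passes: run the 1D DCT over all rows, transpose, run it over the former columns, then transpose back. A minimal scalar sketch of the transpose step (the production path keeps the block in SIMD registers instead):

```rust
// In-place 8x8 transpose (scalar sketch): swapping across the diagonal
// turns columns into rows, so the row-parallel 1D DCT can be reused.
fn transpose_8x8(m: &mut [[i32; 8]; 8]) {
    for i in 0..8 {
        for j in (i + 1)..8 {
            let tmp = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = tmp;
        }
    }
}
```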
Optimizations applied:
- Pre-computed SIMD constants (`const`, so no per-call allocation)
- Pre-negated constants (no runtime negation)
- Inlined 1D DCT helper (`#[inline(always)]`)
- Contiguous memory loads (no gather operations)
Benchmark Results (single 8x8 block):
| Implementation | Time | Throughput | vs Scalar |
|---|---|---|---|
| Scalar | 81 ns | 786 Melem/s | 1.00x |
| SIMD (gather) | 67 ns | 955 Melem/s | 1.21x |
| SIMD (transpose) | 59 ns | 1078 Melem/s | 1.37x |
Encoder Impact (512x512 image):
| Mode | Before SIMD | After SIMD | Improvement |
|---|---|---|---|
| Baseline | 2.47ms | 2.28ms | 8.3% faster |
| Trellis | 12.02ms | 11.57ms | 3.8% faster |
The DCT SIMD reduced the gap vs C mozjpeg from 5.7x to 5.0x slower.
Why only 1.37x speedup instead of 4x?
- The `wide` crate uses portable SIMD, not native intrinsics
- Transpose overhead (3 transposes per block, done via scalar code)
- C mozjpeg uses true SSE2/AVX2 with a shuffle-based transpose
Forward DCT (AVX2 intrinsics) - Status: DONE (Dec 2024)
Uses `core::arch::x86_64` intrinsics directly for maximum performance.
Key optimizations over the `wide`-based version:
- `_mm256_cvtepi16_epi32` for a proper load + sign-extend (no `vpinsrw` gather)
- Shuffle-based transpose using `_mm256_unpacklo/hi_epi32/64` + `_mm256_permute2x128_si256`
- Proper `_mm256_srai_epi32` with const generic shift amounts
- `_mm_packs_epi32` for efficient i32→i16 packing
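As a sketch of the load + widen pattern (hypothetical function name; assumes AVX2 is available on the target):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Load one row of 8 i16 samples and sign-extend to 8 i32 lanes:
// one vmovdqu + one vpmovsxwd instead of eight vpinsrw inserts.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn load_row_i32(row: &[i16; 8]) -> __m256i {
    let packed = _mm_loadu_si128(row.as_ptr() as *const __m128i); // 8 x i16
    _mm256_cvtepi16_epi32(packed) // widen to 8 x i32
}
```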
Benchmark Results (with `-C target-cpu=native`):
| Implementation | Time | Throughput | vs Scalar |
|---|---|---|---|
| AVX2 intrinsics | 40.1 ns | 1.59 Gelem/s | 1.19x faster |
| Scalar | 47.5 ns | 1.35 Gelem/s | 1.00x |
| SIMD transpose (wide) | 50.3 ns | 1.27 Gelem/s | 0.95x (slower!) |
| SIMD gather (wide) | 56.1 ns | 1.14 Gelem/s | 0.85x (slower!) |
Key Finding: The `wide` crate is slower than optimized scalar code when the compiler can auto-vectorize with `-C target-cpu=native`. Explicit `core::arch` intrinsics are required for an actual speedup.
Color conversion - Status: DONE (Dec 2024)
Uses `wide::i32x8` to process 8 pixels at a time (AVX2 width).
```rust
// SIMD RGB to YCbCr conversion (8 pixels at a time)
let fix_y_r = i32x8::splat(FIX_0_29900);
let fix_y_g = i32x8::splat(FIX_0_58700);
let fix_y_b = i32x8::splat(FIX_0_11400);
let y = (fix_y_r * r + fix_y_g * g + fix_y_b * b + half) >> SCALEBITS;
```

Functions optimized:
- `convert_rgb_to_ycbcr()` - 8 pixels/iteration (was 4)
- `convert_rgb_to_gray()` - 8 pixels/iteration (was 4)
Benchmark Results:
| Image Size | Before (i32x4) | After (i32x8) | Improvement |
|---|---|---|---|
| 512x512 | 375 µs | 353 µs | ~6% faster |
Why only 6% improvement?
- The gather operation (extracting R, G, B from the interleaved format) dominates
- `i32x8::new([...])` still uses element-by-element construction
- True AVX2 gather (`_mm256_i32gather_epi32`) might help, but the interleaved RGB format is inherently unfriendly to SIMD
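To make the bottleneck concrete, here is an illustrative sketch (hypothetical helper, not the crate's actual code) of gathering one channel from interleaved RGB:

```rust
use wide::i32x8;

// Gather the R channel of 8 interleaved RGB pixels (stride 3).
// Every lane is a separate scalar load, so no vector load instruction helps.
fn gather_r(px: &[u8; 24]) -> i32x8 {
    i32x8::new([
        px[0] as i32, px[3] as i32, px[6] as i32, px[9] as i32,
        px[12] as i32, px[15] as i32, px[18] as i32, px[21] as i32,
    ])
}
```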
Hypothesis: Streaming bitwriter with precomputed code tables would reduce per-symbol overhead and improve entropy encoding performance.
What we tried:
```rust
// Direct Vec<u8> access instead of a BitWriter trait object
struct FastEntropyEncoder {
    output: Vec<u8>,
    bit_buffer: u64,
    bits_in_buffer: u32,
}

// Precomputed code/size tables per block
let mut code_buf: [u16; 64] = [0; 64];
let mut size_buf: [u8; 64] = [0; 64];
```

Results:
| Mode | Before | After | Change |
|---|---|---|---|
| Full encoder | 0.95 ms | 2.45 ms | 2.6x SLOWER |
| Isolated entropy | N/A | N/A | 1.38x faster |
Why it failed:
- Micro-optimization hurt macro-performance due to code locality issues
- The "fast" path bloated the encoder binary, hurting instruction cache
- Extra complexity (precomputed tables) added overhead that offset gains
- Isolated benchmarks don't capture instruction cache effects
Lesson learned: Always benchmark the full pipeline, not just isolated functions. Micro-optimizations can destroy macro-performance.
Based on research from zune-image, libjpeg-turbo, and Nine Rules for SIMD:
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| `core::simd` (nightly) | Portable, future standard | Nightly only | Good for experimentation |
| `core::arch` intrinsics | Maximum performance | Arch-specific, unsafe | Best for production perf |
| `wide` crate | Stable, portable | Slower than autovectorized scalar | Avoid for perf-critical code |
| `pulp` crate | Runtime multiversioning | Extra dependency | Good for portable binaries |
| Autovectorization | Zero effort | Unpredictable, often fails for floats | Default fallback |
Key Lessons:
- The `wide` crate generates `vpinsrw` (scalar insert) instead of proper SIMD loads
- With `-C target-cpu=native`, scalar code autovectorizes and beats `wide`
- For a real speedup, use `core::arch` intrinsics with proper load/widen patterns
- Use `_mm256_cvtepi16_epi32` for i16→i32 widening (not element-by-element construction)
- Use a shuffle-based transpose (unpack + permute), not array extraction
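To illustrate the last lesson, a sketch of the unpack + permute transpose for an 8x8 i32 matrix held in eight ymm registers (illustrative; assumes AVX2):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Transpose an 8x8 i32 matrix held in eight ymm registers using
// unpack + permute shuffles (no per-element extraction).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn transpose_8x8_epi32(r: [__m256i; 8]) -> [__m256i; 8] {
    // Stage 1: interleave 32-bit lanes of adjacent row pairs.
    let t0 = _mm256_unpacklo_epi32(r[0], r[1]);
    let t1 = _mm256_unpackhi_epi32(r[0], r[1]);
    let t2 = _mm256_unpacklo_epi32(r[2], r[3]);
    let t3 = _mm256_unpackhi_epi32(r[2], r[3]);
    let t4 = _mm256_unpacklo_epi32(r[4], r[5]);
    let t5 = _mm256_unpackhi_epi32(r[4], r[5]);
    let t6 = _mm256_unpacklo_epi32(r[6], r[7]);
    let t7 = _mm256_unpackhi_epi32(r[6], r[7]);
    // Stage 2: interleave 64-bit pairs to build 4-row column groups.
    let u0 = _mm256_unpacklo_epi64(t0, t2);
    let u1 = _mm256_unpackhi_epi64(t0, t2);
    let u2 = _mm256_unpacklo_epi64(t1, t3);
    let u3 = _mm256_unpackhi_epi64(t1, t3);
    let u4 = _mm256_unpacklo_epi64(t4, t6);
    let u5 = _mm256_unpackhi_epi64(t4, t6);
    let u6 = _mm256_unpacklo_epi64(t5, t7);
    let u7 = _mm256_unpackhi_epi64(t5, t7);
    // Stage 3: stitch 128-bit halves; 0x20 takes the low halves,
    // 0x31 the high halves, yielding output columns 0..7.
    [
        _mm256_permute2x128_si256::<0x20>(u0, u4),
        _mm256_permute2x128_si256::<0x20>(u1, u5),
        _mm256_permute2x128_si256::<0x20>(u2, u6),
        _mm256_permute2x128_si256::<0x20>(u3, u7),
        _mm256_permute2x128_si256::<0x31>(u0, u4),
        _mm256_permute2x128_si256::<0x31>(u1, u5),
        _mm256_permute2x128_si256::<0x31>(u2, u6),
        _mm256_permute2x128_si256::<0x31>(u3, u7),
    ]
}
```

Every stage is a register-to-register shuffle, so the whole transpose costs 24 instructions and never round-trips through memory.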
Platform-specific SIMD - Status: PARTIALLY DONE (AVX2 DCT implemented)
To close the remaining 5x gap, platform-specific SIMD is needed:
Option 1: std::simd (nightly)
- Rust's experimental portable SIMD
- Better codegen than `wide`
- Requires nightly compiler
Option 2: core::arch intrinsics
- Direct SSE2/AVX2/NEON intrinsics
- Maximum performance, matches C
- Architecture-specific code paths
- More complex maintenance
Option 3: pulp crate
- Higher-level abstraction over intrinsics
- Runtime CPU feature detection
- Good balance of performance and portability
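Whichever option is chosen, the public entry point can stay safe by dispatching at runtime. A minimal sketch (the `forward_dct_8x8_avx2` name and the signatures are assumptions following the scalar reference):

```rust
// Runtime feature dispatch: take the AVX2 path when the CPU supports it,
// otherwise fall back to the scalar reference implementation.
fn forward_dct_8x8_dispatch(samples: &[i16; 64], coeffs: &mut [i16; 64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified immediately above.
            unsafe { forward_dct_8x8_avx2(samples, coeffs) };
            return;
        }
    }
    forward_dct_8x8(samples, coeffs); // scalar fallback, always correct
}
```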
Downsampling - Status: NOT STARTED
Simple averaging operations. Lower priority because:
- Only used for 4:2:0 and 4:2:2 subsampling
- Smaller fraction of total encoding time
- Simple loop already optimizes reasonably well
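For reference, the core operation is just a 2x2 box average per chroma output sample. A minimal sketch (hypothetical function; real mozjpeg also handles odd dimensions and an alternating rounding bias):

```rust
// 4:2:0 chroma downsampling: each output sample is the rounded average
// of a 2x2 block of input samples. Assumes even width and height.
fn downsample_420(src: &[u8], w: usize, h: usize, dst: &mut [u8]) {
    let (ow, oh) = (w / 2, h / 2);
    for y in 0..oh {
        for x in 0..ow {
            let (sx, sy) = (x * 2, y * 2);
            let sum = u32::from(src[sy * w + sx])
                + u32::from(src[sy * w + sx + 1])
                + u32::from(src[(sy + 1) * w + sx])
                + u32::from(src[(sy + 1) * w + sx + 1]);
            dst[y * ow + x] = ((sum + 2) / 4) as u8;
        }
    }
}
```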
The crate depends on `wide = "1.1"`. Key types and operations:

```rust
use wide::i32x4;

// Create vectors
let v = i32x4::splat(42);         // All lanes the same value
let v = i32x4::new([1, 2, 3, 4]); // From an array
let (a, b) = (v, v);

// Arithmetic (element-wise)
let sum = a + b;
let prod = a * b;
let shifted = v >> 16;

// Extract
let arr = v.to_array();
```

Testing guidelines:
- Correctness first: compare output to the scalar version
- Edge cases: test with various input patterns
- Benchmark: use criterion to measure speedup
```rust
#[test]
fn test_simd_dct_matches_scalar() {
    let samples = [/* test data */];
    let mut coeffs_scalar = [0i16; 64];
    let mut coeffs_simd = [0i16; 64];
    forward_dct_8x8(&samples, &mut coeffs_scalar);
    forward_dct_8x8_simd(&samples, &mut coeffs_simd);
    assert_eq!(coeffs_scalar, coeffs_simd);
}
```

```sh
# Run DCT microbenchmark
cargo bench --bench encode -- dct

# Run full encoder benchmark vs C
cargo bench --bench encode -- rust_vs_c
```

Key metrics:
- DCT time per block (ns) - currently 58.5ns SIMD vs 80ns scalar
- Overall encoding time (ms) - currently 2.28ms vs C's 0.46ms
- Throughput (Melem/s) - currently 1094 SIMD vs 800 scalar