IREE v3.11.0 Release Notes

Release Candidate: iree-3.11.0rc20260316
Commits: ~539 since v3.10.0
VMFB Bytecode Version: 17.0 (unchanged from v3.10.0)

Highlights

New async I/O infrastructure: Proactor-based async I/O with causal frontier scheduling, enabling cross-process shared memory support
Streaming tokenizer: Full HuggingFace-compatible tokenizer with tiktoken format support for OpenAI BPE vocabularies
Python 3.10+ requirement: Minimum Python version bumped to 3.10; Python 3.12+ supported via Stable ABI (abi3)
ROCm flag rename: iree-hip-* compiler flags renamed to iree-rocm-* (old names deprecated with warnings)
Enhanced vector distribution: Refactored 2-phase forward/backward layout analysis with improved transfer_gather support

Breaking Changes

VMFB Compatibility

VMFB bytecode version is unchanged (17.0): VMFBs compiled with v3.10.0 remain compatible with the v3.11.0 runtime
- No recompilation is needed when upgrading from v3.10.0

Python Version Requirement

Minimum Python version is now 3.10 (#23591)
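
A setup or launcher script can check the interpreter version up front instead of failing later during import. The following is a minimal sketch; the names MIN_PYTHON and meets_minimum are illustrative and not part of any IREE package.

```python
import sys

# Minimum version stated in these release notes.
MIN_PYTHON = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"Python 3.10 or newer is required, found {found}")
```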

Compiler Flag Renames

iree-hip-* flags renamed to iree-rocm-* (#23420)
- Old flag names emit deprecation warnings but still work
- CMake: IREE_HIP_TEST_TARGET_CHIP → IREE_ROCM_TEST_TARGET_CHIP
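
Build scripts that still pass the old names can be migrated mechanically. The sketch below assumes that each iree-hip-* flag keeps its suffix under the new iree-rocm-* prefix, which is how the rename is described above but is not verified here per flag. The helper names are hypothetical, and the flag used in the usage note is only an illustration.

```python
import re

# Matches a leading "-" or "--" followed by the deprecated prefix.
_OLD_FLAG = re.compile(r"^(--?)iree-hip-")

# CMake variable rename listed in the release notes.
_OLD_CMAKE_VAR = "IREE_HIP_TEST_TARGET_CHIP"
_NEW_CMAKE_VAR = "IREE_ROCM_TEST_TARGET_CHIP"


def migrate_arg(arg: str) -> str:
    """Rewrite one command-line argument from the old naming to the new one."""
    arg = _OLD_FLAG.sub(r"\1iree-rocm-", arg)
    return arg.replace(_OLD_CMAKE_VAR, _NEW_CMAKE_VAR)


def migrate_args(args):
    """Rewrite a whole argument list, leaving unrelated arguments untouched."""
    return [migrate_arg(arg) for arg in args]
```

For example, migrate_args(["--iree-hip-target=gfx942", "-O3"]) rewrites only the first argument. Arguments that merely contain "hip" elsewhere, such as a value after "=", are left alone because the pattern is anchored to the start of the flag name.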

Build System Changes

Minimum CMake version bumped to 3.26 (#23607)
- Required for Python Stable ABI support

API Changes

map_gather/map_scatter ops renamed to map_load/map_store in LinalgExt (#23481)
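
Checked-in MLIR test files or tuning specs that mention the old op names need the same rename. The sketch below assumes that the ops are spelled with the iree_linalg_ext. prefix and that only the op name changed; if the operand or attribute syntax also changed in #23481, a textual rename alone is not enough. The function name rename_ops is hypothetical.

```python
import re

# Old op name -> new op name, as listed in the release notes.
RENAMES = {"map_gather": "map_load", "map_scatter": "map_store"}

# Only rewrite names that carry the LinalgExt dialect prefix.
_PATTERN = re.compile(r"\biree_linalg_ext\.(map_gather|map_scatter)\b")


def rename_ops(mlir_text: str) -> str:
    """Replace the old LinalgExt op names with the new ones in MLIR text."""
    return _PATTERN.sub(
        lambda match: "iree_linalg_ext." + RENAMES[match.group(1)], mlir_text
    )
```

Anchoring on the dialect prefix keeps the rewrite from touching unrelated identifiers, such as pass or function names that happen to contain map_gather.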

What's New

1. Compiler

1.1 Async Infrastructure & Tokenizers

Major new infrastructure for async I/O and text processing:

Added proactor-based async I/O with causal frontier scheduling (iree/async/) (#23527)
Added streaming tokenizer with full HuggingFace compatibility (iree/tokenizer/) (#23528)
Graceful degradation for io_uring slab registration on RLIMIT_MEMLOCK (#23654)
Added tiktoken format loader for OpenAI BPE vocabularies (#23663)
Added async infrastructure for cross-process shared memory (#23688)
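
For readers unfamiliar with the tiktoken format mentioned above: a tiktoken vocabulary file is plain text in which each line holds a base64-encoded token and its integer rank, separated by whitespace. The sketch below parses that layout in Python purely as an illustration. It is not the IREE loader, and the function name parse_tiktoken is hypothetical.

```python
import base64


def parse_tiktoken(text: str) -> dict:
    """Parse tiktoken-style vocabulary text into a {token_bytes: rank} mapping."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # Skip blank lines.
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


if __name__ == "__main__":
    # Three tiny entries: b"a", b"b", and the merged pair b"ab".
    sample = "YQ== 0\nYg== 1\nYWI= 2\n"
    for token, rank in parse_tiktoken(sample).items():
        print(rank, token)
```

Lower ranks correspond to earlier BPE merges, so a tokenizer can use the rank both as the token id and as the merge priority.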

1.2 Codegen & Vector Distribution

Significant improvements to vector distribution and code generation:

Added support for shape_cast in vector distribution (#23307)
Support for padding integer attention masks (#23430)
Added arg_compare operation to VectorExt (#23386)
Refactored transfer_gather to use unified indexing_maps (#23510)
Added distribution pattern for iree_codegen.inner_tiled (#23483)
Added vectorization support for iree_linalg_ext.arg_compare (#23440)
Added transfer_gather unrolling (#23517)
Support multi-batch gather vectorization to transfer_gather (#23552)
Added transfer_gather canonicalizations for masking (#23565)
Refactored VectorLayoutAnalysis into 2-phase forward/backward design (#23611)
Added TransferScatterOp definition and verifier (#23666)
Introduced VectorizableOpInterface and migrated all ops (#23653, #23656, #23658, #23662, #23712, #23713, #23767)
Added iree_map dialect with PackMapAttr and VectorLayoutInterface (#23671, #23672)
Added TransferScatterOp bufferization support (#23719)
Materialize vector masking on VectorDistribute pipeline (#23679)
Added vectorization of non-projected linalg.generic (#23664)
Implemented ValueBoundsOpInterface for ToLayoutOp (#23766)
Apply bounds to subgroup_id (#23768)

1.3 GPU Codegen Improvements

Added multi-buffering support for gather_to_lds async copy mode (#23354)
Enabled swizzling for scaled matmuls (#23175)
Added CombineSourceLayoutTransformation pass for MapGatherOp (#23165)
Reworked GPUVerifyDistribution to use PreOrder walk with skip (#23502)
Combined CombineBarrierRegionsPass and CombineValueBarrierOps into a single pass, GPUCombineValueSemanticsBarriersPass (#23518)
Added async copy mode pipelining for gather_to_lds (#23400)
Moved hoisting to an interface and added it for barrier ops (#23519)
GPU shared memory allocation based on layout analysis (#23631)
Added iree_gpu.global_subgroup_barrier op (#23451)
Added coalescing to reduction tiling (#23673)
Make VectorReductionToGPU scf.forall-aware (#23686)
Fixed shared memory estimation for multi-buffering (#23736)
Added explicit async markers for multi-buffered async load pipelining (#23648)

1.4 GPU Heuristics

Prefer larger MMA intrinsics for very large compute-bound GEMMs (#23641)
Added min-based tile distribution for imbalanced M/N problems (#23619)
Updated number of VGPRs on gfx1250 (RDNA4) (#23709)
Refactored MMA heuristic seeds to be architecture-specific (#23717)

1.5 CPU Backend

Added CPU optimization level option (#23259)
Configure GatherOp tiling sizes based on semantics (#23419)
Tuning spec support for LLVMCPU (#23424)
New heuristic for AArch64 matmul vector tile sizes (#22932)
Enable masking by default for targets with AVX-512 (#23470)
Dynamic attention support by tiling K1 when needed (#23544)
Initial plumbing for inner_tiled with data-tiled MMA attribute (#23494)
Propagate reduction tile sizes to producers for fusion (#23660)
Use TileSwizzle for inner_tiled layout on CPU (#23705)

1.6 LDS & Memory Access

Only enable coalesced DMA when elements are aligned to minimum transfer size (#23416)
Pre-check to ensure all copies are DMA-convertible before converting any (#23472)
Added in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion (#23365)
Added fallback for CoalescedGatherDMA lowering (#23560)

1.7 PCF Operation Enhancements

Fixed bufferization bugs for generic and loop ops (#23446)
Added producer fusion into pcf.generic/loop ops (#23447)
Added FuseSubgroupConsumers pass to fuse consumers and extract_slice ops into subgroup-scoped pcf.generic/loop ops (#23484)
Added MemoryEffectsOpInterface to WriteSliceOp (#23490)
Added tensor.collapse_shape fusion into pcf.generic/loop (#23491)

1.8 Dispatch Creation

Moved iteration space tracking to LinalgExt (#23221)
Ignore unit dims when comparing iteration spaces (#23362)
Updated split reduction heuristics for GEMM (#23423)
Fixed producer fusion with nested region uses (#23475)
Split reduction sizes set for batch-first conv layouts (#23524)
Fixed fusion of scalar reduction with consumer (#23659)

1.9 Target Backends

ROCm:

Added workgroup reordering for data-tiling ukernels (#23358)
Clear sticky error after hipErrorPeerAccessAlreadyEnabled (#23538)
Added gfx950 f8e4m3fn ukernel (#23581)

SPIR-V:

Enable small float support in SPIR-V pipeline (#23391)
Reworked rootOp selection in kernel config (#23685)
Enable scf.forall-based workgroup distribution (#23684)

VMVX:

Enable sub-byte and small float support in VMVX pipeline (#23375)
Enable scf.forall distribution for VMVX pipelines (#23615)

1.10 Other Compiler Improvements

Added option to enable fp8/fp4 software emulation for all GPU targets (#23238)
Layout for mma.sync.m16n8k16 (#22847)
Generalized ConvertBf16ToUInt16Buffers to support fp8 types (#23389)
Moved Convert1X1FilterConv2DToMatmul pass from GlobalOptimization to Preprocessing (#23445)
Pattern to hoist expand_shape & collapse_shape from scf.for loop (#23572)
Torch: Added flag to enable shape refinement (#23632)
Support for Img2Col Transformation for Conv2D including quantized types (#23278)
Added PipelineAttrInterface and PassPipelineAttr (#23590)
Added HAL pass pipeline caching for executable translation (#23643)
Added tuner SMT ops: constraints, knobs, assert (iree_codegen.smt.*) (#23687, #23742, #23743, #23780)
Deleted LinalgExt::PackOp and LinalgExt::UnPackOp (#23550)
Added MultiPipelineNest for cross-type parallelism (#23620)
Deleted ConvertToDestinationPassingStylePass pass (#23783)

2. Runtime

2.1 VM Improvements

Added sub-byte integer support to ArithToVM (#23372)
Added small float type support to VM conversion (#23373)
Added buffer ops for 8-bit and 16-bit floats in UtilToVM (#23374)

2.2 HAL & Device Infrastructure

Added device topology infrastructure to HAL (#23573)
Added iree_hal_device_group_t to own device topology lifecycle (#23576)
Fixed Vulkan driver crash from UNIMPLEMENTED query_capabilities (#23582)
Added samples/hal/hello: pure HAL buffer fill, copy, and readback (#23645)

2.3 Utilities

math.h improvements: use popcount builtins and clean up float conversions (#23385)
Added util.string operations for runtime string formatting (#23425)
Added dynamic parameter scope and key with !util.buffer operands (#23426)
Moved flags from iree/base/internal to iree/base/tooling (#23578)
Added MPSC (multi-producer single-consumer) queue (#23700)
Added huge page and NUMA placement support for SHM (#23697)
Replaced libc printf with eyalroz/printf and added streaming status formatting (#23694)
Added status copy allocation and payload inspection APIs (#23698)
Removed 64-operation batch size limit from io_uring submit (#23725)

3. Tools & Bindings

3.1 C API

Added C API support for --iree-codegen-tuning-spec-path flag (#23320)
Exposed XOR shuffle bounds and validation functions in CAPI (#23442)
Added --exclude-libs=ALL to libIREECompiler.so shared library (#23574)

3.2 HAL CTS

Rewrote the HAL CTS to support Bazel and scale better (#23644)

4. Infrastructure & CI

Removed MI250 testing (#23352)
Moved linux_x64_clang_debug to postsubmit (#23434)
Disabled internal linkage clang-tidy checks (#23569)
Cleaned up RISC-V toolchain files (#23457)
Added typos pre-commit hook and dictionary (#23606)

5. Documentation

Added lowering config guide (reduction; partial reduction only) (#22250)
Fixed invalid flag names and typos (#23542)
Fixed reference warnings and updated metal-hal-driver.md (#23547)
Added IREECPU and PCF dialects and passes to website docs (#23546)
Updated Python versions listed on the website (#23647)

Bug Fixes

Compiler

Fixed region-aware SCF handling in Explorer and ElideAsyncCopiesPass (#23257)
Fixed DenseMap iterator invalidation in OptionsBinder::topLevelOpt (#23412)
Fixed clone-through-barrier lifetime regression in resource usage analysis (#23417)
Fixed barrier handling when pipelining with non-transfer_read ops (#23435)
Fixed wrong output_shape of tensor.expand_shape ops (#23469)
Fixed EmplaceTransientsPass to handle zero allocas case (#23392)
Fixed suffix scan lowering in StableHLO reduce_window conversion (#23397)
Fixed promote_operands/promotion_types size mismatch (#23543)
Fixed layout analysis fixup crashes (#23630)
Fixes for UBSan compatibility across the runtime (#23692)
Do not vectorize with invalid vector sizes from lowering config (#23661)
Fixed uninitialized tensorCoreType causing flaky vector_to_gpu tests (#23701)
Bail out of foldReshapeIntoMapStore/MapLoad for 0-D tensor (#23716)
Fixed im2col decomposition to handle multiple K input positions (#23731)
Fixed crash in ROCDLLoadToTransposeLoad on block argument column index (#23755)
Fixed correctness issues in ConvertGatherToLDS narrow type emulation (#23763)
Fixed ub.poison legalization failure for 2D vectors in SPIR-V (#23789)
Fixed crash in complex matmul configuration logic on ROCm (#23790)
Fixed vector distribution for transposed outputs on ROCm (#23791)

Runtime

Fixed RISC-V test (#23405)
Fixed Async macOS CTS test flakes: dangling stack ops, RST detection, kqueue event loss (#23570)
Fixed Async multishot CTS test flakes (#23577)
Fixed VM ref leak from incorrect MOVE bit on branch block args (#23689)
Fixed ARM64 ring buffer oversubscription from load reordering (#23707)
Async proactor fixes: TSAN bridge and progress callback starvation (#23699)
Fixed hipHostUnregister/cuMemHostUnregister leak on import (#23779)

Full Changelog

Commits: v3.10.0...iree-3.11.0rc20260316

Contributors

Thank you to all contributors to this release!

Contributors:
@AWoloszyn, @AaronStGeorge, @Abhishek-Varma, @Groverkss, @HanKuanChen, @Hardcode84, @IanWood1, @MaheshRavishankar, @Max191, @Muzammiluddin-Syed-ECE, @RattataKing, @ScottTodd, @YashDeshpande25, @Yu-Zhewen, @amd-eochoalo, @bangtianliu, @benvanik, @bjacob, @efric, @egebeysel, @hanhanW, @javidcf, @jerryyin, @josephbak, @jtuyls, @keshavvinayak01, @krzysz00, @kuhar, @lialan, @momchil-velikov, @nirvedhmeshram, @phoebesv, @qedawkins, @rkayaith, @sa-faizal, @sjain-stanford, @sommerlukas, @stellaraccident, @yzhang93, @ziereis

Release notes generated from release candidate iree-3.11.0rc20260316
