IREE v3.11.0 Release Notes
Release Candidate: iree-3.11.0rc20260316
Commits: ~539 commits since v3.10.0
VMFB Bytecode Version: 17.0 (unchanged from v3.10.0)
Highlights
- New async I/O infrastructure: Proactor-based async I/O with causal frontier scheduling, enabling cross-process shared memory support
- Streaming tokenizer: Full HuggingFace-compatible tokenizer with tiktoken format support for OpenAI BPE vocabularies
- Python 3.10+ requirement: Minimum Python version bumped to 3.10; Python 3.12+ supported via Stable ABI (abi3).
- ROCm flag rename: iree-hip-* compiler flags renamed to iree-rocm-* (old names deprecated with warnings)
- Enhanced vector distribution: Refactored 2-phase forward/backward layout analysis with improved transfer_gather support
Breaking Changes
VMFB Compatibility
- VMFB bytecode version unchanged (17.0)
- VMFBs compiled with v3.10.0 remain compatible with the v3.11.0 runtime
- No recompilation needed when upgrading from v3.10.0
Python Version Requirement
- Minimum Python version is now 3.10 (#23591)
Compiler Flag Renames
- iree-hip-* flags renamed to iree-rocm-* (#23420)
- Old flag names emit deprecation warnings but still work
- CMake: IREE_HIP_TEST_TARGET_CHIP → IREE_ROCM_TEST_TARGET_CHIP
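Build scripts that still pass the old flag spellings can be updated mechanically. The sketch below assumes the rename is a pure prefix substitution (iree-hip- to iree-rocm-) with the rest of each flag unchanged, which is how the notes above describe it but should be checked against the compiler's own help output. The function names and the example flag are illustrative only.

```python
import re

# Assumption: only the "iree-hip-" prefix changed; suffixes and values are kept.
_OLD_PREFIX = re.compile(r"^(--?)iree-hip-")


def migrate_flag(arg: str) -> str:
    """Rewrite a single command-line argument that uses the old prefix."""
    return _OLD_PREFIX.sub(r"\1iree-rocm-", arg)


def migrate_args(args):
    """Rewrite every argument in a command line, leaving other flags alone."""
    return [migrate_flag(arg) for arg in args]


if __name__ == "__main__":
    # Hypothetical command line, used only to show the rewrite.
    old = ["--iree-hal-target-device=hip", "--iree-hip-target=gfx942"]
    print(migrate_args(old))
```

Only arguments that begin with the old prefix are touched, so flag values that happen to contain "hip" are left as they are.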
Build System Changes
- Minimum CMake version bumped to 3.26 (#23607)
- Required for Python Stable ABI support
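To check a build machine against the new minimum before configuring, the version line printed by `cmake --version` can be parsed. This is a rough sketch; the helper names are hypothetical, and it assumes the usual "cmake version X.Y.Z" output format.

```python
import re

# Minimum (major, minor) CMake version stated in these release notes.
MIN_CMAKE = (3, 26)


def parse_cmake_version(output: str):
    """Extract (major, minor, patch) from the text of `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("could not find a CMake version string")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def cmake_is_supported(output: str, minimum=MIN_CMAKE) -> bool:
    """Return True if the reported CMake version meets the minimum."""
    return parse_cmake_version(output)[:2] >= minimum
```

In practice the input string would come from something like `subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout`.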
API Changes
- map_gather/map_scatter ops renamed to map_load/map_store in LinalgExt (#23481)
What's New
1. Compiler
1.1 Async Infrastructure & Tokenizers
Major new infrastructure for async I/O and text processing:
- Added proactor-based async I/O with causal frontier scheduling (iree/async/) (#23527)
- Added streaming tokenizer with full HuggingFace compatibility (iree/tokenizer/) (#23528)
- Graceful degradation for io_uring slab registration on RLIMIT_MEMLOCK (#23654)
- Added tiktoken format loader for OpenAI BPE vocabularies (#23663)
- Added async infrastructure for cross-process shared memory (#23688)
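For readers unfamiliar with the tiktoken vocabulary format mentioned above: as commonly distributed, a .tiktoken file is plain text with one entry per line, each a base64-encoded token followed by its integer rank. The sketch below parses that layout in Python purely as an illustration. It is not IREE's loader, and `parse_tiktoken` and the three-entry sample are hypothetical.

```python
import base64


def parse_tiktoken(text: str) -> dict:
    """Parse tiktoken-style vocabulary text into a {token_bytes: rank} map."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # Skip blank lines.
        # Each entry is "<base64 token> <rank>".
        token_b64, rank_text = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank_text)
    return ranks


# Tiny made-up vocabulary: "!", '"', and "hello".
SAMPLE = "IQ== 0\nIg== 1\naGVsbG8= 2\n"

if __name__ == "__main__":
    print(parse_tiktoken(SAMPLE))
```

Tokens are kept as bytes rather than strings because individual BPE tokens are not always valid UTF-8 on their own.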
1.2 Codegen & Vector Distribution
Significant improvements to vector distribution and code generation:
- Added support for shape_cast in vector distribution (#23307)
- Support for padding integer attention masks (#23430)
- Added arg_compare operation to VectorExt (#23386)
- Refactored transfer_gather to use unified indexing_maps (#23510)
- Added distribution pattern for iree_codegen.inner_tiled (#23483)
- Added vectorization support for iree_linalg_ext.arg_compare (#23440)
- Added transfer_gather unrolling (#23517)
- Support multi-batch gather vectorization to transfer_gather (#23552)
- Added transfer_gather canonicalizations for masking (#23565)
- Refactored VectorLayoutAnalysis into 2-phase forward/backward design (#23611)
- Added TransferScatterOp definition and verifier (#23666)
- Introduced VectorizableOpInterface and migrated all ops (#23653, #23656, #23658, #23662, #23712, #23713, #23767)
- Added iree_map dialect with PackMapAttr and VectorLayoutInterface (#23671, #23672)
- Added TransferScatterOp bufferization support (#23719)
- Materialize vector masking on VectorDistribute pipeline (#23679)
- Added vectorization of non-projected linalg.generic (#23664)
- Implemented ValueBoundsOpInterface for ToLayoutOp (#23766)
- Apply bounds to subgroup_id (#23768)
1.3 GPU Codegen Improvements
- Added multi-buffering support for gather_to_lds async copy mode (#23354)
- Enabled swizzling for scaled matmuls (#23175)
- Added CombineSourceLayoutTransformation pass for MapGatherOp (#23165)
- Reworked GPUVerifyDistribution to use PreOrder walk with skip (#23502)
- Combine CombineBarrierRegionsPass and CombineValueBarrierOps into a single pass, GPUCombineValueSemanticsBarriersPass (#23518)
- Added async copy mode pipelining for gather_to_lds (#23400)
- Move hoisting to interface and add it for barrier ops (#23519)
- GPU shared memory allocation based on layout analysis (#23631)
- Added iree_gpu.global_subgroup_barrier op (#23451)
- Added coalescing to reduction tiling (#23673)
- Make VectorReductionToGPU scf.forall-aware (#23686)
- Fixed shared memory estimation for multi-buffering (#23736)
- Added explicit async markers for multi-buffered async load pipelining (#23648)
1.4 GPU Heuristics
- Prefer larger MMA intrinsics for very large compute-bound GEMMs (#23641)
- Added min-based tile distribution for imbalanced M/N problems (#23619)
- Updated number of VGPRs on gfx1250 (#23709)
- Refactored MMA heuristic seeds to be architecture-specific (#23717)
1.5 CPU Backend
- Added CPU optimization level option (#23259)
- Configure GatherOp tiling sizes based on semantics (#23419)
- Tuning spec support for LLVMCPU (#23424)
- New heuristic for AArch64 matmul vector tile sizes (#22932)
- Enable masking by default for targets with AVX-512 (#23470)
- Dynamic attention support by tiling K1 when needed (#23544)
- Initial plumbing for inner_tiled with data-tiled MMA attribute (#23494)
- Propagate reduction tile sizes to producers for fusion (#23660)
- Use TileSwizzle for inner_tiled layout on CPU (#23705)
1.6 LDS & Memory Access
- Only enable coalesced DMA when elements are aligned to minimum transfer size (#23416)
- Pre-check to ensure all copies are DMA-convertible before converting any (#23472)
- Added in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion (#23365)
- Added fallback for CoalescedGatherDMA lowering (#23560)
1.7 PCF Operation Enhancements
- Fixed bufferization bugs for generic and loop ops (#23446)
- Added producer fusion into pcf.generic/loop ops (#23447)
- Added FuseSubgroupConsumers pass to fuse consumers and extract_slice ops into subgroup-scoped pcf.generic/loop ops (#23484)
- Added MemoryEffectsOpInterface to WriteSliceOp (#23490)
- Added tensor.collapse_shape fusion into pcf.generic/loop (#23491)
1.8 Dispatch Creation
- Moved iteration space tracking to LinalgExt (#23221)
- Ignore unit dims when comparing iteration spaces (#23362)
- Updated split reduction heuristics for GEMM (#23423)
- Fixed producer fusion with nested region uses (#23475)
- Set split reduction sizes for batch-first conv layouts (#23524)
- Fixed fusion of scalar reduction with consumer (#23659)
1.9 Target Backends
ROCm:
- Added workgroup reordering for data-tiling ukernels (#23358)
- Clear sticky error after hipErrorPeerAccessAlreadyEnabled (#23538)
- Added gfx950 f8e4m3fn ukernel (#23581)
SPIR-V:
- Enable small float support in SPIR-V pipeline (#23391)
- Reworked rootOp selection in kernel config (#23685)
- Enable scf.forall-based workgroup distribution (#23684)
VMVX:
- Enable sub-byte and small float support in VMVX pipeline (#23375)
- Enable scf.forall distribution for VMVX pipelines (#23615)
1.10 Other Compiler Improvements
- Added option to enable fp8/fp4 software emulation for all GPU targets (#23238)
- Layout for mma.sync.m16n8k16 (#22847)
- Generalized ConvertBf16ToUInt16Buffers to support fp8 types (#23389)
- Moved Convert1X1FilterConv2DToMatmul pass from GlobalOptimization to Preprocessing (#23445)
- Pattern to hoist expand_shape & collapse_shape from scf.for loop (#23572)
- Torch: Added flag to enable shape refinement (#23632)
- Support for Img2Col Transformation for Conv2D including quantized types (#23278)
- Added PipelineAttrInterface and PassPipelineAttr (#23590)
- Added HAL pass pipeline caching for executable translation (#23643)
- Added tuner SMT ops: constraints, knobs, assert (iree_codegen.smt.*) (#23687, #23742, #23743, #23780)
- Deleted LinalgExt::PackOp and LinalgExt::UnPackOp (#23550)
- Added MultiPipelineNest for cross-type parallelism (#23620)
- Deleted ConvertToDestinationPassingStylePass pass (#23783)
2. Runtime
2.1 VM Improvements
- Added sub-byte integer support to ArithToVM (#23372)
- Added small float type support to VM conversion (#23373)
- Added buffer ops for 8-bit and 16-bit floats in UtilToVM (#23374)
2.2 HAL & Device Infrastructure
- Added device topology infrastructure to HAL (#23573)
- Added iree_hal_device_group_t to own device topology lifecycle (#23576)
- Fixed Vulkan driver crash from UNIMPLEMENTED query_capabilities (#23582)
- Added samples/hal/hello: pure HAL buffer fill, copy, and readback (#23645)
2.3 Utilities
- Math.h improvements: Use popcount builtins, cleanup in float conversions (#23385)
- Added util.string operations for runtime string formatting (#23425)
- Added dynamic parameter scope and key with !util.buffer operands (#23426)
- Moved flags from iree/base/internal to iree/base/tooling (#23578)
- Added MPSC (multi-producer single-consumer) queue (#23700)
- Added huge page and NUMA placement support for SHM (#23697)
- Replaced libc printf with eyalroz/printf and added streaming status formatting (#23694)
- Added status copy allocation and payload inspection APIs (#23698)
- Removed 64-operation batch size limit from io_uring submit (#23725)
3. Tools & Bindings
3.1 C API
- Added C API support for --iree-codegen-tuning-spec-path flag (#23320)
- Exposed XOR shuffle bounds and validation functions in CAPI (#23442)
- Added --exclude-libs=ALL to libIREECompiler.so shared library (#23574)
3.2 HAL CTS
- Rewrote the HAL CTS to support Bazel and scale better (#23644)
4. Infrastructure & CI
- Removed MI250 testing (#23352)
- Moved linux_x64_clang_debug to postsubmit (#23434)
- Disabled internal linkage clang-tidy checks (#23569)
- Cleaned up RISC-V toolchain files (#23457)
- Added typos pre-commit hook and dictionary (#23606)
5. Documentation
- Added lowering config guide (reduction; partial reduction only) (#22250)
- Fixed invalid flag names and typos (#23542)
- Fixed reference warnings and updated metal-hal-driver.md (#23547)
- Added IREECPU and PCF dialects and passes to website docs (#23546)
- Updated Python versions listed on the website (#23647)
Bug Fixes
Compiler
- Fixed region-aware SCF handling in Explorer and ElideAsyncCopiesPass (#23257)
- Fixed DenseMap iterator invalidation in OptionsBinder::topLevelOpt (#23412)
- Fixed clone-through-barrier lifetime regression in resource usage analysis (#23417)
- Fixed barrier handling when pipelining with non-transfer_read ops (#23435)
- Fixed wrong output_shape of tensor.expand_shape ops (#23469)
- Fixed EmplaceTransientsPass to handle zero allocas case (#23392)
- Fixed suffix scan lowering in StableHLO reduce_window conversion (#23397)
- Fixed promote_operands/promotion_types size mismatch (#23543)
- Fixed layout analysis fixup crashes (#23630)
- Fixes for UBSan compatibility across the runtime (#23692)
- Do not vectorize with invalid vector sizes from lowering config (#23661)
- Fixed uninitialized tensorCoreType causing flaky vector_to_gpu tests (#23701)
- Bail out of foldReshapeIntoMapStore/MapLoad for 0-D tensor (#23716)
- Fixed im2col decomposition to handle multiple K input positions (#23731)
- Fixed crash in ROCDLLoadToTransposeLoad on block argument column index (#23755)
- Fixed correctness issues in ConvertGatherToLDS narrow type emulation (#23763)
- Fixed ub.poison legalization failure for 2D vectors in SPIR-V (#23789)
- Fixed crash in complex matmul configuration logic on ROCm (#23790)
- Fixed vector distribution for transposed outputs on ROCm (#23791)
Runtime
- Fixed RISC-V test (#23405)
- Fixed Async macOS CTS test flakes: dangling stack ops, RST detection, kqueue event loss (#23570)
- Fixed Async multishot CTS test flakes (#23577)
- Fixed VM ref leak from incorrect MOVE bit on branch block args (#23689)
- Fixed ARM64 ring buffer oversubscription from load reordering (#23707)
- Async proactor fixes: TSAN bridge and progress callback starvation (#23699)
- Fixed hipHostUnregister/cuMemHostUnregister leak on import (#23779)
Full Changelog
Commits: v3.10.0...iree-3.11.0rc20260316
Contributors
Thank you to all contributors to this release!
Contributors:
@AWoloszyn, @AaronStGeorge, @Abhishek-Varma, @Groverkss, @HanKuanChen, @Hardcode84, @IanWood1, @MaheshRavishankar, @Max191, @Muzammiluddin-Syed-ECE, @RattataKing, @ScottTodd, @YashDeshpande25, @Yu-Zhewen, @amd-eochoalo, @bangtianliu, @benvanik, @bjacob, @efric, @egebeysel, @hanhanW, @javidcf, @jerryyin, @josephbak, @jtuyls, @keshavvinayak01, @krzysz00, @kuhar, @lialan, @momchil-velikov, @nirvedhmeshram, @phoebesv, @qedawkins, @rkayaith, @sa-faizal, @sjain-stanford, @sommerlukas, @stellaraccident, @yzhang93, @ziereis
Release notes generated from release candidate iree-3.11.0rc20260316