Release v3.11.0

@sa-faizal released this 19 Mar 23:25
· 277 commits to main since this release
e4a3b04

IREE v3.11.0 Release Notes

Release Candidate: iree-3.11.0rc20260316
Commits: ~539 commits since v3.10.0
VMFB Bytecode Version: 17.0 (unchanged from v3.10.0)


Highlights

  • New async I/O infrastructure: Proactor-based async I/O with causal frontier scheduling, enabling cross-process shared memory support
  • Streaming tokenizer: Full HuggingFace-compatible tokenizer with tiktoken format support for OpenAI BPE vocabularies
  • Python 3.10+ requirement: Minimum Python version bumped to 3.10; Python 3.12+ supported via Stable ABI (abi3)
  • ROCm flag rename: iree-hip-* compiler flags renamed to iree-rocm-* (old names deprecated with warnings)
  • Enhanced vector distribution: Refactored 2-phase forward/backward layout analysis with improved transfer_gather support

Breaking Changes

VMFB Compatibility

  • VMFB bytecode version unchanged (17.0): VMFBs compiled with v3.10.0 remain compatible with the v3.11.0 runtime
    • No recompilation needed when upgrading from v3.10.0

Python Version Requirement

  • Minimum Python version is now 3.10 (#23591)
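
For scripts that wrap the Python packages, a small guard can report the new minimum early instead of failing later at import time. This is a minimal sketch; the helper names `meets_minimum` and `require_minimum` are hypothetical and not part of any IREE package.

```python
import sys

# Minimum version stated in these release notes.
MIN_PYTHON = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def require_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Raise RuntimeError with a readable message if the interpreter is too old."""
    if not meets_minimum(version_info, minimum):
        found = ".".join(str(part) for part in version_info[:2])
        needed = ".".join(str(part) for part in minimum)
        raise RuntimeError(f"Python {needed}+ is required, found {found}")
```

Calling `require_minimum()` at the top of a launcher script is one way to use it; the comparison looks only at the major and minor version.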

Compiler Flag Renames

  • iree-hip-* flags renamed to iree-rocm-* (#23420)
    • Old flag names emit deprecation warnings but still work
    • CMake: IREE_HIP_TEST_TARGET_CHIP → IREE_ROCM_TEST_TARGET_CHIP
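
Build scripts that pass the old flag names can be migrated mechanically. The sketch below assumes the rename is a pure prefix substitution (`iree-hip-` to `iree-rocm-`), as the bullet above describes; it does not check that each renamed flag actually exists. The flag `--iree-hip-some-option` used in the usage note is a hypothetical placeholder, not a real compiler flag.

```python
import re

# Match the old prefix only at the start of a flag, after one or two dashes,
# so values that merely contain "hip" elsewhere are left alone.
_OLD_PREFIX = re.compile(r"^(-{1,2})iree-hip-")


def migrate_flag(arg: str) -> str:
    """Rewrite a single command-line argument from iree-hip-* to iree-rocm-*."""
    return _OLD_PREFIX.sub(r"\1iree-rocm-", arg)


def migrate_args(args):
    """Rewrite every argument in a list, preserving order and unrelated flags."""
    return [migrate_flag(arg) for arg in args]
```

For example, `migrate_args(["--iree-hip-some-option=1", "input.mlir"])` rewrites only the first element and leaves the file name unchanged.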

Build System Changes

  • Minimum CMake version bumped to 3.26 (#23607)
    • Required for Python Stable ABI support

API Changes

  • map_gather/map_scatter ops renamed to map_load/map_store in LinalgExt (#23481)
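
Hand-written or checked-in MLIR that still uses the old op names can be updated with a text rewrite. This is a rough sketch that assumes the ops are spelled with the `iree_linalg_ext.` dialect prefix and that only the op name changed; if the operand or attribute syntax also changed in #23481, a text substitution alone will not be enough.

```python
import re

# Old op name -> new op name, as listed in these release notes.
RENAMES = {"map_gather": "map_load", "map_scatter": "map_store"}

# Word boundaries keep longer identifiers such as "map_gather_foo" untouched.
_PATTERN = re.compile(r"\biree_linalg_ext\.(map_gather|map_scatter)\b")


def rename_ops(mlir_text: str) -> str:
    """Replace the old LinalgExt op names with the new ones in MLIR text."""
    return _PATTERN.sub(
        lambda match: "iree_linalg_ext." + RENAMES[match.group(1)], mlir_text
    )
```

The function works on plain text and does not parse MLIR, so the result should still be checked with the compiler.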

What's New

1. Compiler

1.1 Async Infrastructure & Tokenizers

Major new infrastructure for async I/O and text processing:

  • Added proactor-based async I/O with causal frontier scheduling (iree/async/) (#23527)
  • Added streaming tokenizer with full HuggingFace compatibility (iree/tokenizer/) (#23528)
  • Graceful degradation for io_uring slab registration on RLIMIT_MEMLOCK (#23654)
  • Added tiktoken format loader for OpenAI BPE vocabularies (#23663)
  • Added async infrastructure for cross-process shared memory (#23688)
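
For readers unfamiliar with the tiktoken vocabulary format mentioned above: each line of a `.tiktoken` file holds a base64-encoded token followed by its integer rank. The sketch below parses that layout in plain Python to show what such a loader consumes. It is not IREE's loader, and the function name `parse_tiktoken` is hypothetical.

```python
import base64


def parse_tiktoken(text: str) -> dict:
    """Parse tiktoken-format text into a {token_bytes: rank} dictionary.

    Each non-empty line is expected to look like "<base64 token> <rank>".
    """
    ranks = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<base64> <rank>'")
        token_b64, rank_str = parts
        ranks[base64.b64decode(token_b64)] = int(rank_str)
    return ranks
```

Tokens are kept as `bytes` rather than `str` because BPE vocabularies can contain byte sequences that are not valid UTF-8 on their own.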

1.2 Codegen & Vector Distribution

Significant improvements to vector distribution and code generation:

  • Added support for shape_cast in vector distribution (#23307)
  • Support for padding integer attention masks (#23430)
  • Added arg_compare operation to VectorExt (#23386)
  • Refactored transfer_gather to use unified indexing_maps (#23510)
  • Added distribution pattern for iree_codegen.inner_tiled (#23483)
  • Added vectorization support for iree_linalg_ext.arg_compare (#23440)
  • Added transfer_gather unrolling (#23517)
  • Support multi-batch gather vectorization to transfer_gather (#23552)
  • Added transfer_gather canonicalizations for masking (#23565)
  • Refactored VectorLayoutAnalysis into 2-phase forward/backward design (#23611)
  • Added TransferScatterOp definition and verifier (#23666)
  • Introduced VectorizableOpInterface and migrated all ops (#23653, #23656, #23658, #23662, #23712, #23713, #23767)
  • Added iree_map dialect with PackMapAttr and VectorLayoutInterface (#23671, #23672)
  • Added TransferScatterOp bufferization support (#23719)
  • Materialize vector masking on VectorDistribute pipeline (#23679)
  • Added vectorization of non-projected linalg.generic (#23664)
  • Implemented ValueBoundsOpInterface for ToLayoutOp (#23766)
  • Apply bounds to subgroup_id (#23768)

1.3 GPU Codegen Improvements

  • Added multi-buffering support for gather_to_lds async copy mode (#23354)
  • Enabled swizzling for scaled matmuls (#23175)
  • Added CombineSourceLayoutTransformation pass for MapGatherOp (#23165)
  • Reworked GPUVerifyDistribution to use PreOrder walk with skip (#23502)
  • Merged CombineBarrierRegionsPass and CombineValueBarrierOps into a single pass, GPUCombineValueSemanticsBarriersPass (#23518)
  • Added async copy mode pipelining for gather_to_lds (#23400)
  • Moved hoisting to an interface and added it for barrier ops (#23519)
  • GPU shared memory allocation based on layout analysis (#23631)
  • Added iree_gpu.global_subgroup_barrier op (#23451)
  • Added coalescing to reduction tiling (#23673)
  • Made VectorReductionToGPU scf.forall-aware (#23686)
  • Fixed shared memory estimation for multi-buffering (#23736)
  • Added explicit async markers for multi-buffered async load pipelining (#23648)

1.4 GPU Heuristics

  • Prefer larger MMA intrinsics for very large compute-bound GEMMs (#23641)
  • Added min-based tile distribution for imbalanced M/N problems (#23619)
  • Updated number of VGPRs on gfx1250 (RDNA4) (#23709)
  • Refactored MMA heuristic seeds to be architecture-specific (#23717)

1.5 CPU Backend

  • Added CPU optimization level option (#23259)
  • Configure GatherOp tiling sizes based on semantics (#23419)
  • Tuning spec support for LLVMCPU (#23424)
  • New heuristic for AArch64 matmul vector tile sizes (#22932)
  • Enable masking by default for targets with AVX-512 (#23470)
  • Dynamic attention support by tiling K1 when needed (#23544)
  • Initial plumbing for inner_tiled with data-tiled MMA attribute (#23494)
  • Propagate reduction tile sizes to producers for fusion (#23660)
  • Use TileSwizzle for inner_tiled layout on CPU (#23705)

1.6 LDS & Memory Access

  • Only enable coalesced DMA when elements are aligned to minimum transfer size (#23416)
  • Pre-check to ensure all copies are DMA-convertible before converting any (#23472)
  • Added in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion (#23365)
  • Added fallback for CoalescedGatherDMA lowering (#23560)

1.7 PCF Operation Enhancements

  • Fixed bufferization bugs for generic and loop ops (#23446)
  • Added producer fusion into pcf.generic/loop ops (#23447)
  • Added FuseSubgroupConsumers pass to fuse consumers and extract_slice ops into subgroup-scoped pcf.generic/loop ops (#23484)
  • Added MemoryEffectsOpInterface to WriteSliceOp (#23490)
  • Added tensor.collapse_shape fusion into pcf.generic/loop (#23491)

1.8 Dispatch Creation

  • Moved iteration space tracking to LinalgExt (#23221)
  • Ignore unit dims when comparing iteration spaces (#23362)
  • Updated split reduction heuristics for GEMM (#23423)
  • Fixed producer fusion with nested region uses (#23475)
  • Split reduction sizes set for batch-first conv layouts (#23524)
  • Fixed fusion of scalar reduction with consumer (#23659)

1.9 Target Backends

ROCm:

  • Added workgroup reordering for data-tiling ukernels (#23358)
  • Clear sticky error after hipErrorPeerAccessAlreadyEnabled (#23538)
  • Added gfx950 f8e4m3fn ukernel (#23581)

SPIR-V:

  • Enable small float support in SPIR-V pipeline (#23391)
  • Reworked rootOp selection in kernel config (#23685)
  • Enable scf.forall-based workgroup distribution (#23684)

VMVX:

  • Enable sub-byte and small float support in VMVX pipeline (#23375)
  • Enable scf.forall distribution for VMVX pipelines (#23615)

1.10 Other Compiler Improvements

  • Added option to enable fp8/fp4 software emulation for all GPU targets (#23238)
  • Layout for mma.sync.m16n8k16 (#22847)
  • Generalized ConvertBf16ToUInt16Buffers to support fp8 types (#23389)
  • Moved Convert1X1FilterConv2DToMatmul pass from GlobalOptimization to Preprocessing (#23445)
  • Pattern to hoist expand_shape & collapse_shape from scf.for loop (#23572)
  • Torch: Added flag to enable shape refinement (#23632)
  • Support for Img2Col Transformation for Conv2D including quantized types (#23278)
  • Added PipelineAttrInterface and PassPipelineAttr (#23590)
  • Added HAL pass pipeline caching for executable translation (#23643)
  • Added tuner SMT ops: constraints, knobs, assert (iree_codegen.smt.*) (#23687, #23742, #23743, #23780)
  • Deleted LinalgExt::PackOp and LinalgExt::UnPackOp (#23550)
  • Added MultiPipelineNest for cross-type parallelism (#23620)
  • Deleted ConvertToDestinationPassingStylePass pass (#23783)

2. Runtime

2.1 VM Improvements

  • Added sub-byte integer support to ArithToVM (#23372)
  • Added small float type support to VM conversion (#23373)
  • Added buffer ops for 8-bit and 16-bit floats in UtilToVM (#23374)

2.2 HAL & Device Infrastructure

  • Added device topology infrastructure to HAL (#23573)
  • Added iree_hal_device_group_t to own device topology lifecycle (#23576)
  • Fixed Vulkan driver crash from UNIMPLEMENTED query_capabilities (#23582)
  • Added samples/hal/hello: pure HAL buffer fill, copy, and readback (#23645)

2.3 Utilities

  • Math.h improvements: use popcount builtins and clean up float conversions (#23385)
  • Added util.string operations for runtime string formatting (#23425)
  • Added dynamic parameter scope and key with !util.buffer operands (#23426)
  • Moved flags from iree/base/internal to iree/base/tooling (#23578)
  • Added MPSC (multi-producer single-consumer) queue (#23700)
  • Added huge page and NUMA placement support for SHM (#23697)
  • Replaced libc printf with eyalroz/printf and added streaming status formatting (#23694)
  • Added status copy allocation and payload inspection APIs (#23698)
  • Removed 64-operation batch size limit from io_uring submit (#23725)

3. Tools & Bindings

3.1 C API

  • Added C API support for --iree-codegen-tuning-spec-path flag (#23320)
  • Exposed XOR shuffle bounds and validation functions in CAPI (#23442)
  • Added --exclude-libs=ALL to libIREECompiler.so shared library (#23574)

3.2 HAL CTS

  • Rewrote the HAL CTS to support Bazel and scale better (#23644)

4. Infrastructure & CI

  • Removed MI250 testing (#23352)
  • Moved linux_x64_clang_debug to postsubmit (#23434)
  • Disabled internal linkage clang-tidy checks (#23569)
  • Cleaned up RISC-V toolchain files (#23457)
  • Added typos pre-commit hook and dictionary (#23606)

5. Documentation

  • Added lowering config guide (reduction; partial reduction only) (#22250)
  • Fixed invalid flag names and typos (#23542)
  • Fixed reference warnings and updated metal-hal-driver.md (#23547)
  • Added IREECPU and PCF dialects and passes to website docs (#23546)
  • Updated Python versions listed on the website (#23647)

Bug Fixes

Compiler

  • Fixed region-aware SCF handling in Explorer and ElideAsyncCopiesPass (#23257)
  • Fixed DenseMap iterator invalidation in OptionsBinder::topLevelOpt (#23412)
  • Fixed clone-through-barrier lifetime regression in resource usage analysis (#23417)
  • Fixed barrier handling when pipelining with non-transfer_read ops (#23435)
  • Fixed wrong output_shape of tensor.expand_shape ops (#23469)
  • Fixed EmplaceTransientsPass to handle zero allocas case (#23392)
  • Fixed suffix scan lowering in StableHLO reduce_window conversion (#23397)
  • Fixed promote_operands/promotion_types size mismatch (#23543)
  • Fixed layout analysis fixup crashes (#23630)
  • Fixes for UBSan compatibility across the runtime (#23692)
  • Do not vectorize with invalid vector sizes from lowering config (#23661)
  • Fixed uninitialized tensorCoreType causing flaky vector_to_gpu tests (#23701)
  • Bail out of foldReshapeIntoMapStore/MapLoad for 0-D tensor (#23716)
  • Fixed im2col decomposition to handle multiple K input positions (#23731)
  • Fixed crash in ROCDLLoadToTransposeLoad on block argument column index (#23755)
  • Fixed correctness issues in ConvertGatherToLDS narrow type emulation (#23763)
  • Fixed ub.poison legalization failure for 2D vectors in SPIR-V (#23789)
  • Fixed crash in complex matmul configuration logic on ROCm (#23790)
  • Fixed vector distribution for transposed outputs on ROCm (#23791)

Runtime

  • Fixed RISC-V test (#23405)
  • Fixed Async macOS CTS test flakes: dangling stack ops, RST detection, kqueue event loss (#23570)
  • Fixed Async multishot CTS test flakes (#23577)
  • Fixed VM ref leak from incorrect MOVE bit on branch block args (#23689)
  • Fixed ARM64 ring buffer oversubscription from load reordering (#23707)
  • Async proactor fixes: TSAN bridge and progress callback starvation (#23699)
  • Fixed hipHostUnregister/cuMemHostUnregister leak on import (#23779)

Full Changelog

Commits: v3.10.0...iree-3.11.0rc20260316


Contributors

Thank you to all contributors to this release!

Contributors:
@AWoloszyn, @AaronStGeorge, @Abhishek-Varma, @Groverkss, @HanKuanChen, @Hardcode84, @IanWood1, @MaheshRavishankar, @Max191, @Muzammiluddin-Syed-ECE, @RattataKing, @ScottTodd, @YashDeshpande25, @Yu-Zhewen, @amd-eochoalo, @bangtianliu, @benvanik, @bjacob, @efric, @egebeysel, @hanhanW, @javidcf, @jerryyin, @josephbak, @jtuyls, @keshavvinayak01, @krzysz00, @kuhar, @lialan, @momchil-velikov, @nirvedhmeshram, @phoebesv, @qedawkins, @rkayaith, @sa-faizal, @sjain-stanford, @sommerlukas, @stellaraccident, @yzhang93, @ziereis


Release notes generated from release candidate iree-3.11.0rc20260316