Fallback SVE depending on vector lanes #9084

Open

stevesuzuki-arm wants to merge 5 commits into halide:main from stevesuzuki-arm:pr-fallback_sve

Conversation

@stevesuzuki-arm
Contributor

While Halide accepts an arbitrary vectorization factor, compiling
it for an SVE target with LLVM's scalable vector types has some challenges:

  • Only vectors whose lane count is a multiple of vscale are representable
  • The backend compiler is prone to crashing on odd lane counts

This problem shows up more frequently when running the existing
Halide unit tests on a target with vector_bits longer than 128 bits,
because in some cases the vectorization factor is too short, or is an
unusual value chosen for corner-case testing.

Lowering everything with predicates might be an option, but that
would require invasive changes and its feasibility is unknown.
The other option is to convert to/from fixed-sized vectors, but
this raises the issue of mixing fixed and scalable vectors in an
intrinsic, incurs a performance penalty, and is infeasible on
targets without NEON.

To work around this problem, we inspect the vector lanes used in a
function, and if we find any lane count that is problematic for the
given target.vector_bits, we stop using scalable vectors entirely
(i.e. set effective_vscale = 0).

The intrinsic map for Call is created for both NEON and SVE at
init_module, and is dynamically selected for each function to compile.

When this fallback happens, the user is warned that SVE is disabled.
Note that this situation can typically be resolved by vectorizing with
a longer, power-of-two factor.
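The fallback decision described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `check_feasible_vscale` in `src/CodeGen_ARM.cpp`: the function name suffix, the fixed `natural_bits = 128`, and the exact policy are assumptions made for the sketch. The idea is simply that any lane count which is not a multiple of vscale forces the fallback to fixed vectors.

```cpp
#include <cassert>
#include <set>

// Sketch: decide whether scalable vectors are usable for the lane counts
// collected from a function. Returns the effective vscale, or 0 to signal
// "fall back to fixed-width (non-SVE) codegen".
int check_feasible_vscale_sketch(int vector_bits, const std::set<int> &lanes_used) {
    const int natural_bits = 128;              // base width vscale is defined against
    const int vscale = vector_bits / natural_bits;  // e.g. 256 / 128 = 2
    for (int lanes : lanes_used) {
        if (lanes <= 1) {
            continue;  // scalars are always representable
        }
        // A scalable vector <vscale x n x T> holds n * vscale elements,
        // so a lane count must be a positive multiple of vscale.
        if (lanes % vscale != 0) {
            return 0;  // problematic lane count: disable scalable vectors entirely
        }
    }
    return vscale;
}
```

For example, with vector_bits = 256 (vscale = 2), lane counts {4, 8} are fine, but a lane count of 3 would trigger the fallback and the warning that SVE is disabled.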

Checklist

  • Tests added or updated (not required for docs, CI config, or typo fixes)
  • Documentation updated (if public API changed)
  • Python bindings updated (if public API changed)
  • Benchmarks are included here if the change is intended to affect performance.
  • Commits include AI attribution where applicable (see Code of Conduct)

@abadams
Member

abadams commented Mar 30, 2026

I'm hoping the vector width legalization pass in #8629 might make this unnecessary. Let's try to get that merged.

@abadams
Member

abadams commented Apr 10, 2026

Actually, no, it won't help, because it doesn't pad vectors out to native length. At the dev meeting we decided that we should proceed with this PR to unblock other SVE work.

Comment thread src/CodeGen_ARM.cpp Outdated
// TODO: Target::SVE not supported https://github.com/halide/Halide/issues/8872
feasible_vscale = 0;
if (target.features_any_of({Target::SVE2})) {
VectorLanesGatherer vector_lanes_gatherer;
Member


Simpler alternative would be something like:

std::set<int> lanes_used;
mutate_with(func.body, [&](auto *self, const Expr &e) {
  lanes_used.insert(e.type().lanes()); 
  self->mutate_base(e);
});

That should visit every Expr node, which should see all the expr types.

Comment thread src/CodeGen_ARM.cpp Outdated
feasible_vscale = check_feasible_vscale(target.vector_bits, vector_lanes_gatherer.lanes_used, simple_name);
std::set<int> lanes_used;
mutate_with(func.body, [&](auto *self, const Expr &e) {
Type t = e.type();
Member

@abadams abadams Apr 14, 2026


I'm confused about some of this code. The ramp and broadcast nodes look like they don't do anything. Isn't t already that type? And why does lowering an intrinsic change the type? Why can't this function just look at t.lanes() (except for handles)?

Contributor Author


What we ideally want to check here is which vector lanes we will potentially see during code-gen. In Ramp and Broadcast, op->lanes can be different from op->type.lanes(). lower_intrinsic, which is called in CodeGen_LLVM, can potentially generate a different type and lane count. The idea is borrowed from CodeGen_C. Does that make sense? I don't think this approach can perfectly catch every vector lane count in LLVM::Value, so it is a kind of mitigation. If there is a better approach, I'm happy to change it.

Member


For Ramp and Broadcast, the lanes field is equal to the type's lanes field after the vector flattening pass has run, so they should be the same at this point. And even if they weren't, I think the thing that is going to get generated is a vector with the Ramp type's lanes, not the Ramp node's lanes. I don't understand why CodeGen_C does something special here. lower_intrinsic doesn't do any vector reinterprets (I think), so it should never change the number of lanes either. CodeGen_C just wants to look at it to see if it introduces new element types. Do you have any cases that fail if this just looks at t.lanes()?

Contributor Author


I've done an experiment where I collect lanes in the two different ways and assert that they match. It turned out that specializing Ramp, Broadcast, and Call makes no difference, at least in the Halide unit tests. Also, removing the handle case does not cause any test failure. I've updated the PR with the simplified code. Thank you for this feedback.

@zvookin
Member

zvookin commented Apr 15, 2026

I don't have an objection to the PR in that it looks to make things better without breaking anything else, but really this is another chunk of complexity to deal with a failed design at the LLVM level. Is there any chance of things improving on the LLVM side to allow reliable fixed width targeting of SVE2? A major use case is WASM flexible vectors which are generally JIT compiled so the exact vector width is known at compile time. This should work without using vscale types at all.

At this point the main value in this feature is for Halide targeting the more restricted SVE2 subset stuff for neural network applications that is actually showing up in hardware. My guess is we have to go around LLVM entirely to do that via using inline assembly blocks.

@abadams
Member

abadams commented Apr 15, 2026

This is intended to be temporary. PR #8629 adds a pass that forces vectors to be a size the backend can accept by slicing up too-wide vectors. It uses it for source-based GPU APIs (e.g. metal). It doesn't have all the features necessary to make this PR unnecessary, but it's close. The plan is to merge both to unblock people, then extend the legalization pass to handle padding sub-width vectors, then delete the check in this PR and instead legalize. That should mean the sve2 backend can handle anything, because we'll do the legalization ourselves on the Halide side instead of relying on LLVM.

So yes, it's more complexity on the Halide side, but ultimately vector width legalization is complexity we also need for some of the non-LLVM backends.
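The legalization plan described above has two directions: slicing too-wide vectors into native-width pieces (what #8629 already does) and padding sub-width vectors up to a legal size (the planned extension). A minimal arithmetic sketch of both, with all names invented for illustration rather than taken from the Halide codebase:

```cpp
#include <cassert>
#include <vector>

// Split an operation over `lanes` lanes into chunks of at most
// `native_lanes` each, returning each chunk's lane count. The last chunk
// may be short; that is exactly the case the padding step would handle.
std::vector<int> slice_lanes(int lanes, int native_lanes) {
    std::vector<int> chunks;
    while (lanes > native_lanes) {
        chunks.push_back(native_lanes);
        lanes -= native_lanes;
    }
    chunks.push_back(lanes);
    return chunks;
}

// Round a short lane count up to the nearest multiple of the native width,
// so the backend only ever sees legal vector sizes.
int pad_lanes(int lanes, int native_lanes) {
    return ((lanes + native_lanes - 1) / native_lanes) * native_lanes;
}
```

For instance, a 10-lane op with 4 native lanes would be sliced into chunks of 4, 4, and 2, and the trailing 2-lane chunk would then be padded up to 4. Once both halves exist on the Halide side, the feasibility check in this PR becomes unnecessary.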

@zvookin
Member

zvookin commented Apr 16, 2026

So yes, it's more complexity on the Halide side, but ultimately vector width legalization is complexity we also need for some of the non-LLVM backends.

The "complexity" I'm worried about here is that the effective vscale is no longer constant during compilation. It shouldn't be a huge problem, as it is either the value it was before this PR or zero, but reasoning about the code definitely gets trickier. It might be better to introduce a type that encodes a vector type constraint if one is needed, the effective vscale, and whether vectorization is suspended due to a length mismatch. This would be passed consistently, rather than the current selection of zero to three of those values as separate arguments in various places. But really I expect this all gets ripped out when vector length canonicalization happens.
