Fallback SVE depending on vector lanes #9084

stevesuzuki-arm wants to merge 5 commits into halide:main
Conversation
While Halide accepts an arbitrary vectorization factor, compiling it for an SVE target with LLVM's scalable vector types has some challenges:

- Only vectors whose lane count is a multiple of vscale are representable.
- The backend compiler is prone to crashes for odd lane counts.

This problem happens more frequently when running the existing Halide unit tests on a target whose vector_bits is longer than 128 bits, because in some cases the vectorization factor is too short, or is an unusual value chosen for corner-case testing.

Lowering everything with predicates might be an option; however, that would require invasive changes and its feasibility is unknown. The other option is to convert to/from fixed-size vectors, but that causes the issue of mixing fixed and scalable vectors in an intrinsic, incurs a performance penalty, and is also infeasible on targets without NEON.

To work around this problem, we inspect the vector lanes used in a function, and if we find any lane count that is problematic for the given target.vector_bits, we stop using scalable vectors entirely (i.e. set effective_vscale = 0). The intrinsic map for Call is created for both NEON and SVE in init_module, and dynamically selected for each function to compile. When this fallback happens, the user is warned that SVE is disabled. Note that this situation can typically be resolved by vectorizing with a longer, power-of-two factor.
I'm hoping the vector width legalization pass in #8629 might make this unnecessary. Let's try to get that merged.
Actually, no, it won't help, because it doesn't pad vectors out to native length. At the dev meeting we decided that we should proceed with this PR to unblock other SVE work.
```cpp
// TODO: Target::SVE not supported https://github.com/halide/Halide/issues/8872
feasible_vscale = 0;
if (target.features_any_of({Target::SVE2})) {
    VectorLanesGatherer vector_lanes_gatherer;
```
Simpler alternative would be something like:

```cpp
std::set<int> lanes_used;
mutate_with(func.body, [&](auto *self, const Expr &e) {
    lanes_used.insert(e.type().lanes());
    self->mutate_base(e);
});
```
That should visit every Expr node, which should see all the expr types.
```cpp
feasible_vscale = check_feasible_vscale(target.vector_bits, vector_lanes_gatherer.lanes_used, simple_name);
std::set<int> lanes_used;
mutate_with(func.body, [&](auto *self, const Expr &e) {
    Type t = e.type();
```
I'm confused about some of this code. The Ramp and Broadcast nodes look like they don't do anything. Isn't t already that type? And why does lowering an intrinsic change the type? Why can't this function just look at t.lanes() (except for handles)?
What we ideally want to check here is which vector lanes we will potentially see during codegen. In Ramp and Broadcast, op->lanes can be different from op->type.lanes(). lower_intrinsic, called in CodeGen_LLVM, potentially generates a different type and lanes. The idea is borrowed from CodeGen_C. Does that make sense? I don't think this approach can perfectly catch the vector lanes in LLVM::Value, so it is a kind of mitigation. If there is a better approach, I'm happy to change it.
For Ramp and Broadcast, the lanes field is equal to the type's lanes field after the vector flattening pass has run, so they should be the same at this point. And even if it weren't, I think the thing that is going to get generated is a vector with the Ramp-type's lanes, not the Ramp-node's lanes. I don't understand why CodeGen_C does something special here. lower_intrinsic doesn't do any vector reinterprets (I think), so it should never change the number of lanes either. CodeGen_C just wants to look at it to see if it introduces new element types. Do you have any cases that fail if this just looks at t.lanes()?
I've done an experiment where I collect lanes in two different ways and assert that they match. As a result, it turned out that specializing Ramp, Broadcast, and Call does not make any difference, at least in the Halide unit tests. Also, removing the handle case does not cause any test failure. I've updated the PR with simplified code. Thank you for this feedback.
I don't have an objection to the PR, in that it looks to make things better without breaking anything else, but really this is another chunk of complexity to deal with a failed design at the LLVM level. Is there any chance of things improving on the LLVM side to allow reliable fixed-width targeting of SVE2? A major use case is WASM flexible vectors, which are generally JIT compiled, so the exact vector width is known at compile time. This should work without using vscale types at all. At this point the main value in this feature is Halide targeting the more restricted SVE2 subset that is actually showing up in hardware for neural network applications. My guess is we have to go around LLVM entirely to do that, via inline assembly blocks.
This is intended to be temporary. PR #8629 adds a pass that forces vectors to be a size the backend can accept by slicing up too-wide vectors. It uses this for source-based GPU APIs (e.g. Metal). It doesn't have all the features necessary to make this PR unnecessary, but it's close. The plan is to merge both to unblock people, then extend the legalization pass to handle padding sub-width vectors, then delete the check in this PR and legalize instead. That should mean the SVE2 backend can handle anything, because we'll do the legalization ourselves on the Halide side instead of relying on LLVM. So yes, it's more complexity on the Halide side, but ultimately vector width legalization is complexity we also need for some of the non-LLVM backends.
The "complexity" I'm worried about here is that the effective vscale is no longer constant during compilation. It shouldn't be a huge problem, as it is either the value it had before this PR or zero, but reasoning about the code definitely gets trickier. It might be better to introduce a type that encodes a vector type constraint (if one is needed), the effective vscale, and whether vectorization is suspended due to a length mismatch. This would be passed consistently, rather than the current selection of zero to three of those values as separate arguments in various places. But really I expect this all gets ripped out when vector length canonicalization happens.