As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.
This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.
Differential Revision: https://reviews.llvm.org/D96287
It appears that pointer types were causing issues for the min/max cost
code in getIntrinsicInstrCost. This makes sure that when matching
icmp/select to a min/max, we only do that for normal int or float types.
A v8i32 compare will produce a v8i1 predicate, but during codegen the
v8i32 will be split into two v4i32, potentially requiring two v4i1
predicates to be merged into a single v8i1. Because this merging of two
v4i1's into a v8i1 is very expensive, we need to make the cost of the
compare equally high.
This patch adds the cost of that to ARMTTIImpl::getCmpSelInstrCost.
Because we don't know whether the user of the predicate can be split,
and the cost model is mostly pre-instruction, we may be pessimistic but
that should only be for larger and legal types. This also adds min/max
detection to the costmodel where it can be detected, to keep those in
line with the cost of simple min/max instructions. Otherwise for the
most part, costs that were already expensive have become more expensive.
Differential Revision: https://reviews.llvm.org/D96692
This adds basic MVE costs for SMIN/SMAX/UMIN/UMAX, as well as MINNUM and
MAXNUM representing fmin and fmax. It tightens up the costs, not using a
ICmp+Select cost.
Differential Revision: https://reviews.llvm.org/D96603
This patch uses the function getShuffleCost with SK_Reverse to compute the cost
for experimental.vector.reverse.
For scalable vector type, it adds a table will the legal types on
AArch64TTIImpl::getShuffleCost to not assert in BasicTTIImpl::getShuffleCost,
and for fixed vector, it relies on the existing cost model in BasicTTIImpl.
Depends on D94883
Differential Revision: https://reviews.llvm.org/D95603
This fixes an overly restrictive assumption that the vector is a FixedVectorType,
in code that tries to calculate the cost of a cast operation when splitting
a too-wide vector. The algorithm works the same for scalable vectors, so this
patch removes the cast<FixedVectorType>.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D96253
COST(zext (<4 x i32> load(...) to <4 x i64>)) != 0 when
<4 x i64> is an illegal result type that requires splitting
of the operation.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D96250
This adds the CostKind to getMVEVectorCostFactor, so that it can
automatically account for CodeSize costs, where it returns a cost of 1
not the MVEFactor used for Throughput/Latency. This helps simplify the
caller code and allows us to get the codesize cost more correct in more
cases.
This patch adds a cost model for SK_Broadcast in
AArch64TTIImpl::getShuffleCost with scalable vector.
Without this patch, the scalable vector type relies on BasicTTIImpl cost
implementation and assert.
Differential Revision: https://reviews.llvm.org/D95598
This adds sadd.sat, uadd.sat, ssub.sat and usub.sat costs for AArch64,
similar to how they were recently added for ARM.
Differential Revision: https://reviews.llvm.org/D95292
I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost
that prevented a cost being calculated for select instructions when using
scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost
to only do special cost calculations for fixed width vectors and fall
back to the base version for scalable vectors.
I have added a simple cost model test for cmps and selects:
test/Analysis/CostModel/sve-cmpsel.ll
and some simple tests that show we vectorize loops with cmp and select:
test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
Differential Revision: https://reviews.llvm.org/D95039
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.
Differential Revision: https://reviews.llvm.org/D94946
This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs,
based on when the instruction is legal. With smaller than legal types
that are promoted we generate shr(qadd(shl, shl)), so the cost is 4
appropriately.
Differential Revision: https://reviews.llvm.org/D94958
This patch computes the cost for vector.reduce<operand> for scalable vectors.
The cost is split into two parts: the legalization cost and the horizontal
reduction.
Differential Revision: https://reviews.llvm.org/D93639
We did not have specific costs for larger than legal truncates that were
not otherwise cheap (where they were next to stores, for example). As
MVE does not have a dedicated instruction for them (and we do not use
loads/stores yet), they should be expensive as they get expanded to a
series of lane moves.
Differential Revision: https://reviews.llvm.org/D94260
This patch fixes a bug introduced in the patch:
https://reviews.llvm.org/D93030
This patch pulls the test for scalable vector to be the first instruction
to be checked. This avoids the Gather and Scatter cost model for AArch64 to
compute the number of vector elements for something that is not a vector and
therefore crashing.
A new TTI interface has been added 'Optional <unsigned>getMaxVScale' that
returns the maximum vscale for a given target.
When known getMaxVScale is used to compute the cost of masked gather scatter
for scalable vector.
Depends on D92094
Differential Revision: https://reviews.llvm.org/D93030
Adds cost model support for the new llvm.experimental.vector.{extract,insert}
intrinsics, using the existing getExtractSubvectorOverhead and
getInsertSubvectorOverhead functions for shuffles.
Previously this case would throw an assertion.
Differential Revision: https://reviews.llvm.org/D93043
This patch replaces FixedVectorType by VectorType in getIntrinsicInstrCost
in BasicTTIImpl.h. It re-arranges the scalable type test earlier return
and add tests for scalable types.
Depends on D91532
Differential Revision: https://reviews.llvm.org/D92094
This adds some basic MVE masked load/store costs, notably changing the
cost of legal loads/stores to the MVECostFactor and the cost of
scalarized instructions to 8*NumElts.
Differential Revision: https://reviews.llvm.org/D86538
Noticed while looking at D92701 - we only really handle TCK_RecipThroughput gather/scatter costs - for now drop back to the default implementation for non-legal gathers/scatters.
Without FMF, we lower these intrinsics into something like this:
vmaxsd %xmm0, %xmm1, %xmm2
vcmpunordsd %xmm0, %xmm0, %xmm0
vblendvpd %xmm0, %xmm1, %xmm2, %xmm0
But if we can ignore NANs, the single min/max instruction is enough
because there is no need to fix up the x86 logic that corresponds to
X > Y ? X : Y.
We probably want to make other adjustments for FP intrinsics with FMF
to account for specialized codegen (for example, FSQRT).
Differential Revision: https://reviews.llvm.org/D92337
This was modeled to have a cost of 1, but since we do not have a MUL.2d this is
scalarized into vector inserts/extracts and scalar muls.
Motivating precommitted test is test/Transforms/SLPVectorizer/AArch64/mul.ll,
which we don't want to SLP vectorize.
Test Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
unfortunately needed changing, but the reason is documented in
LoopVectorize.cpp:6855:
// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.
which I will address next as a follow up of this.
Differential Revision: https://reviews.llvm.org/D92208
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.
Allows us to move the x86-specific lowering to the generic expansion code.
Differential Revision: https://reviews.llvm.org/D92183
The cost-model is not getting the cost right for a mul with <2 x i64>
operands, i.e. we don't have a MUL.2d, and this is precommitting some
tests before adjusting this.
Add a basic implementation of getGatherScatterOpCost to BasicTTIImpl.
The implementation estimates the cost of scalarizing the loads/stores,
the cost of packing/extracting the individual lanes and the cost of
only selecting enabled lanes.
This more accurately reflects the current cost on targets like AArch64.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91984
Update costs now that D92095 and D92102 have tweaked the SSE2 implementation
The SSE42 BLENDVPD cost can actually be used on SSE41 as we don't attempt to generate PCMPGT anymore
Add scalar i16/i32/i64 costs as we can do this cheaply with CMOV
This might be a regression for some ARM targets, but that should
be changed in the target-specific overrides.
There is apparently still no default lowering for these nodes,
so I am assuming these intrinsics are not in common use.
X86, PowerPC, and RISC-V for example, just crash given the most
basic IR.
This is re-applying a combination of f7eac51b9b3f and 8ec7ea3ddce7 as one patch
to avoid regressions now that we have better testing in place.
Those were reverted with 32dd5870ee31 because of crashing in experimental intrinsics.
That bug should be fixed with 7ae346434.
Paraphrased original commit messages:
This is the last step in removing cost-kind as a consideration in the
basic class model for intrinsics.
See D89461 for the start of that.
Subsequent commits dealt with each of the special-case intrinsics that
had customization here in the basic class. This should remove a barrier
to retrying D87188 (canonicalization to the abs intrinsic).
The ARM and x86 cost diffs seen here may be wrong because the
target-specific overrides have their own bugs, but we hope this is
less wrong - if something has a significant throughput cost, then it
should have a significant size / blended cost too by default.
The only behavioral diff in current regression tests is shown in the
x86 scatter-gather test (which is misplaced or broken because it runs
the entire -O3 pipeline) - we unrolled less, and we assume that is
a improvement.
Exception: in general, we want the *size* cost for a scalar call to be
cheap even if the other costs are expensive - we expect it to just be
a branch with some optional stack manipulation.
It is likely that we will want to carve out some
exceptions/overrides to this rule as follow-up patches for
calls that have some general and/or target-specific difference
to the expected lowering.
This was noticed as a regression in unrolling, so we have a test
for that now along with a couple of direct cost model tests.
If the assumed scalarization costs for the oversized vector
calls are not realistic, that would be another follow-up
refinement of the cost models.
Differential Revision: https://reviews.llvm.org/D90554