1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-11-22 18:54:02 +01:00
Commit Graph

857 Commits

Author SHA1 Message Date
David Green
0f86a54b34 [AArch64] Update and expand min-max cost model test. NFC
This expands the cost model test for min/max to many more types,
including floating point minnum/maxnum and minimum/maximum, and FP16
with and without fullfp16.  The old llc run lines are removed, as those
are better tested by CodeGen tests.
2021-07-27 18:48:58 +01:00
Simon Pilgrim
e67bfc0f97 [Analysis] Fix getOrderedReductionCost to call target's getArithmeticInstrCost implementation
The getOrderedReductionCost implementation introduced in D105432 calls the CRTP base version getArithmeticInstrCost instead of the redirecting to the target version.

Differential Revision: https://reviews.llvm.org/D106795
2021-07-26 17:15:43 +01:00
David Sherwood
2d2e4a1b17 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
Sander de Smalen
1f523effc3 [BasicTTI] Set scalarization cost of scalable vector casts to Invalid.
When BasicTTIImpl::getCastInstrCost can't determine the cost of a
vector cast operation when the types need legalization, it falls
back to calculating scalarization costs. Instead of crashing on
`cast<FixedVectorType>(DstVTy)` when the type is a scalable vector,
return an Invalid cost.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106655
2021-07-24 14:13:21 +01:00
David Green
f41dff2733 [AArch64] Add worst case shuffle costs
This adds some missing single source shuffle costs for AArch64, of i16
and i8 vectors. v4i16 are the same as v4i32 with a worse case cost of 3
coming from the perfect shuffle tables. The larger vector sizes expand
into a constant pool, plus a load (and adrp) and a tbl. I arbitrarily
chose 8 for the cost to be expensive but not too expensive.

Differential Revision: https://reviews.llvm.org/D106241
2021-07-23 09:01:58 +01:00
Simon Pilgrim
c4477aac03 [CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.
2021-07-22 20:07:32 +01:00
Simon Pilgrim
b82c6c9bf3 [CostModel][X86] Fix funnel shift check prefixes
We'd lost AVX1 test coverage due to bulldozer (XOP) trying to use the same check prefixes - we really need to fix the update script to avoid this!
2021-07-22 20:07:31 +01:00
David Green
f13ef26613 [AArch64] Adjust the cost of integer sum reductions
This changes the cost to (LT.first-1) * cost(add) + 2, where the cost of
an add is assumed to be 1. This brings it inline with the other
reductions.

Differential Revision: https://reviews.llvm.org/D106240
2021-07-22 18:19:54 +01:00
Simon Pilgrim
8371f55768 [CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.
2021-07-22 18:12:49 +01:00
David Green
3712bffb3b [AArch64] Add and update reduction and shuffle costs. NFC 2021-07-22 10:22:42 +01:00
Simon Pilgrim
6e7877566a [CostModel][X86] Add fast math tests for float reductions
As noticed on D105432 we didn't have any coverage to distinguish between fast/exact float reductions
2021-07-19 13:01:28 +01:00
Sander de Smalen
d616b58c4d [CostModel][AArch64] Make loads/stores of <vscale x 1 x eltty> invalid.
At the moment, <vscale x 1 x eltty> are not yet fully handled by the
code-generator, so to avoid vectorizing loops with that VF, we mark the
cost for these types as invalid.
The reason for not adding a new "TTI::getMinimumScalableVF" is because
the type is supposed to be a type that can be legalized. It partially is,
although the support for these types need some more work.

Reviewed By: paulwalker-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D103882
2021-07-14 16:44:22 +01:00
Simon Pilgrim
32c30ddee7 [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
Simon Pilgrim
436c202090 [CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 20:38:25 +01:00
Simon Pilgrim
063de0d715 [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports.
Update truncation costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 13:50:43 +01:00
David Green
fd61052e59 [TTI] Remove IsPairwiseForm from getArithmeticReductionCost
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.

This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.

Differential Revision: https://reviews.llvm.org/D105484
2021-07-09 11:51:16 +01:00
Simon Pilgrim
5d69b0490b [CostModel][X86] Account for older SSE targets with slow fp->int conversions
Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets
2021-07-08 18:08:24 +01:00
Sander de Smalen
3bbfdfb241 [CostModel] Express cost(urem) as cost(div+mul+sub) when set to Expand.
The Legalizer expands the operations of urem/srem into a div+mul+sub or divrem
when those are legal/custom. This patch changes the cost-model to reflect that
cost.

Since there is no 'divrem' Instruction in LLVM IR, the cost of divrem
is assumed to be the same as div+mul+sub since the three operations will
need to be executed at runtime regardless.

Patch co-authored by David Sherwood (@david-arm)

Reviewed By: RKSimon, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D103799
2021-07-07 14:40:28 +01:00
Simon Pilgrim
8dde5bc762 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim
929bb9374e [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 12:03:45 +01:00
Simon Pilgrim
b8bec02bc7 [CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32
Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
2021-07-06 17:28:03 +01:00
Simon Pilgrim
1787dc0460 [CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.

We get the extension for free for non-vector loads.
2021-07-06 13:58:52 +01:00
Caroline Concatto
1631d2fbaa [AArch64][CostModel] Add cost model for experimental.vector.splice
This patch adds a new  ShuffleKind SK_Splice and then handle the cost in
getShuffleCost, as in experimental.vector.reverse.

Differential Revision: https://reviews.llvm.org/D104630
2021-07-05 14:30:24 +01:00
Simon Pilgrim
8ed3c99d18 [CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack
Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.
2021-07-05 13:26:53 +01:00
Simon Pilgrim
4d75d8ef44 [CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner).
Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64
2021-07-05 13:26:53 +01:00
Sjoerd Meijer
cda6f69edf [AArch64] Cost-model i8 vector loads/stores
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.

Differential Revision: https://reviews.llvm.org/D103629
2021-07-05 11:25:10 +01:00
Simon Pilgrim
5e6cee0948 [CostModel][X86] Drop some hard coded fp<->int scalarization costs
Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worse case numbers from the script in D103695
2021-07-02 14:29:32 +01:00
Simon Pilgrim
1a043ee5ed [CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match
Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.
2021-07-02 13:49:31 +01:00
Simon Pilgrim
c54e15d18e [CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports.
Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695.

Fixes a few regressions before we start adding AVX costs for legalized types.
2021-07-02 13:09:00 +01:00
Florian Hahn
6aab8b237f [AArch64] Use custom lowering for fp16 vector copysign.
The custom copysign lowering already supports fp16. Use it.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D105277
2021-07-02 11:15:30 +01:00
Simon Pilgrim
f0a2f7f0f9 [CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports.
Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695.

To account for different numbers of src/dst legalized type registers we must scale the cost by maximum of the src/dst, not just use src
2021-07-01 15:34:20 +01:00
Simon Pilgrim
b53b3c8d1c [CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types.
Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides.

The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.
2021-07-01 12:06:40 +01:00
Simon Pilgrim
3ab40c9e92 [CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports
Based off the worse case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput).

The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables.

Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).
2021-06-30 15:23:34 +01:00
alex-t
a236991319 [AMDGPU] PHI node cost should not be counted for the size and latency.
Details: https://reviews.llvm.org/D96805 changed the GCNTTIImpl::getCFInstrCost to return 1 for the PHI nodes
  for the TTI::TCK_CodeSize and TTI::TCK_SizeAndLatency. This is incorrect because the value moves that are the
  result of the PHI lowering are inserted into the basic block predecessors - not into the block itself.
  As a result of this change LoopRotate and LoopUnroll were broken because of the incorrect Loop header and loop
  body size/cost estimation.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105104
2021-06-30 16:11:17 +03:00
Rosie Sumpter
de55f5a044 [CostModel][AArch64] Improve cost model for vector reduction intrinsics
OR, XOR and AND entries are added to the cost table. An extra cost
is added when vector splitting occurs.

This is done to address the issue of a missed SLP vectorization
opportunity due to unreasonably high costs being attributed to the vector
Or reduction (see: https://bugs.llvm.org/show_bug.cgi?id=44593).

Differential Revision: https://reviews.llvm.org/D104538
2021-06-24 12:02:58 +01:00
Bjorn Pettersson
29ffba4b56 Update @llvm.powi to handle different int sizes for the exponent
This can be seen as a follow up to commit 0ee439b705e82a4fe20e2,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seem to have been missing in that patch was to make
sure that the rest of LLVM also handle that the argument now depends
on the size of int (not using the si_int machine mode for 32-bit).
When using __builtin_powi for a target with 16-bit int clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi), the backend would always prepare the exponent
argument as an i32 which caused miscompiles when the rtlib was
compiled with 16-bit int.

The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.

One thing that needed some extra attention was that when vectorizing
calls several passes did not support that several arguments could
be overloaded in the intrinsics. This patch allows overload of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.

Differential Revision: https://reviews.llvm.org/D99439
2021-06-17 09:38:28 +02:00
Rosie Sumpter
b4082e9aee [CostModel][AArch64] Improve the cost estimate of CTPOP intrinsic
Added a case for CTPOP to AArch64TTIImpl::getIntrinsicInstrCost so that
the cost estimate matches the codegen in
test/CodeGen/AArch64/arm64-vpopcnt.ll

Differential Revision: https://reviews.llvm.org/D103952
2021-06-11 11:15:46 +01:00
Irina Dobrescu
e44ee5c21e [AArch64] Add cost tests for bitreverse
This patch includes cost tests for bit reverse as well as some adjustments to the cost model.

Differential Revision: https://reviews.llvm.org/D102755
2021-06-10 14:51:33 +01:00
Kerry McLaughlin
d259f6577a [CostModel] Return an invalid cost for memory ops with unsupported types
Fixes getTypeConversion to return `TypeScalarizeScalableVector` when a scalable vector
type cannot be legalized by widening/splitting. When this is the method of legalization
found, getTypeLegalizationCost will return an Invalid cost.

The getMemoryOpCost, getMaskedMemoryOpCost & getGatherScatterOpCost functions already call
getTypeLegalizationCost and will now also return an Invalid cost for unsupported types.

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D102515
2021-06-08 12:07:36 +01:00
Simon Pilgrim
b21f333bb4 [CostModel][X86] Improve AVX1/AVX2 truncation costs
Based off the worse case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:

AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8

Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.
2021-06-08 10:41:03 +01:00
Sander de Smalen
c2092c7f0d [CostModel][AArch64] NFC: Simplify some cost model tests for SVE.
* Merged some functions into a single function, to make the costs more obvious.
* Moved scalable-mem-op-cost-model.ll -> sve-ldst.ll to be more consistent with other filenames.
2021-06-07 17:26:23 +01:00
Sander de Smalen
052444882c [CostModel] Return Invalid cost in getArithmeticCost instead of crashing for scalable vectors.
This fixes an issue in BasicTTIImpl.h where it tries to do a
cast<FixedVectorType> on a scalable vector type in order to get the
scalarization cost. Because scalarization of scalable vectors is not
supported, we return Invalid instead.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103798
2021-06-07 17:26:23 +01:00
Simon Pilgrim
779c968967 [CostModel][X86] Add 512-bit bswap costs 2021-06-06 22:36:34 +01:00
Simon Pilgrim
3d3fded81f [CostModel][X86] Add 512-bit bswap cost tests 2021-06-06 22:36:34 +01:00
Simon Pilgrim
3d02a9f278 [CostModel][X86] Improve AVX512 FDIV costs
Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model
2021-06-06 21:41:05 +01:00
Rosie Sumpter
aa63068e95 [CostModel][AArch64] Add tests for ctlz, ctpop and cttz. NFC.
Differential Revision: https://reviews.llvm.org/D103601
2021-06-03 17:12:22 +01:00
Irina Dobrescu
fa93338923 [AArch64][NFC] Fix failing cost-model test 2021-06-02 15:00:19 +01:00
Fraser Cormack
be68e4ef95 [RISCV] Expand unaligned fixed-length vector memory accesses
RVV vectors must be aligned to their element types, so anything less is
unaligned.

For regular loads and stores, our custom-lowering of fixed-length
vectors meant that we opted out of LegalizeDAG's built-in unaligned
expansion. This patch adds that logic in to our custom lower function.

For masked intrinsics, we declare that anything unaligned is not legal,
leaving the ScalarizeMaskedMemIntrin pass to do the expansion for us.

Note that neither of these methods can handle the expansion of
scalable-vector memory ops, so those cases are left alone by this patch.
Scalable loads and stores already go through expansion by default but
hit an assertion, and scalable masked intrinsics will silently generate
incorrect code. It may be prudent to return an error in both of these
cases.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D102493
2021-06-02 09:27:44 +01:00
Simon Pilgrim
0c78aae183 [CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets
Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets are cheaper than the #ops that the cost was previously based on.
2021-05-27 18:17:50 +01:00
Simon Pilgrim
154058d3b3 [CostModel][X86] AVX512 truncation ops are slower than cost models indicate.
The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions.

Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.
2021-05-27 16:07:42 +01:00