1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-01-31 20:51:52 +01:00

840 Commits

Author SHA1 Message Date
Sander de Smalen
3bbfdfb241 [CostModel] Express cost(urem) as cost(div+mul+sub) when set to Expand.
The Legalizer expands the operations of urem/srem into a div+mul+sub or divrem
when those are legal/custom. This patch changes the cost-model to reflect that
cost.

Since there is no 'divrem' Instruction in LLVM IR, the cost of divrem
is assumed to be the same as div+mul+sub since the three operations will
need to be executed at runtime regardless.

Patch co-authored by David Sherwood (@david-arm)

Reviewed By: RKSimon, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D103799
2021-07-07 14:40:28 +01:00
Simon Pilgrim
8dde5bc762 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim
929bb9374e [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 12:03:45 +01:00
Simon Pilgrim
b8bec02bc7 [CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32
Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
2021-07-06 17:28:03 +01:00
Simon Pilgrim
1787dc0460 [CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.

We get the extension for free for non-vector loads.
2021-07-06 13:58:52 +01:00
Caroline Concatto
1631d2fbaa [AArch64][CostModel] Add cost model for experimental.vector.splice
This patch adds a new  ShuffleKind SK_Splice and then handle the cost in
getShuffleCost, as in experimental.vector.reverse.

Differential Revision: https://reviews.llvm.org/D104630
2021-07-05 14:30:24 +01:00
Simon Pilgrim
8ed3c99d18 [CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack
Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.
2021-07-05 13:26:53 +01:00
Simon Pilgrim
4d75d8ef44 [CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner).
Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64
2021-07-05 13:26:53 +01:00
Sjoerd Meijer
cda6f69edf [AArch64] Cost-model i8 vector loads/stores
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.

Differential Revision: https://reviews.llvm.org/D103629
2021-07-05 11:25:10 +01:00
Simon Pilgrim
5e6cee0948 [CostModel][X86] Drop some hard coded fp<->int scalarization costs
Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worse case numbers from the script in D103695
2021-07-02 14:29:32 +01:00
Simon Pilgrim
1a043ee5ed [CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match
Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.
2021-07-02 13:49:31 +01:00
Simon Pilgrim
c54e15d18e [CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports.
Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695.

Fixes a few regressions before we start adding AVX costs for legalized types.
2021-07-02 13:09:00 +01:00
Florian Hahn
6aab8b237f [AArch64] Use custom lowering for fp16 vector copysign.
The custom copysign lowering already supports fp16. Use it.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D105277
2021-07-02 11:15:30 +01:00
Simon Pilgrim
f0a2f7f0f9 [CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports.
Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695.

To account for different numbers of src/dst legalized type registers we must scale the cost by maximum of the src/dst, not just use src
2021-07-01 15:34:20 +01:00
Simon Pilgrim
b53b3c8d1c [CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types.
Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides.

The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.
2021-07-01 12:06:40 +01:00
Simon Pilgrim
3ab40c9e92 [CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports
Based off the worse case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput).

The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables.

Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).
2021-06-30 15:23:34 +01:00
alex-t
a236991319 [AMDGPU] PHI node cost should not be counted for the size and latency.
Details: https://reviews.llvm.org/D96805 changed the GCNTTIImpl::getCFInstrCost to return 1 for the PHI nodes
  for the TTI::TCK_CodeSize and TTI::TCK_SizeAndLatency. This is incorrect because the value moves that are the
  result of the PHI lowering are inserted into the basic block predecessors - not into the block itself.
  As a result of this change LoopRotate and LoopUnroll were broken because of the incorrect Loop header and loop
  body size/cost estimation.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105104
2021-06-30 16:11:17 +03:00
Rosie Sumpter
de55f5a044 [CostModel][AArch64] Improve cost model for vector reduction intrinsics
OR, XOR and AND entries are added to the cost table. An extra cost
is added when vector splitting occurs.

This is done to address the issue of a missed SLP vectorization
opportunity due to unreasonably high costs being attributed to the vector
Or reduction (see: https://bugs.llvm.org/show_bug.cgi?id=44593).

Differential Revision: https://reviews.llvm.org/D104538
2021-06-24 12:02:58 +01:00
Bjorn Pettersson
29ffba4b56 Update @llvm.powi to handle different int sizes for the exponent
This can be seen as a follow up to commit 0ee439b705e82a4fe20e2,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seem to have been missing in that patch was to make
sure that the rest of LLVM also handle that the argument now depends
on the size of int (not using the si_int machine mode for 32-bit).
When using __builtin_powi for a target with 16-bit int clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi), the backend would always prepare the exponent
argument as an i32 which caused miscompiles when the rtlib was
compiled with 16-bit int.

The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.

One thing that needed some extra attention was that when vectorizing
calls several passes did not support that several arguments could
be overloaded in the intrinsics. This patch allows overload of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.

Differential Revision: https://reviews.llvm.org/D99439
2021-06-17 09:38:28 +02:00
Rosie Sumpter
b4082e9aee [CostModel][AArch64] Improve the cost estimate of CTPOP intrinsic
Added a case for CTPOP to AArch64TTIImpl::getIntrinsicInstrCost so that
the cost estimate matches the codegen in
test/CodeGen/AArch64/arm64-vpopcnt.ll

Differential Revision: https://reviews.llvm.org/D103952
2021-06-11 11:15:46 +01:00
Irina Dobrescu
e44ee5c21e [AArch64] Add cost tests for bitreverse
This patch includes cost tests for bit reverse as well as some adjustments to the cost model.

Differential Revision: https://reviews.llvm.org/D102755
2021-06-10 14:51:33 +01:00
Kerry McLaughlin
d259f6577a [CostModel] Return an invalid cost for memory ops with unsupported types
Fixes getTypeConversion to return `TypeScalarizeScalableVector` when a scalable vector
type cannot be legalized by widening/splitting. When this is the method of legalization
found, getTypeLegalizationCost will return an Invalid cost.

The getMemoryOpCost, getMaskedMemoryOpCost & getGatherScatterOpCost functions already call
getTypeLegalizationCost and will now also return an Invalid cost for unsupported types.

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D102515
2021-06-08 12:07:36 +01:00
Simon Pilgrim
b21f333bb4 [CostModel][X86] Improve AVX1/AVX2 truncation costs
Based off the worse case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:

AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8

Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.
2021-06-08 10:41:03 +01:00
Sander de Smalen
c2092c7f0d [CostModel][AArch64] NFC: Simplify some cost model tests for SVE.
* Merged some functions into a single function, to make the costs more obvious.
* Moved scalable-mem-op-cost-model.ll -> sve-ldst.ll to be more consistent with other filenames.
2021-06-07 17:26:23 +01:00
Sander de Smalen
052444882c [CostModel] Return Invalid cost in getArithmeticCost instead of crashing for scalable vectors.
This fixes an issue in BasicTTIImpl.h where it tries to do a
cast<FixedVectorType> on a scalable vector type in order to get the
scalarization cost. Because scalarization of scalable vectors is not
supported, we return Invalid instead.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103798
2021-06-07 17:26:23 +01:00
Simon Pilgrim
779c968967 [CostModel][X86] Add 512-bit bswap costs 2021-06-06 22:36:34 +01:00
Simon Pilgrim
3d3fded81f [CostModel][X86] Add 512-bit bswap cost tests 2021-06-06 22:36:34 +01:00
Simon Pilgrim
3d02a9f278 [CostModel][X86] Improve AVX512 FDIV costs
Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model
2021-06-06 21:41:05 +01:00
Rosie Sumpter
aa63068e95 [CostModel][AArch64] Add tests for ctlz, ctpop and cttz. NFC.
Differential Revision: https://reviews.llvm.org/D103601
2021-06-03 17:12:22 +01:00
Irina Dobrescu
fa93338923 [AArch64][NFC] Fix failing cost-model test 2021-06-02 15:00:19 +01:00
Fraser Cormack
be68e4ef95 [RISCV] Expand unaligned fixed-length vector memory accesses
RVV vectors must be aligned to their element types, so anything less is
unaligned.

For regular loads and stores, our custom-lowering of fixed-length
vectors meant that we opted out of LegalizeDAG's built-in unaligned
expansion. This patch adds that logic in to our custom lower function.

For masked intrinsics, we declare that anything unaligned is not legal,
leaving the ScalarizeMaskedMemIntrin pass to do the expansion for us.

Note that neither of these methods can handle the expansion of
scalable-vector memory ops, so those cases are left alone by this patch.
Scalable loads and stores already go through expansion by default but
hit an assertion, and scalable masked intrinsics will silently generate
incorrect code. It may be prudent to return an error in both of these
cases.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D102493
2021-06-02 09:27:44 +01:00
Simon Pilgrim
0c78aae183 [CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets
Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets are cheaper than the #ops that the cost was previously based on.
2021-05-27 18:17:50 +01:00
Simon Pilgrim
154058d3b3 [CostModel][X86] AVX512 truncation ops are slower than cost models indicate.
The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions.

Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.
2021-05-27 16:07:42 +01:00
Sjoerd Meijer
dd5fcc39b3 [CostModel][AArch64] Add floating point arithmetic tests. NFC. 2021-05-26 20:26:20 +01:00
Roman Lebedev
aded4a78a5 [NFC][X86][Costmodel] Add some more interleaved load/store test with i16 element type
Not sure if even larger interleaving factors are needed,
but these are what i have seen being queried in the wild.
2021-05-26 21:55:37 +03:00
Sjoerd Meijer
4ed1b287cc [CostModel][AArch64] Add tests for bitreverse. NFC. 2021-05-26 14:56:58 +01:00
Simon Pilgrim
215df0a901 [CostModel][X86] Remove old testshift* tests
The vector shift cost tests are better covered (more cpu/sse levels) by the vshift-*-*cost files, and we're trying to avoid codegen tests in here as it makes it harder to maintain the test files.
2021-05-26 10:31:00 +01:00
Simon Pilgrim
c8ca7d77d1 [CostModel][X86] Improve accuracy of 256-bit non-uniform vector shifts on AVX1
Determined from llvm-mca analysis, AVX1 capable targets have a higher throughput for VPBLENDVB and shuffle ops, making it cheaper to perform shift+shuffle/select shift patterns.
2021-05-25 17:31:45 +01:00
Simon Pilgrim
414d8672a3 [CostModel][X86] Improve accuracy of vXi64 vector non-uniform shift costs on AVX2+ targets
rG1ad4f887bd7692a9e63fb42586f0ece366f2fe01 incorrectly assumed that vXi64 non-uniform shifts were slow like vXi32 were - but llvm-mca (+Agner) both confirm that Haswell/Broadwell are full rate.
2021-05-25 15:58:23 +01:00
Simon Pilgrim
491a1dbd4b [CostModel][X86] Improve accuracy of vXi8/vXi16 vector non-uniform shift costs on AVX2/AVX512 targets
Determined from llvm-mca analysis, AVX2+ capable targets have a higher throughput for VPBLENDVB and VPMOVZX ops, making it cheaper to perform shift+select patterns for vXi8 shifts or extend/shift/truncate for vXi16 shifts. Similarly AVX512BW can perform vXi8 as extend/shift/truncate patterns.
2021-05-25 11:35:57 +01:00
serge-sans-paille
73bc91a5e6 Revert "[NFC] remove explicit default value for strboolattr attribute in tests"
This reverts commit bda6e5bee04c75b1f1332b4fd1ac4e8ef6c3c247.

See https://lab.llvm.org/buildbot/#/builders/109/builds/15424 for instance
2021-05-24 19:43:40 +02:00
serge-sans-paille
1f63b26006 [NFC] remove explicit default value for strboolattr attribute in tests
Since d6de1e1a71406c75a4ea4d5a2fe84289f07ea3a1, no attributes is quivalent to
setting attribute to false.

This is a preliminary commit for https://reviews.llvm.org/D99080
2021-05-24 19:31:04 +02:00
Roman Lebedev
7cf7fe8908 [X86][Costmodel] getMaskedMemoryOpCost(): don't scalarize non-power-of-two vectors with legal element type
This follows in steps of similar `getMemoryOpCost()` changes, D100099/D100684.

Intel SDM, `VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores`:
```
Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to
referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no
faults will be detected if the mask bits are all zero.
```
I.e., if mask is all-zeros, any address is fine.

Masked load/store's prime use-case is e.g. tail masking the loop remainder,
where for the last iteration, only first some few elements of a vector exist.

So much similarly, i don't see why must we scalarize non-power-of-two vectors,
iff the element type is something we can masked- store/load.
We simply need to legalize it, widen the mask, and be done with it.
And we even already count the cost of widening the mask.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D102990
2021-05-24 20:09:54 +03:00
Simon Pilgrim
f196122112 [CostModel][X86] Add missing SSE41 v2iX sext/zext costs
Also fix existing v4i8->v4i16 sext cost to match the equivalents
2021-05-24 15:53:43 +01:00
Simon Pilgrim
2618dfbd6b [CostModel][X86] Regenerate sse-itoi.ll test checks 2021-05-24 15:41:01 +01:00
Simon Pilgrim
6129236a8f [CostModel][X86] Improve accuracy of vector non-uniform shift costs on XOP/AVX2 targets
By llvm-mca analysis, Haswell/Broadwell has a non-uniform vector shift recip-throughput cost of the AVX2 targets at 2 for both 128 and 256-bit vectors - XOP capable targets have better 128-bit vector shifts so improve the fallback in those cases.
2021-05-24 14:18:21 +01:00
Simon Pilgrim
4fa97550f8 [CostModel][X86] Improve accuracy of vXi64 MUL costs on AVX2/AVX512 targets
By llvm-mca analysis, Haswell/Broadwell has the worst v4i64 recip-throughput cost of the AVX2 targets at 6 (vs the currently used cost of 8). Similarly SkylakeServer (our only AVX512 target model) implements PMULLQ with an average cost of 1.5 (rounded up to 2.0), and the PMULUDQ-sequence (without AVX512DQ) as a cost of 6.
2021-05-24 09:48:32 +01:00
Roman Lebedev
9f647330f9 [NFC][X86][Costmodel] Add tests with with masked loads/stores w/non-power-of-two vectors 2021-05-23 21:45:36 +03:00
Simon Pilgrim
f8e2239648 [CostModel][X86] Align v2i64 MUL costs on SSE42+ targets with worst case
Based on worst case of sandybridge (which seems to match nehalem for this SSE sequence) (vs btver2 + bdver2) llvm-mca analysis
2021-05-23 16:20:57 +01:00
Simon Pilgrim
f6dc684c0c [CostModel][X86] Align v4i64 MUL costs on AVX1 targets with worst case
Based on worst case of sandybridge (vs btver2 + bdver2) llvm-mca analysis - which is a lot less than what we were predicting (I think based off total uop count).
2021-05-22 20:07:55 +01:00