1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-02-01 05:01:59 +01:00

672 Commits

Author SHA1 Message Date
Simon Pilgrim
bb7e18f9de [CostModel][X86] Remove unused check-prefixes 2020-11-10 12:48:35 +00:00
Sanjay Patel
a8868babff [ARM] remove cost-kind predicate for cmp/sel costs
This is the cmp/sel sibling to D90692.
Again, the reasoning is: the throughput cost is number of instructions/uops,
so size/blended costs are identical except in special cases (for example,
fdiv or other known-expensive machine instructions or things like MVE that
may require cracking into >1 uops).

We need to check for a valid (non-null) condition type parameter because
SimplifyCFG may pass nullptr for that (and so we will crash multiple
regression tests without that check). I'm not sure if passing nullptr makes
sense, but other code in the cost model does appear to check if that param
is set or not.

Differential Revision: https://reviews.llvm.org/D90781
2020-11-05 14:52:25 -05:00
Sanjay Patel
4b28ca8f6f [ARM] remove cost-kind predicate for most math op costs
This is based on the same idea that I am using for the basic model implementation
and what I have partly already done for x86: throughput cost is number of
instructions/uops, so size/blended costs are identical except in special cases
(for example, fdiv or other known-expensive machine instructions or things like
MVE that may require cracking into >1 uop)).

Differential Revision: https://reviews.llvm.org/D90692
2020-11-03 17:23:46 -05:00
Sanjay Patel
81f8bd9111 [CostModel] fix cost calc bug for sadd/ssub with overflow
As noted in D90554, there's an opcode typo in using an easily
misused cost model API: getCmpSelInstrCost(). Beyond that, the
assumed sequence of ops is questionable, but that would be
another patch.

My guess is that the x86 test diffs show that we are probably
wrong both before and after this change, so there will be no
practical difference.
As an example, I tried this test which shows a cost of '7'
either way:

  define <4 x i32> @sadd(<4 x i32> %va, <4 x i32> %vb) {
    %V4I32  = call {<4 x i32>, <4 x i1>}  @llvm.sadd.with.overflow.v4i32(<4 x i32> %va, <4 x i32> %vb)
    %ov = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 1
    %r = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 0
    %z = select <4 x i1> %ov, <4 x i32> <i32 42, i32 42, i32 42, i32 42>, <4 x i32> %r
    ret <4 x i32> %z
  }

  $ llc -o - sadd.ll -mattr=avx
        vpaddd  %xmm1, %xmm0, %xmm2
        vpcmpgtd        %xmm2, %xmm0, %xmm0
        vpxor   %xmm0, %xmm1, %xmm0
        vblendvps       %xmm0, LCPI0_0(%rip), %xmm2, %xmm0a

Differential Revision: https://reviews.llvm.org/D90681
2020-11-03 11:03:47 -05:00
David Green
41688b499e [CostModel] Make target intrinsics cheap by default
This patch changes the intrinsics cost model to assume that by default
target intrinsics are cheap. This didn't seem to be the case for all
intrinsics, and is potentially an MVE problem due to our scalarization
overheads. Cheap seems to be a good default in general though.

Differential Revision: https://reviews.llvm.org/D90597
2020-11-03 09:58:28 +00:00
David Green
7290582ceb [ARM] Cost model test for target intrinsics. NFC 2020-11-02 17:46:48 +00:00
Sanjay Patel
389858bbc4 [x86] add AVX2 cost model entries for maxnum of 256-bit vectors
As noticed in D90554 ,
the AVX2 costs for 256-bit vectors did not include FMAXNUM entries,
so we fell back to AVX1 which assumes those ops will be split into
128-bit halves or something close to that.

Differential Revision: https://reviews.llvm.org/D90613
2020-11-02 12:20:17 -05:00
Florian Hahn
1db8566f5e Reland "[TTI] Add VecPred argument to getCmpSelInstrCost."
This reverts the revert commit 408c4408facc3a79ee4ff7e9983cc972f797e176.

This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.

Original message:

On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.

This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.

This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.

I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
2020-11-02 15:39:29 +00:00
Fangrui Song
0536175cbb [test] Fix some unused check prefixes in test/Analysis/CostModel/X86 2020-10-31 23:29:57 -07:00
David Green
e722c3abc4 [ARM] Fix crash for gather of pointer costs.
If the elt size is unknown due to it being a pointer, a comparison
against 0 will cause an assert. Make sure the elt size is large enough
before comparing and for the moment just return the scalar cost.
2020-10-31 13:10:14 +00:00
Florian Hahn
f44af1a603 Revert "[TTI] Add VecPred argument to getCmpSelInstrCost."
This reverts commit 73f01e3df58dca9d1596440b866b52929e3878de.

This appears to break
http://lab.llvm.org:8011/#/builders/85/builds/383.
2020-10-30 21:26:14 +00:00
Sanjay Patel
a82a184b76 [x86] add cost overrides for mul with overflow
I'm assuming the standard size integer instructions for this end up as something like:
mulq %rsi
seto %al

And the 'mul' generally has reciprocal throughput of 1 on typical implementations
(higher latency, but that's not handled here).
The default costs may end up much higher than that, and that's what we see in the test diffs.

Vector types are left as a 'TODO'.

Differential Revision: https://reviews.llvm.org/D90431
2020-10-30 12:38:16 -04:00
Florian Hahn
d8012eedb7 [TTI] Add VecPred argument to getCmpSelInstrCost.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.

This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.

This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.

I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.

Reviewed By: dmgreen, RKSimon

Differential Revision: https://reviews.llvm.org/D90070
2020-10-30 13:49:08 +00:00
Sanjay Patel
b21114cf03 [x86] add test for umul intrinsic costs; NFC 2020-10-29 12:12:52 -04:00
Sanjay Patel
ac57df4098 [CostModel][x86] remove cost-kind predicate for intrinsic costs
We model cost as number of instructions / uops, so it does not
make sense to treat size/blended costs any differently than
throughput.
2020-10-28 14:33:37 -04:00
Sanjay Patel
4542ce5444 [CostModel] remove cost-kind predicate for funnel shift costs
Completing the series of FIXME removals for special-case intrinsics:
50dfa19cc799
f2c25c70791d
c963bde0152a
01ea93d85d6e

This one looks quite different than the others. The size/blended
cost is still potentially very far off from the throughput cost,
but this is hopefully not worse on the whole. It looks like the
underlying costs for the expanded shift/logic have their own
cost-kind limitations. Also, we are not asking the target if
it has a legal funnel shift op, so we just assume that the
intrinsic gets expanded.
2020-10-28 14:02:34 -04:00
David Green
61a282c3b7 [AArch64] Remove AArch64ISD::NOT, use vnot instead
vnot (xor -1) should be equivalent to the AArch64 specific AArch64ISD::NOT
node, but allow more folding thanks to all the target independent
optimizations. Specifically this allows select(icmp ne, x, y) to
become "cmeq; bsl y, x" as opposed to needing to convert the predicate
with "cmeq; mvn; bsl x, y"

Unfortunately there is a regression in a cmtst test, but the code it
selected from was already non-canonical, with instcombine preferring to
use an eq predicate instead. Plus the more common case of icmp ne is
improved.

Differential Revision: https://reviews.llvm.org/D90126
2020-10-28 08:15:37 +00:00
Sanjay Patel
b47683ecbd [CostModel] remove cost-kind predicate for FP add/mul vector reduction costs
This was originally part of:
f2c25c70791d
but that was reverted because there was an underlying bug in
processing the vector type of these intrinsics. That was
fixed with:
74ffc823ed21

This is similar in spirit to 01ea93d85d6e (memcpy) except that
here the underlying caller assumptions were created for vectorizer
use (throughput) rather than other passes.

That meant targets could have an enormous throughput cost with no
corresponding size, latency, or blended cost increase.

Paraphrasing from the previous commits:
This may not make sense for some callers, but at least now the
costs will be consistently wrong instead of mysteriously wrong.

Targets should provide better overrides if the current modeling
is not accurate.
2020-10-27 18:00:20 -04:00
Sanjay Patel
ab0890e928 [CostModel] add tests for FP reductions; NFC 2020-10-27 18:00:20 -04:00
Bing1 Yu
450c5e89bb [CostModel][X86] teach TTI calculate cost of chain of vector inserts/extracts more precisely and correctly:In each 128-lane, if there is at least one index is demanded and not all indices are demanded...
In each 128-lane, if there is at least one index is demanded and not all
indices are demanded and this 128-lane is not the first 128-lane of the
legalized-vector, then this 128-lane needs a extracti128;
If in each 128-lane, there is at least one index is demanded, this 128-lane
needs a inserti128.

The following cases will help you build a better understanding:
Assume we insert several elements into a v8i32 vector in avx2,
Case#1: inserting into 1th index needs vpinsrd + inserti128
Case#2: inserting into 5th index needs extracti128 + vpinsrd +
inserti128
Case#3: inserting into 4,5,6,7 index needs 4*vpinsrd + inserti128.

Reviewed By: pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D89767
2020-10-27 11:21:13 +08:00
Joe Ellis
a376939c97 [SVE][AArch64] Fix TypeSize warning in GEP cost analysis
The warning would fire when calling getGEPCost for analyzing the cost of
a GEP instruction. This would result in the use of the now deprecated
implicit cast of TypeSize to uint64_t through the overloaded operator.

This patch fixes the issue by using getKnownMinSize instead of the
implicit cast. This is possible because the code is already
scalable-vector aware. The semantic behaviour of the code is unchanged
by this patch.

Reviewed By: sdesmalen, fpetrogalli

Differential Revision: https://reviews.llvm.org/D89872
2020-10-26 17:40:19 +00:00
Tyker
1d03399e60 [Annotation] Allows annotation to carry some additional constant arguments.
This allows using annotation in a much more contexts than it currently has.
especially when annotation with template or constexpr.

Reviewed By: aaron.ballman

Differential Revision: https://reviews.llvm.org/D88645
2020-10-26 10:50:05 +01:00
Sanjay Patel
fe343140eb [CostModel] remove cost-kind predicate for some vector reduction costs
This is a modified 2nd try of 22d10b8ab44f
(reverted by 1c8371692d because it managed
to expose an existing crashing bug that should be fixed by
74ffc823 ).

Original commit message:

This is similar in spirit to 01ea93d85d6e (memcpy) except that
here the underlying caller assumptions were created for vectorizer
use (throughput) rather than other passes.

That meant targets could have an enormous throughput cost with no
corresponding size, latency, or blended cost increase.
The ARM costs show a small difference between throughput and
size because there's an underlying difference in cmp/sel
costs that is also predicated on cost-kind.

Paraphrasing from the previous commits:
This may not make sense for some callers, but at least now the
costs will be consistently wrong instead of mysteriously wrong.

Targets should provide better overrides if the current modeling
is not accurate.
2020-10-25 15:17:52 -04:00
Sanjay Patel
74732bce4a [CostModel] fix operand/type accounting for fadd/fmul reductions
I'm not sure if/how this ever worked, but it must not be tested
currently because the basic tests added here were crashing as
noted in the post-review comments for 1c83716 (which reverted
another cost-model fix in 22d10b8ab44f).
2020-10-25 15:01:19 -04:00
Martin Storsjö
e7cc77c1de Revert "[CostModel] remove cost-kind predicate for vector reduction costs"
This reverts commit 22d10b8ab44f703b72b8316a9b3b8adc623ca73f.

This broke compilation e.g. like this:
$ cat synth.c
*a;
float *b;
c() {
  for (;;) {
    float d = -*b * *a++;
    d -= *--b * *a++;
    d -= *--b * *a;
    d -= *--b * *a;
    e(d);
  }
}
$ clang -target x86_64-linux-gnu -c -O2 -ffast-math synth.c
clang: ../include/llvm/Support/Casting.h:104: static bool llvm::isa_impl
_cl<To, const From*>::doit(const From*) [with To = llvm::PointerType; Fr
om = llvm::Type]: Assertion `Val && "isa<> used on a null pointer"' fail
ed.
2020-10-25 08:47:54 +02:00
Sanjay Patel
ed4fa653c2 [CostModel] remove cost-kind predicate for vector reduction costs
This is similar in spirit to 01ea93d85d6e (memcpy) except that
here the underlying caller assumptions were created for vectorizer
use (throughput) rather than other passes.

That meant targets could have an enormous throughput cost with no
corresponding size, latency, or blended cost increase.
The ARM costs show a small difference between throughput and
size because there's an underlying difference in cmp/sel
costs that is also predicated on cost-kind.

Paraphrasing from the previous commits:
This may not make sense for some callers, but at least now the
costs will be consistently wrong instead of mysteriously wrong.

Targets should provide better overrides if the current modeling
is not accurate.
2020-10-24 13:20:17 -04:00
dfukalov
efaecfc60e [AMDGPU][CostModel] Refine cost model for half- and quarter-rate instructions.
1. Throughput and codesize costs estimations was separated and updated.
2. Updated fdiv cost estimation for different cases.
3. Added scalarization processing for types that are treated as !isSimple() to
improve codesize estimation in getArithmeticInstrCost() and
getArithmeticInstrCost(). The code was borrowed from TCK_RecipThroughput path
of base implementation.

Next step is unify scalarization part in base class that is currently works for
TCK_RecipThroughput path only.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D89973
2020-10-24 19:53:08 +03:00
Florian Hahn
9f772c5b7d [AArch64] Add vector compare/select cost-model tests. 2020-10-23 20:43:04 +01:00
Florian Hahn
785022033e [AArch64] Implement getIntrinsicInstrCost, handle min/max intrinsics.
This patch adds a specialized implementation of getIntrinsicInstrCost
and add initial cost-modeling for min/max vector intrinsics.

AArch64 NEON support umin/smin/umax/smax for vectors
<8 x i8>, <16 x i8>, <4 x i16>, <8 x i16>, <2 x i32> and <4 x i32>.
Notably, it does not support vectors with i64 elements.

This change by itself should have very little impact on codegen, but in
follow-up patches I plan to teach the vectorizers to consider using
those intrinsics on platforms where it is profitable, e.g. because there
is no general 'select'-like instruction.

The current cost returned should be better for throughput, latency and size.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D89953
2020-10-23 11:32:42 +01:00
Florian Hahn
0d2c92bfc9 [AArch64] Add min/max cost-model tests for v2i32. 2020-10-22 16:04:13 +01:00
Florian Hahn
2db9fbe33c [AArch64] Add min/max cost-model tests for v4i16. 2020-10-22 15:47:50 +01:00
Florian Hahn
c258c898fb [AArch64] Add cost model tests for min/max intrinsics. 2020-10-22 13:28:04 +01:00
Sanjay Patel
3393df6056 [CostModel] remove cost-kind predicate for scatter/gather cost
This is similar in spirit to 01ea93d85d6e (memcpy) except that
here the underlying caller assumptions were created for vectorizer
use (throughput) rather than other passes.

That meant ARM could have an enormous throughput cost with no
corresponding size, latency, or blended cost increase. X86 has
the same throughput restriction as the basic implementation, so
it is still unchanged.

Paraphrasing from the previous commit:
This may not make sense for some callers, but at least now the
costs will be consistently wrong instead of mysteriously wrong.

Targets should provide better overrides if the current modeling
is not accurate.
2020-10-21 14:26:05 -04:00
Sanjay Patel
7e2375171f [ARM] add cost-kind tests for intrinsics; NFC
This is a copy of the x86 file to provide better coverage;
x86 may have strange overrides that mask changes in the
generic model.
2020-10-21 14:26:04 -04:00
Sanjay Patel
d345d7741b [CostModel] remove cost-kind predicate for memcpy cost
The default implementation base returns TCC_Expensive (currently
set to '4'), so that explains the test diff. This probably does
not make sense for most callers, but at least now the costs will
be consistently wrong instead of mysteriously wrong.

The ARM target has an override that tries to model codegen expansion,
and that should likely be adapted for general usage.

This probably does not affect anything because the vectorizers are
the primary users of the throughput cost, but memcpy is not listed
as a trivially vectorizable intrinsic.
2020-10-21 08:50:44 -04:00
David Green
e1c6fdf0df [ARM] Basic getArithmeticReductionCost reduction costs
This adds some basic costs for MVE reductions - currently just costing
the simple legal add vectors as a single MVE instruction. More complex
costing can be added in the future when the framework more readily
allows it.

Differential Revision: https://reviews.llvm.org/D88980
2020-10-17 10:29:00 +01:00
David Green
afa83bbf3d [ARM] Add a very basic active_lane_mask cost
This adds a very basic cost for active_lane_mask under MVE - making the
assumption that they will be free and then apologizing for that in a
comment.

In reality they may either be free (by being nicely folded into a tail
predicated loop), cost the same as a VCTP or be expanded into vdup's,
adds and cmp's. It is difficult to detect the difference from a single
getIntrinsicInstrCost call, so makes the assumption that the vectorizer
is adding them, and only added them where it makes sense.

We may need to change this in the future to better model predicate costs
in the vectorizer, especially at -Os or non-tail predicated loops. The
vectorizer currently does not query the cost of these instructions but
that will change in the future and a zero cost there probably makes the
most sense at the moment.

Differential Revision: https://reviews.llvm.org/D88989
2020-10-17 10:09:42 +01:00
Sanjay Patel
21d0e78943 [CostModel] remove cost-kind predicate for ctlz/cttz intrinsics in basic TTI implementation
The cost modeling for intrinsics is a patchwork based on different
expectations from the callers, so it's a mess. I'm hoping to untangle
this to allow canonicalization to the new min/max intrinsics in IR.
The general goal is to remove the cost-kind restriction here in the
basic implementation class. Ie, if some intrinsic has throughput cost
of 104, assume that it has the same size, latency, and blended costs.
Effectively, an intrinsic with cost N is composed of N simple
instructions. If that's not correct, the target should provide a more
accurate override.

The x86-64 SSE2 subtarget cost diffs require explanation:

1. The scalar ctlz/cttz are assuming "BSR+XOR+CMOV" or
   "TEST+BSF+CMOV/BRANCH", so not cheap.
2. The 128-bit SSE vector width versions assume cost of 18 or 26
   (no explanation provided in the tables, but this corresponds to a
   bunch of shift/logic/compare).
3. The 512-bit vectors in the test file are scaled up by a factor of
   4 from the legal vector width costs.
4. The plain latency cost-kind is not affected in this patch because
   that calc is diverted before we get to getIntrinsicInstrCost().

Differential Revision: https://reviews.llvm.org/D89461
2020-10-15 13:14:41 -04:00
Sanjay Patel
8874abb153 [CostModel] rearrange basic intrinsic cost implementation
This is bigger/uglier than before, but it should allow fixing
all of the broken paths more easily. Test coverage added with
rGfab028b and other commits.

This is not NFC - the scalable vector test would crash
without this patch.
2020-10-13 11:52:00 -04:00
Sanjay Patel
9dd0476f6d [x86] add cost model test for memcpy; NFC
This is treated as a special-case in the base class
implementation of getIntrinsicInstrCost().
2020-10-13 11:42:44 -04:00
Sanjay Patel
4b1b25be42 [AArch64] fix spacing in test's RUN lines; NFC 2020-10-13 10:44:18 -04:00
Sanjay Patel
b896e3d245 [x86] add tests for cost model kinds of intrinsics; NFC
This provides coverage for existing special-cases and
a sampling of other intrinsics. Current output appears
to be wrong in several cases.
2020-10-13 10:39:43 -04:00
Sanjay Patel
8b55364bd6 [AArch64] add cost model test for scalable vector math; NFC
Testing for the various cost model "TargetCostKind" is limited,
and testing for scalable vectors is limited. The motivating
example of an intrinsic is not included here yet because that
just crashes.
2020-10-13 08:39:04 -04:00
Simon Pilgrim
c913b82065 [X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16.
This is my first LLVM patch, so please tell me if there are any process issues.

The main observation for this patch is that we can lower UMIN/UMAX with v8i16 by using unsigned saturated subtractions in a clever way. Previously this operation was lowered by turning the signbit of both inputs and the output which turns the unsigned minimum/maximum into a signed one.

We could use this trick in reverse for lowering SMIN/SMAX with v16i8 instead. In terms of latency/throughput this is the needs one large move instruction. It's just that the sign bit turning has an increased chance of being optimized further. This is particularly apparent in the "reduce" test cases. However due to the slight regression in the single use case, this patch no longer proposes this.

Unfortunately this argument also applies in reverse to the new lowering of UMIN/UMAX with v8i16 which regresses the "horizontal-reduce-umax", "horizontal-reduce-umin", "vector-reduce-umin" and "vector-reduce-umax" test cases a bit with this patch. Maybe some extra casework would be possible to avoid this. However independent of that I believe that the benefits in the common case of just 1 to 3 chained min/max instructions outweighs the downsides in that specific case.

Patch By: @TomHender (Tom Hender) ActuallyaDeviloper

Differential Revision: https://reviews.llvm.org/D87236
2020-10-11 11:21:23 +01:00
David Green
863d69bc9d [ARM] Add MVE vecreduce costmodel tests. NFC
There were some existing tests that were not super useful. New ones are
added for testing MVE specific patterns.
2020-10-09 16:25:25 +01:00
Amara Emerson
59c2440372 [llvm][mlir] Promote the experimental reduction intrinsics to be first class intrinsics.
This change renames the intrinsics to not have "experimental" in the name.

The autoupgrader will handle legacy intrinsics.

Relevant ML thread: http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html

Differential Revision: https://reviews.llvm.org/D88787
2020-10-07 10:36:44 -07:00
Sanjay Patel
9fe6872096 [CostModel] add cl option to check size and latency costs; NFC
This is a setting used by SimplifyCFG, LoopUnroll, and InlineCost,
but there is apparently no direct test coverage for any of those
cost model values.
2020-09-27 09:52:56 -04:00
Jonas Paulsson
2e5f4ba2ce [SystemZ] Make sure not to call getZExtValue on a >64 bit constant.
Better use isZero() and isIntN() in SystemZTargetTransformInfo rather than
calling getZExtValue() since the immediate operand may be wider than 64 bits,
which is not allowed with getZExtValue().

Fixes https://bugs.llvm.org/show_bug.cgi?id=47600

Review: Simon Pilgrim
2020-09-23 15:36:32 +02:00
Bing1 Yu
9b182096ba [CostModel][X86] add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)
add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D87884
2020-09-23 10:29:10 +08:00
Simon Pilgrim
1e7924bec0 [CostModel][X86] Add some select shuffle costs tests for D87884 2020-09-21 16:09:05 +01:00