mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-02-01 13:11:39 +01:00

1078 Commits

Nicholas Guy
f4899fdead [ARM] Rearrange SizeReduction when using -Oz
Move the Thumb2SizeReduce pass to before IfConversion when optimising
for minimal code size.

Running the Thumb2SizeReduction pass before IfConversion allows T1
instructions to propagate to the final output, rather than the
IfConverter modifying T2 instructions and preventing them from being
reduced later.

This change does introduce a regression regarding execution time, so
it's only applied when optimising for size.

Running the LLVM Test Suite with this change produces a geomean
difference of -0.1% for the size..text metric.

Differential Revision: https://reviews.llvm.org/D82439
2020-07-02 09:19:38 +01:00
Sam Parker
dd087b6361 [NFC][ARM] Add test. 2020-07-01 09:28:56 +01:00
Sam Parker
e85afea7aa [ARM][LowOverheadLoops] Handle reductions
While validating live-out values, record instructions that look like
a reduction. This comprises a vector op (for now only vadd), a vorr
(vmov) which stores the previous value of the vadd, and then a vpsel
in the exit block which is predicated upon a vctp. This vctp will
combine the last two iterations using the vmov and vadd into a vector
which can then be consumed by a vaddv.

Once we have determined that it's safe to perform tail-predication,
we need to change this sequence of instructions so that the
predication doesn't produce incorrect code. This involves changing
the register allocation of the vadd so it updates itself and the
predication on the final iteration will not update the falsely
predicated lanes. This mimics what the vmov, vctp and vpsel do and
so we then don't need any of those instructions.
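
As an illustration only (not taken from the patch; names are hypothetical),
the kind of source loop whose vectorised form produces the vadd/vorr/vpsel
sequence described above is a plain integer add reduction:

  #include <stdint.h>

  // Add reduction: the vectorised loop performs a vector add per
  // iteration, keeps the previous partial sums for the predicated tail,
  // and a final vaddv collapses the partial sums into one scalar.
  uint32_t add_reduce(const uint32_t *a, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
      sum += a[i];
    return sum;
  }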

Differential Revision: https://reviews.llvm.org/D75533
2020-07-01 08:31:49 +01:00
Samuel Tebbs
dcd6d8787a [ARM] Allow the fabs intrinsic to be tail predicated
This patch stops the fabs intrinsic from blocking tail predication.
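
For illustration (a hedged sketch, not from the patch), a loop such as the
following emits the llvm.fabs intrinsic that previously blocked
tail-predication:

  #include <math.h>

  // Each element is replaced by its absolute value; fabsf lowers to the
  // llvm.fabs intrinsic, which is now allowed in tail-predicated loops.
  void abs_all(float *a, int n) {
    for (int i = 0; i < n; i++)
      a[i] = fabsf(a[i]);
  }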

Differential Revision: https://reviews.llvm.org/D82570
2020-06-30 17:27:28 +01:00
Samuel Tebbs
d908315744 [ARM] Allow the usub_sat and ssub_sat intrinsics to be tail predicated
This patch stops the usub_sat and ssub_sat intrinsics from blocking tail predication.
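
As a hedged example of where these intrinsics come from, the usual C
clamping idiom below is typically recognised by LLVM as usub.sat (function
names here are hypothetical):

  #include <stdint.h>

  // Unsigned saturating subtract: the compare-and-select idiom can be
  // folded into the usub.sat intrinsic, which may now appear in
  // tail-predicated loops.
  void sub_sat_all(uint32_t *d, const uint32_t *a, const uint32_t *b, int n) {
    for (int i = 0; i < n; i++)
      d[i] = a[i] > b[i] ? a[i] - b[i] : 0;
  }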

Differential Revision: https://reviews.llvm.org/D82571
2020-06-30 17:16:58 +01:00
Sjoerd Meijer
4fb902bfc8 [ARM][MVE] Tail-predication: clean-up of unused code
After the rewrite of this pass (D79175) I missed one thing: the inserted VCTP
intrinsic can be cloned to exit blocks if instructions that perform the same
operation are present there, but this wasn't triggering anymore. However, it
turns out that for handling reductions (see D75533) it's actually easier not
to have the VCTP in exit blocks, so this removes that code.

This was possible because the other code that depended on this
(rematerialization of the trip count to enable more dead code removal later)
wasn't doing much anymore, due to the more aggressive dead code removal that
was added to the low-overhead loops pass.

Differential Revision: https://reviews.llvm.org/D82773
2020-06-30 17:09:36 +01:00
Samuel Tebbs
4fc642e8a5 [ARM] Allow rounding intrinsics to be tail predicated
This patch stops the trunc, rint, round, floor and ceil intrinsics from blocking tail predication.
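
A representative loop (hypothetical, not from the patch) that emits one of
these rounding intrinsics:

  #include <math.h>

  // floorf lowers to the llvm.floor intrinsic; similar loops using truncf,
  // rintf, roundf or ceilf hit the other intrinsics listed above.
  void floor_all(float *a, int n) {
    for (int i = 0; i < n; i++)
      a[i] = floorf(a[i]);
  }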

Differential Revision: https://reviews.llvm.org/D82553
2020-06-30 16:52:25 +01:00
Sam Parker
5b21582cc5 [NFC][ARM] Tail predication reduction tests 2020-06-30 13:27:22 +01:00
David Green
6b973734b0 [ARM] Better reductions
MVE has native reductions for integer add and min/max. The others need
to be expanded to a series of extracts and scalar operations to reduce
the vector into a single scalar. The default codegen for that expands
the reduction into a series of in-order operations.

This modifies that to something more suitable for MVE. The basic idea is
to use vector operations until there are 4 remaining items then switch
to pairwise operations. For example a v8f16 fadd reduction would become:
Y = VREV X
Z = ADD(X, Y)
z0 = Z[0] + Z[1]
z1 = Z[2] + Z[3]
return z0 + z1

The awkwardness (there is always some) comes from something like a
v4f16, which is first legalized by adding identity values to the extra
lanes of the reduction; those inserts can then not be optimized away
through the vrev/fadd combo and remain. I've made sure they are custom
lowered so that we can produce the pairwise additions before the extra
values are added.
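
A scalar rendering of the scheme (illustrative only, hypothetical names), to
make the pseudo-code above concrete for eight lanes:

  // One whole-vector add of X against its reverse, then pairwise adds of
  // the first four lanes; every input element is counted exactly once.
  float reduce8(const float x[8]) {
    float z[4];
    for (int i = 0; i < 4; i++)
      z[i] = x[i] + x[7 - i];    // Z = ADD(X, VREV X), lanes 0..3
    float z0 = z[0] + z[1];
    float z1 = z[2] + z[3];
    return z0 + z1;
  }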

Differential Revision: https://reviews.llvm.org/D81397
2020-06-29 16:04:13 +01:00
David Green
33b3ee4e0a [ARM] VCVTT fpround instruction selection
Similar to the recent patch for fpext, this adds vcvtb and vcvtt with
insert into vector instruction selection patterns for fptruncs. This
helps clear up a lot of register shuffling that we would otherwise do.

Differential Revision: https://reviews.llvm.org/D81637
2020-06-26 10:24:06 +01:00
David Green
6ea3fc012b [ARM] VCVTT instruction selection
We currently extract and convert from a top lane of an f16 vector using a
VMOVX;VCVTB pair. We can simplify that to use a single VCVTT. The
pattern is mostly copied from a vector extract pattern, but produces a
VCVTTHS f32 directly.

This had to move some code around so that ARMInstrVFP had access to the
required pattern frags that were previously part of ARMInstrNEON.

Differential Revision: https://reviews.llvm.org/D81556
2020-06-26 08:58:55 +01:00
Sjoerd Meijer
ef2fc8e627 [SelectionDAG] Lower @llvm.get.active.lane.mask to setcc
This lowers the intrinsic @llvm.get.active.lane.mask to a setcc node, i.e. an
icmp ule, and creates vectors for its two arguments on which the comparison
is performed.
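
A hedged scalar model of this lowering (assuming, as at the time of this
patch, that the second operand is the backedge-taken count and the compare
is unsigned less-or-equal; names are hypothetical):

  #include <stdbool.h>
  #include <stdint.h>

  // Lane i of the mask is active when base + i <= btc (unsigned), which
  // is the icmp ule of a stepped vector against a splat of btc.
  void active_lane_mask(bool mask[4], uint32_t base, uint32_t btc) {
    for (uint32_t i = 0; i < 4; i++)
      mask[i] = (uint32_t)(base + i) <= btc;
  }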

Differential Revision: https://reviews.llvm.org/D82292
2020-06-26 07:46:38 +01:00
Sjoerd Meijer
9e7cf4b604 [ARM] Don't revert get.active.lane.mask in ARM Tail-Predication pass
Don't revert the intrinsic get.active.lane.mask here; this is moved to isel
legalization in D82292.

Differential Revision: https://reviews.llvm.org/D82105
2020-06-26 07:42:39 +01:00
David Green
bc2c8f2f3b [ARM] Split FPExt loads
This extends PerformSplittingToWideningLoad to also handle FP_Ext, as
well as sign and zero extends. It uses an integer extending load
followed by a VCVTL on the bottom lanes to efficiently perform an fpext
on a smaller than legal type.

The existing code had to be rewritten a little to not just split the
node in two and let legalization handle it from there, but to actually
split into legal chunks.
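
For illustration (hypothetical source, not from the patch), this is the kind
of loop that produces the fpext-of-load pattern now split into extending
loads plus VCVTL (_Float16 being the Clang type that maps to the IR half
type):

  // Load halves and widen to float: each fp_ext of a loaded chunk is
  // split into legal pieces, using an integer extending load whose bottom
  // lanes are then converted with VCVTL.
  void widen_all(float *dst, const _Float16 *src, int n) {
    for (int i = 0; i < n; i++)
      dst[i] = (float)src[i];
  }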

Differential Revision: https://reviews.llvm.org/D81340
2020-06-25 21:55:13 +01:00
David Green
f40cf5ffb9 [ARM] MVE VCVT lowering for f16->f32 extends
This adds code to lower f16 to f32 fp_exts using an MVE VCVT
instruction, similar to the recent patch for fp_trunc. Again it
goes through the lowering of a BUILD_VECTOR, but is slightly simpler,
only having to deal with interleaved indices. It adds a VCVTL node to
lower to, similar to VCVTN.

Differential Revision: https://reviews.llvm.org/D81339
2020-06-25 20:54:26 +01:00
David Green
5a95ea8473 [ARM] Add FP_ROUND handling to splitting MVE stores
This splits MVE vector stores of a fp_trunc in the same way that we do
for standard trunc's. It extends PerformSplittingToNarrowingStores to
handle fp_round, splitting the store into pieces and adding a VCVTNb to
perform the actual fp_round. The actual store is then converted to an
integer store so that it can truncate bottom lanes of the result.
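
Conversely (again a hypothetical sketch, not from the patch), truncating
stores of the following shape are the ones split into VCVTNb plus narrowing
integer stores:

  // Narrow float data to half on store: each fp_round feeding the store
  // is done by VCVTNb into the bottom lanes, and the store itself becomes
  // a truncating integer store.
  void narrow_all(_Float16 *dst, const float *src, int n) {
    for (int i = 0; i < n; i++)
      dst[i] = (_Float16)src[i];
  }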

Differential Revision: https://reviews.llvm.org/D81141
2020-06-25 19:37:15 +01:00
David Green
6e22f03ef5 [ARM] MVE VCVT lowering for f32->f16 truncs
This adds code to lower f32 to f16 fp_truncs using a pair of MVE VCVT
instructions. Due to v4f16 not being legal, fp_rounds are often split up
fairly early. So this reconstructs the vcvts from a buildvector of
fp_rounds from two vector inputs. Something like:

BUILDVECTOR(FP_ROUND(EXTRACT_ELT(X, 0),
            FP_ROUND(EXTRACT_ELT(Y, 0),
            FP_ROUND(EXTRACT_ELT(X, 1),
            FP_ROUND(EXTRACT_ELT(Y, 1), ...)

It adds a VCVTN node to handle this, which like VMOVN or VQMOVN lowers
into the top/bottom lanes of an MVE instruction.

Differential Revision: https://reviews.llvm.org/D81139
2020-06-25 15:59:36 +01:00
Sam Tebbs
2aff9bba46 [ARM] Allow tail predication on sadd_sat and uadd_sat intrinsics
This patch stops the sadd_sat and uadd_sat intrinsics from blocking tail predication.

Differential revision: https://reviews.llvm.org/D82377
2020-06-25 11:54:29 +01:00
David Green
680cf8ff46 [ARM] Mark more integer instructions as not having side effects.
LDRD and STRD along with UBFX and SBFX are selected from DAGToDAG
transforms, so they do not have tblgen patterns. Because of that they do
not get marked as having no side effects, and so cannot be scheduled as
efficiently as you would like.

This specifically marks them as not having side effects.

Differential Revision: https://reviews.llvm.org/D82358
2020-06-23 22:45:51 +01:00
Tres Popp
e6cca551e7 Revert "[CGP] Enable CodeGenPrepare's phi type conversion."
This reverts commit 67121d7b82ed78a47ea32f0c87b7317e2b469ab2.

This is causing compile times to be 2x slower on some large binaries.
2020-06-22 13:06:18 +02:00
David Green
2825171bc9 [CGP] Enable CodeGenPrepare's phi type conversion. 2020-06-21 16:46:16 +01:00
Sjoerd Meijer
c43371a1bf [ARM][MVE] tail-predication: renamed internal option.
Renamed -force-tail-predication to -force-mve-tail-predication because
that's more descriptive and consistent.
2020-06-19 15:07:06 +01:00
Lucas Prates
a515304bf7 [ARM] Supporting lowering of half-precision FP arguments and returns in AArch32's backend
Summary:
Half-precision floating point arguments and returns are currently
promoted to either float or int32 in clang's CodeGen and there's
no existing support for the lowering of `half` arguments and returns
from IR in AArch32's backend.

Such frontend coercions, implemented as coercion through memory in
clang, can cause a series of issues in argument lowering, such as
arguments being stored on the wrong bits on big-endian architectures
and missed overflow detection in the return of certain functions.

This patch introduces the handling of half-precision arguments and returns in
the backend using the actual "half" type on the IR. Using the "half" type,
the backend is able to properly enforce the AAPCS' directions for those
arguments, making sure they are stored on the proper bits of the registers
and performing the necessary floating point conversions.
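
A hedged illustration of the kind of signature affected (hypothetical
function; _Float16 is the Clang type that lowers to the IR "half" type):

  // With this change the half-precision argument and return value are
  // passed in the appropriate parts of the FP registers per the AAPCS,
  // rather than being coerced through int or float by the frontend.
  _Float16 half_mul(_Float16 x, _Float16 y) {
    return x * y;
  }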

Reviewers: rjmccall, olista01, asl, efriedma, ostannard, SjoerdMeijer

Reviewed By: ostannard

Subscribers: stuij, hiraditya, dmgreen, llvm-commits, chill, dnsampaio, danielkiss, kristof.beyls, cfe-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75169
2020-06-18 13:15:13 +01:00
Sjoerd Meijer
0d40769e87 [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask
To set up a tail-predicated loop, we need to calculate the number of
elements processed by the loop. We can now use the intrinsic
@llvm.get.active.lane.mask() to do this, which is emitted by the vectoriser in
D79100. This intrinsic generates a predicate for the masked loads/stores, and
consumes the Backedge Taken Count (BTC) as its second argument. We can now use
that to reconstruct the loop tripcount, instead of the IR pattern match
approach we were using before.
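
A hedged sketch of that reconstruction (assuming the usual relationship that
the loop processes BTC + 1 elements; names and the rounding helper are
hypothetical, not taken from the pass):

  #include <stdint.h>

  // The intrinsic carries the backedge-taken count (BTC), so the loop
  // processes BTC + 1 elements, and the tail-predicated hardware loop
  // needs ceil((BTC + 1) / lanes) iterations.
  uint32_t tp_iterations(uint32_t btc, uint32_t lanes) {
    uint32_t elements = btc + 1;            // may wrap: hence the overflow
                                            // checks described below
    return (elements + lanes - 1) / lanes;  // round up to whole vectors
  }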

Many thanks to Eli Friedman and Sam Parker for all their help with this work.

This also adds overflow checks for the different new expressions that we
create: the loop tripcount, and the sub expression that calculates the
remaining elements to be processed. For the latter, SCEV is not able to
calculate precise enough bounds, so we work around that at the moment; this
is not entirely correct yet, it's conservative. The overflow checks can be
overruled with a force flag, which is thus potentially unsafe (but not really,
because the vectoriser is the only place where this intrinsic is emitted at
the moment). It's also good to mention that the tail-predication pass is not
yet enabled by default. We will follow up to see if we can implement these
overflow checks better, either by a change in SCEV or by revising the
definition of llvm.get.active.lane.mask.

Differential Revision: https://reviews.llvm.org/D79175
2020-06-17 15:17:42 +01:00
David Green
9c4e61f00c [LSR] Filter for postinc formulae
In more complicated loops we can easily hit the complexity limits of
loop strength reduction. If we do and filtering occurs, it's all too
easy to remove the wrong formulae for post-inc preferring accesses, due
to LSR attempting to maximise register re-use. The patch adds an
alternative filtering step, used when the target prefers postinc, that
picks postinc formulae instead, hopefully lowering the complexity to
below the limit so that aggressive filtering is not needed.

There is also a change in here to stop considering existing addrecs as
free under postinc. We should already be modelling them as a reg so
don't want it to cause us to get the cost wrong. (I'm not sure that code
makes sense in general, but there are X86 tests specifically for it
where it seems to be helping so have left it around for the standard
non-post-inc case).

Differential Revision: https://reviews.llvm.org/D80273
2020-06-17 12:32:04 +01:00
David Green
3175ed5612 [ARM] Fix crash trying to generate i1 immediates
These code patterns attempt to call isVMOVModifiedImm on a splat of i1
values, leading to an unreachable being hit. I've guarded the call on a
more specific set of sizes, as i1 vectors are legal under MVE.

Differential Revision: https://reviews.llvm.org/D81860
2020-06-16 12:27:24 +01:00
David Green
8a5c6f3865 [ARM] Add some MVE vecreduce tests. NFC 2020-06-09 12:07:19 +01:00
David Green
499648daa2 [ARM] VQMOVN demand bits analysis
Similar to VMOVN, a VQMOVN will only demand the top/bottom lanes of its
first input. However, unlike VMOVN it will need access to the entire
second argument, as that value is saturated, not just moved in place.

Differential Revision: https://reviews.llvm.org/D80515
2020-06-05 18:41:02 +01:00
David Green
5cd7010367 [ARM] FP16 conversion tests. NFC 2020-06-04 13:13:56 +01:00
David Green
9a7508e506 [ARM] Extra MVE VMLAV reduction patterns
These patterns for i8 and i16 VMLAs were missing. They end up from
legalized vector.reduce.add.v8i16 and vector.reduce.add.v16i8, and
although the instruction works differently (the mul and add are
performed in a higher precision), I believe it is OK because only an
i8/i16 is demanded from them, and so the results will be the same. At
least, they pass any testing I can think to run on them.
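
An illustrative source loop (hypothetical, not from the patch) that
legalises to one of these reductions; because the accumulator is only 8 bits
wide, the higher-precision multiply/add inside VMLAV gives the same result:

  #include <stdint.h>

  // i8 multiply-accumulate reduction that can now select VMLAV/VMLAVA.
  uint8_t dot8(const uint8_t *a, const uint8_t *b, int n) {
    uint8_t acc = 0;
    for (int i = 0; i < n; i++)
      acc += a[i] * b[i];
    return acc;
  }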

There are some tests that end up looking worse, but are quite artificial
due to passing half vector types through a call boundary. I would not
expect the vmull to realistically come up like that, and a vmlava is
likely better a lot of the time.

Differential Revision: https://reviews.llvm.org/D80524
2020-05-29 16:23:24 +01:00
David Green
6307317d50 [ARM] More tests for MVE LSR and float issues. NFC 2020-05-28 22:04:12 +01:00
Victor Campos
8541a94229 [ARM] Fix rewrite of frame index in Thumb2's address mode i8s4
Summary:
In Thumb2's frame index rewriting process, the address mode i8s4, which
is used by LDRD and STRD instructions, is handled by taking the
immediate offset operand and multiplying it by 4.

This behaviour is wrong, however. In this specific address mode, the
MachineInstr's immediate operand is already in the expected form. By
consequence of that, multiplying it once more by 4 yields a flawed
offset value, four times greater than it should be.

Differential Revision: https://reviews.llvm.org/D80557
2020-05-27 13:09:13 +01:00
David Green
5d1d2066e7 [ARM] MVE VMINV/VMAXV test additions. NFC 2020-05-26 14:00:14 +01:00
David Green
92f08097a7 [ARM] VMULH tests for when other parts are working. NFC 2020-05-25 12:46:18 +01:00
Jean-Michel Gorius
2d66ce0e5e Revert "[CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias"
This temporarily reverts commit 7019cea26dfef5882c96f278c32d0f9c49a5e516.

It seems that, for some targets, there are instructions with a lot of memory operands (probably more than would be expected). This causes a lot of buildbots to time out and report failed builds. While investigations are ongoing to find out why this happens, revert the changes.
2020-05-22 21:26:46 +02:00
Jean-Michel Gorius
b6e158e140 [CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias
Summary:
To support all targets, the mayAlias member function needs to support instructions with multiple operands.

This revision also changes the order of the emitted instructions in some test cases.

Reviewers: efriedma, hfinkel, craig.topper, dmgreen

Reviewed By: efriedma

Subscribers: MatzeB, dmgreen, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80161
2020-05-21 23:02:54 +02:00
Sjoerd Meijer
a5cc4d095a [HardwareLoops] llvm.loop.decrement.reg definition
This is split off from D80316, slightly tightening the definition of the
overloaded hardware loop intrinsic llvm.loop.decrement.reg, specifying that
both its operands and its result have the same type.
2020-05-21 10:48:16 +01:00
Pierre-vh
66d4cf9f6d [Target][ARM] Make Low Overhead Loops coexist with VPT blocks.
Previously, the LowOverheadLoops pass couldn't handle VPT blocks
with conditions, or with multiple VCTPs. This patch improves the
LowOverheadLoops pass so it can handle those cases.

It also adds support for VCMPs before the VCTP.

Differential Revision: https://reviews.llvm.org/D78206
2020-05-20 12:24:55 +01:00
Sam Parker
c326fd4f01 [NFC][ARM] Add more tail predication tests 2020-05-19 14:01:10 +01:00
David Green
e11168f2ca [ARM] Patterns for VQSHRN
Given a VQMOVN(VSHR), we can fold that into a VQSHRN simply enough using
a few tablegen patterns.

Differential Revision: https://reviews.llvm.org/D77720
2020-05-16 17:46:43 +01:00
David Green
c1a15ae1a8 [ARM] Combines for VMOVN
This adds two combines for VMOVN, one to fold
VMOVN[tb](c, VQMOVNb(a, b)) => VQMOVN[tb](c, b)
The other to perform demand bits analysis on the lanes of a VMOVN. We
know that only the bottom lanes of the second operand and the top or
bottom lanes of the Qd operand are needed in the result, depending on
whether the VMOVN is bottom or top.

Differential Revision: https://reviews.llvm.org/D77718
2020-05-16 15:13:16 +01:00
David Green
4120e7a927 [ARM] MVE saturating truncates
This adds some custom lowering for VQMOVN, an instruction that can be
used to perform saturating truncates from a pair of min(max(X, -0x8000),
0x7fff), provided those constants are correct. This leaves a VQMOVNBs
which saturates the value and inserts that into the bottom lanes of an
existing vector. We then need to do something with the other lanes,
extending the value using a vmovlb.

Ideally, as will often be the case, only the bottom lane of what remains
will be demanded, allowing the vmovlb to be removed. That should mean
the instruction is either equal or a win most of the time, and it allows
some extra follow-up folding to happen.
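
As a hedged example, the min/max clamp pattern referred to above, written
in plain C (names are hypothetical):

  #include <stdint.h>

  // Clamp a 32-bit value to the signed 16-bit range before truncating:
  // the min(max(x, -0x8000), 0x7fff) pair is what the new VQMOVN
  // lowering looks for.
  void narrow_sat(int16_t *dst, const int32_t *src, int n) {
    for (int i = 0; i < n; i++) {
      int32_t x = src[i];
      if (x > 0x7fff)  x = 0x7fff;
      if (x < -0x8000) x = -0x8000;
      dst[i] = (int16_t)x;
    }
  }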

Differential Revision: https://reviews.llvm.org/D77590
2020-05-16 15:10:20 +01:00
David Green
b6aab3138f [ARM] Extra VQMOVN/VQSHRN tests. NFC 2020-05-16 14:23:26 +01:00
David Green
4c9b5189ca [ARM] Change more triples to arm-none-none-eabi. NFC 2020-05-15 22:53:07 +01:00
Anna Welker
8fb628942b [ARM][MVE] Add support for incrementing scatters
Adds support to build pre-incrementing scatters.
If the increment (i.e., add instruction) that is merged into
the scatter is the loop increment, an incrementing write-back
scatter can be built, which then assumes the role of the loop
increment.

Differential Revision: https://reviews.llvm.org/D79859
2020-05-15 17:02:00 +01:00
David Green
439830ee0e [ARM] Convert floating point splats to integer
Under MVE a vdup will always take a gpr register, not a floating point
value. During DAG combine we convert the types to a bitcast to an
integer in an attempt to fold the bitcast into other instructions. This
is OK, but only works inside the same basic block. To do the same trick
across a basic block boundary we need to convert the type in
codegenprepare, before the splat is sunk into the loop.

This adds a convertSplatType function to codegenprepare to do that,
putting bitcasts around the splat to force the type to an integer. There
is then some adjustment to the code in shouldSinkOperands to handle the
extra bitcasts.
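
An illustrative case (hypothetical, not from the patch): the splat of the
loop-invariant scalar "s" is created outside the loop body, so the bitcast
to integer now has to be placed by CodeGenPrepare before the splat is sunk:

  // "s" is splatted in the preheader; converting the splat to an integer
  // type lets the VDUP take a GPR when the splat is sunk into the loop.
  void scale_all(float *a, float s, int n) {
    for (int i = 0; i < n; i++)
      a[i] *= s;
  }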

Differential Revision: https://reviews.llvm.org/D78728
2020-05-13 15:24:16 +01:00
David Green
0021a951bd [ARM] Sink splats to fma intrinsics
Similar to fmul/fadd, we can sink a splat into a loop containing a fma
in order to use more register instruction variants. For that there are
also adjustments to the sinking code to handle more than 2 arguments.
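
A hypothetical loop of the kind this targets (illustrative only): the
loop-invariant scalar "s" feeds an fma, and sinking its splat into the loop
allows the scalar-operand form of the MVE VFMA to be used:

  #include <math.h>

  // fmaf lowers to the llvm.fma intrinsic with one splatted operand.
  void axpy(float *d, const float *a, const float *b, float s, int n) {
    for (int i = 0; i < n; i++)
      d[i] = fmaf(a[i], s, b[i]);
  }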

Differential Revision: https://reviews.llvm.org/D78386
2020-05-13 14:58:30 +01:00
Pierre-vh
39a7b5b535 [LSR][ARM] Add new TTI hook to mark some LSR chains as profitable
This patch adds a new TTI hook to allow targets to tell LSR that
a chain including some instruction is already profitable and
should not be optimized. This patch also adds an implementation
of this TTI hook for ARM so LSR doesn't optimize chains that include
the VCTP intrinsic.

Differential Revision: https://reviews.llvm.org/D79418
2020-05-13 14:18:28 +01:00
Pierre-vh
90e5c93ad7 [Target][ARM] Replace re-uses of old VPR values with VPNOTs
Differential Revision: https://reviews.llvm.org/D76847
2020-05-12 12:09:57 +01:00
Eli Friedman
f704804dd2 [SelectionDAG] Don't promote the alignment of allocas beyond the stack alignment.
allocas in LLVM IR have a specified alignment. When that alignment is
specified, the alloca has at least that alignment at runtime.

If the specified type of the alloca has a higher preferred alignment,
SelectionDAG currently ignores that specified alignment, and increases
the alignment. It does this even if it would trigger stack realignment.
I don't think this makes sense, so this patch changes that.

I was looking into this for SVE in particular: for SVE, overaligning
vscale'ed types is extra expensive because it requires realigning the
stack multiple times, or using dynamic allocation. (This currently isn't
implemented.)

I updated the expected assembly for a couple tests; in particular, for
arg-copy-elide.ll, the optimization in question does not increase the
alignment the way SelectionDAG normally would. For the rest, I just
increased the specified alignment on the allocas to match what
SelectionDAG was inferring.

Differential Revision: https://reviews.llvm.org/D79532
2020-05-11 17:39:00 -07:00