When widening, each half of the v2s16 operands needs to be sign extended
for G_ASHR or zero extended for G_LSHR.
Differential Revision: https://reviews.llvm.org/D96048
SALU min/max s32 instructions exist so use them. This means that
regbankselect can handle min/max much like add/sub/mul/shifts.
Differential Revision: https://reviews.llvm.org/D96047
If amdgpu-unsafe-fp-atomics is specified, allow {flat|global}_atomic_add_f32 even if atomic modes don't match.
Differential Revision: https://reviews.llvm.org/D95391
It was discussed a few years ago and agreed that it makes sense to
remove this assertion as other targets do not perform similar register
size checking in inline assembly constraint logic, so the check just
adds a needless barrier on AVR.
This patch removes the assertion and removes 'XFAIL' from two Generic
CodeGen tests for AVR as a result.
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat a f16 vector as an i8.
Add f16 handling, treating them like an i16.
Differential Revision: https://reviews.llvm.org/D95446
When SGPRs are spilled to VGPRs, they can overwrite any lane. We need
to preserve the value of inactive lanes in function calls, so we save
the register even if it is marked as caller saved.
Also, teach buildPrologSpill to work when no registers are free like in
CodeGen/AMDGPU/pei-scavenge-vgpr-spill.mir and update the comment on
findScratchNonCalleeSaveRegister as it is not used anymore to realign
the stack pointer since D95865.
Differential Revision: https://reviews.llvm.org/D95946
Similar to the G_PTR_ADD + G_LOAD twiddling we do in `preISelLower`.
The imported patterns expect scalars only, so they can't handle things like
```
G_STORE %ptr1, %ptr2
```
To get around this, use s64 instead.
(This probably makes a good portion of the manual selection code for G_STORE
dead.)
This is a 0.2% geomean code size improvement on CTMark at -Os.
(Best is consumer-typeset @ -0.7%)
Differential Revision: https://reviews.llvm.org/D95908
When we have a zeroext parameter coming in on the stack, build
```
%x = G_LOAD ...
%x_assert_zext = G_ASSERT_ZEXT %x, narrow_size
%trunc = G_TRUNC %x_assert_zext
```
Rather than just loading into the truncated type.
This allows us to optimize cases like this: https://godbolt.org/z/vfjhW8
Differential Revision: https://reviews.llvm.org/D95805
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are. This will
allow moving the special input registers out of the normal argument
range.
Extends D95779 to permit insertion into float/doubles vectors while avoiding a lot of aliased memory traffic.
The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.
SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.
Differential Revision: https://reviews.llvm.org/D95866
This reverts commits 62af0305b7cc..677a3529d3e6 from D93708.
They cause failures in the sanitizer builds because of uninitialized
values.
A fix is in D95878, but it might take some time until this is pushed,
so reverting the changes for now.
This patch adds a cost model for SK_Broadcast in
AArch64TTIImpl::getShuffleCost with scalable vector.
Without this patch, the scalable vector type relies on BasicTTIImpl cost
implementation and assert.
Differential Revision: https://reviews.llvm.org/D95598
We don't register i128 as a legal type with addRegisterClass, but it
appears in the list of legal register types. This inconsistency
resulted in the asm constraint lowering trying to use 2 128-bit
registers for these operands. This would leave behind a dead def that
would waste registers.
Regresses GlobalISel tests for i128 load/store, but these aren't very
important right now. Ideally these would not depend on the list of
register types.
This should only consider whether the pressure impact of the bundle at
the given point in the program will decrease the occupancy. High VGPR
pressure was incorrectly blocking the formation of scalar bundles, and
vice versa. This was also blocking bundling from high pressure
situations at other points in the program.
Second land attempt. MachineVerifier DefRegState expensive check errors fixed.
Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.
This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.
Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling X30/LR. To handle
the LR case, his patch custom-outlines prologs and epilogs in place. It does
this by doing the following:
* Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
Epilog Injection Pass.
* Lowers and optimizes said pseudos in a AArchLowerHomogneousPrologEpilog Pass.
* Outlined helpers are created on demand. Identical helpers are merged by the linker.
* An opt-in flag is introduced to enable this feature. Another threshold flag
is also introduced to control the aggressiveness of outlining for application's need.
This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.
Differential Revision: https://reviews.llvm.org/D76570
Due to a clerical error, the sdiv operation was mapping to vdivu and
udiv to vdiv, when the opposite mapping is the correct one.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D95869
This allows the peephole optimizer to know that a MVE_VMOV_to_lane_32 is
the same as an insert subreg, allowing it to optimize some redundant
lane moves.
Differential Revision: https://reviews.llvm.org/D95433
The temporary register is only used to compute the frame pointer.
The frame pointer is overwritten and not used in between, so we
can reuse the frame pointer for the computation, saving one register.
Differential Revision: https://reviews.llvm.org/D95865
Saving callee-save registers happens in whole wave mode. Exec is saved
to a free register, which can be reused to save the frame pointer.
Therefore, saving the fp needs to happen after saving csrs.
Differential Revision: https://reviews.llvm.org/D95861
NOTE: This patch was originally written by Anil Mahmud. His code has been
rebased but otherwise left mostly unchanged.
A new instructon on Power 10 allows for the materialization of 34 bit
immediate values. This patch allows the compiler to take advantage of
the new instruction in this situation.
Reviewed By: amyk
Differential Revision: https://reviews.llvm.org/D92879
A v4i32 insert of an extract can become a simple lane move, as opposed
to round-tripping via a GPR. This adds a patterns that turns an v4i32
insert-extract pair into a EXTRACT_SUBREG/INSERT_SUBREG, with the
required COPY_TO_REGCLASS. These get better optimized into a simple lane
move by the rest of the backend.
Differential Revision: https://reviews.llvm.org/D95428
This patch adds tablegen patterns for pairs of i16/f16 insert/extracts.
If we are inserting into two adjacent vector lanes (0 and 1 for
example), we can use either a vmov;vins or vmovx;vins to insert the pair
together, avoiding a round-trip from GRP registers. This is quite a
large patterns with a number of EXTRACT_SUBREG/INSERT_SUBREG/
COPY_TO_REGCLASS nodes, but hopefully as most of those become copies all
that will be cleaned up by further optimizations.
The VINS pattern was also adjusted to allow it to represent that it is
inserting into the top half of an existing register.
Differential Revision: https://reviews.llvm.org/D95381
With predicate masks, AVX512 can efficiently perform variable-index vector insertion with 2 broadcasts + 1 comparison, avoiding a lot of aliased memory traffic.
Differential Revision: https://reviews.llvm.org/D95779
For x86-64 the REX.w prefix takes precedence over any other size
override (i.e. 0x66). Therefore, for x86-64 when REX.w is present set
'hasOpSize' to false to ensure that any size override is ignored.
Fixes PR48901.
Differential Revision: https://reviews.llvm.org/D95682
A DLS lr, lr instruction only moves lr to itself. It need not be emitted
on it's own to save a instruction in the loop preheader.
Differential Revision: https://reviews.llvm.org/D78916
This patch updates IRBuilder::CreateMaskedGather/Scatter to work
with ScalableVectorType and adds isLegalMaskedGather/Scatter functions
to AArch64TargetTransformInfo. In addition I've fixed up
isLegalMaskedLoad/Store to return true for supported scalar types,
since this is what the vectorizer asks for.
In LoopVectorize.cpp I've changed
LoopVectorizationCostModel::getInterleaveGroupCost to return an invalid
cost for scalable vectors, since currently this relies upon using shuffle
vector for reversing vectors. In addition, in
LoopVectorizationCostModel::setCostBasedWideningDecision I have assumed
that the cost of scalarising memory ops is infinitely expensive.
I have added some simple masked load/store and gather/scatter tests,
including cases where we use gathers and scatters for conditional invariant
loads and stores.
Differential Revision: https://reviews.llvm.org/D95350
I guess instructions were marked as frame-setup by accident, they are
restores as part of the epilog.
Differential Revision: https://reviews.llvm.org/D95783