Noticed while trying to clean up the shift costs model for SSE4 targets using the script in D10369 - SLM double-pumps all the 128-bit vector conversion ops and only use FP0 pipe - numbers taken from Intel AOM + Agner.
This changes the cost to (LT.first-1) * cost(add) + 2, where the cost of
an add is assumed to be 1. This brings it inline with the other
reductions.
Differential Revision: https://reviews.llvm.org/D106240
This patch is in a series of patches to provide builtins for compatibility
with the XL compiler. This patch adds the builtin and intrinsic for "__stbcx".
Reviewed By: nemanjai, #powerpc
Differential revision: https://reviews.llvm.org/D106484
Lowering certain float vectors without legal vector types could cause a
crash due to a bad interaction between passing floats via GPRs and
argument splitting. Split vector floats appear just like scalar floats.
Under certain situations we choose to pass these float arguments via
GPRs and use an XLenVT location and set the 'BCvt' info to track how
they must be converted back to floating-point values. However, later
logic for handling split arguments may take over, in which case we lose
the previous information and set the 'Indirect' info, thus incorrectly
lowering to integer types.
I don't believe that we would have come across the notion of split
floating-point arguments before. This patch addresses the issue by
updating the lowering so that split arguments are only passed indirectly
when they are scalar integer types.
This has some change to how we lower some larger illegal float vectors,
as can be seen in 'fastcc-float.ll' where the vector is now passed
partly in registers and partly on the stack.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D102852
This relands a6ca88e908b5befcd9b0f8c8cb40f53095cc17bc which was originally
reverted due to overflow bugs in e3fa2b1eab60342dc882b7b888658b03c472fa2b.
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.
The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.
For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.
Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104921
Allow MIMG instructions to be selected with 6/7 VGPRs for vaddr.
Previously these were rounded up to VReg_256 this saves VGPRs.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D103800
Disable null export (for kills) when a frontend defines a pixel
shader as not exporting using amdgpu-color-export and
amdgpu-depth-export function attrbutes.
This allows the generation of export free pixel shaders.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D105683
This is SCC pass, moving it to the end of SCC PM saves one
Function PM. This needs the analysis to take into account
memory access width since it is now places after the
load/store optimizer (D105651).
Differential Revision: https://reviews.llvm.org/D105652
These had
```
.clampScalar(0, s1, 64)
.widenScalarToNextPow2(0, 8)
```
If you have s2 or s4, then `widenScalarToNextPow2` does nothing.
This changes the `widenScalarToNextPow2` rule to use s8 as the minimum type
instead, allowing us to correctly widen s2 and s4.
This does not impact s1, since it's marked as legal already.
Differential Revision: https://reviews.llvm.org/D106413
A function with less memory instructions but wider access
is the same as a function with more but narrower accesses
in terms of memory boundness. In fact the pass would give
different answers before and after vectorization without
this change.
Differential Revision: https://reviews.llvm.org/D105651
The existing rule about the operand type is strange. Instead, just say
the operand is a TargetConstant with the right width. (Legalization
ignores TargetConstants, so it doesn't matter if that width is legal.)
Highlights:
1. I had to substantially rewrite the AArch64 isel patterns to expect a
TargetConstant. Nothing too exotic, but maybe a little hairy. Maybe
worth considering a target-specific node with some dagcombines instead
of this complicated nest of isel patterns.
2. Our behavior on RV32 for vectors of i64 has changed slightly. In
particular, we correctly preserve the width of the arithmetic through
legalization. This changes the DAG a bit. Maybe room for
improvement here.
3. I explicitly defined the behavior around overflow. This is necessary
to make the DAGCombine transforms legal, and I don't think it causes any
practical issues.
Differential Revision: https://reviews.llvm.org/D105673
Replace the experimental clang builtins and LLVM intrinsics for these
instructions with normal instruction selection patterns. The wasm_simd128.h
intrinsics header was already using portable code for the corresponding
intrinsics, so now it produces the correct instructions.
Differential Revision: https://reviews.llvm.org/D106400
ML64.EXE applies implicit RIP-relative addressing only to memory references that include a named-variable reference.
Reviewed By: mstorsjo
Differential Revision: https://reviews.llvm.org/D105372
This patch is in a series of patches to provide
builtins for compatibility with the XL compiler.
This patch adds builtins related to floating point
operations
Reviewed By: #powerpc, nemanjai, amyk, NeHuang
Differential Revision: https://reviews.llvm.org/D103986
The killed flag is not always set. E.g. when a variable is used in a
loop, it is never marked as killed, although it is unused in following
basic blocks. Also, we try to deprecate kill flags and not use them.
Check if the register is live in the endif block. If not, consider it
killed in the then and else blocks.
The vgpr-liverange tests have two new tests with loops
(pre-committed, so the diff is visible).
I also needed to change the subtarget to gfx10.1, otherwise calls
are not working.
Differential Revision: https://reviews.llvm.org/D106291
Rename getBufferOffsetForMMO to updateBufferMMO and pass in the MMO to
be updated, in preparation for the bug fix in D106284.
Call updateBufferMMO consistently for all buffer intrinsics, even the
ones that use setBufferOffsets to decompose a combined offset
expression.
Add a getIdxEn helper function.
Differential Revision: https://reviews.llvm.org/D106354
In mandatory tail calling conventions we might have to deallocate stack
space used by our arguments before return. This happens after popping
CSRs, so the pop cannot be turned into the return itself in this case.
The else branch here was already a nop, so removing it as a tidy-up.
We have SelectionDAG patterns for 8 & 16-bit atomic operations, but they
assume the value types will have been legalized to 32-bits. So this adds
the ability to widen them to both AArch64 & generic GISel
infrastructure.
This patch adds the mova instruction to insert/extract an SVE vector
register to/from a ZA tile vector.
The preferred MOV aliases are also implemented.
Depends on D105572.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06
Reviewed By: david-arm, CarolineConcatto
Differential Revision: https://reviews.llvm.org/D105574
If a CMOV is in a loop and is converted to branches, CMOV conversion wouldn't
add newly created basic blocks to loop info. Since the candidates is collected
based on loops, instructions in these basic blocks will be ignored.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D104623
Implemented builtins for mtmsr, mfspr, mtspr on PowerPC;
the patch is intended for XL Compatibility.
Differential revision: https://reviews.llvm.org/D106130
This patch implements store, load, move from and to registers related
builtins, as well as the builtin for stfiw. The patch aims to provide
feature parady with xlC on AIX.
Differential revision: https://reviews.llvm.org/D105946
Add manual selection code similar to the code in AArch64ISelDAGToDAG, and add
`createTuple` helpers similar to the code there as well.
This accounted for around 111 fallbacks while building clang for AArch64 with
GlobalISel.
This also should make it easy to add selection code for other store
intrinsics.
As a minor cleanup, this uses `createQTuple` in the other place where we use
REG_SEQUENCE.
Differential Revision: https://reviews.llvm.org/D106332
Basically two parts to this fix:
1. Stop using AtomicExpand to expand cmpxchg i128
2. Fix AArch64ExpandPseudoInsts to use a correct expansion.
From ARM architecture reference:
To atomically load two 64-bit quantities, perform a Load-Exclusive
pair/Store-Exclusive pair sequence of reading and writing the same value
for which the Store-Exclusive pair succeeds, and use the read values
from the Load-Exclusive pair.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51102
Differential Revision: https://reviews.llvm.org/D106039
This patch is in a series of patches to provide builtins for compatibility
with the XL compiler. This patch add the builtin and emit target independent
code for __cmpb.
Reviewed By: nemanjai, #powerpc
Differential revision: https://reviews.llvm.org/D105194
If we need to shift left anyway we might be able to take advantage
of LUI implicitly shifting its immediate left by 12 to cover part
of the shift. This allows us to use more bits of the LUI immediate
to avoid an ADDI.
isDesirableToCommuteWithShift now considers compressed instruction
opportunities when deciding if commuting should be allowed.
I believe this is the same or similar to one of the optimizations
from D79492.
Reviewed By: luismarques, arcbbb
Differential Revision: https://reviews.llvm.org/D105417
Replace some existing isel patterns that are covered by the new
code. SLLIUWPat has been removed in favor of folding its root case
into the new code. The other uses in isel patterns for shXadd.uw
have been switched to using hardcoded AND masks.
This is based on the original version of D49585 from ARM. The final
version of that was made a DAG combine, but I've chosen to keep it
as custom isel. I'm not convinced DAG combine is as good with
shift pairs as it is with and+shift. I saw some issues optimizing
the shifts created by vscale lowering if an and isn't created for
from a shift pair.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106230
ACC registers are a combination of four consecutive vector registers.
If the vector registers are assigned first this often forces a number
of copies to appear just before the ACC register is created. If the ACC
register is assigned first then fewer copies are generated when the vector
registers are assigned.
This patch tries to force the register allocator to assign the ACC registers first
and then the UACC registers and then the vector pair registers. It does this
by changing the priority of the register classes.
This patch also adds hints to help the register allocator assign UACC registers from
known ACC registers and vector pair registers from known UACC registers.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D105854
I don't think the semantics of the llvm masked gather intrinsic care
about the order the elements are loaded. For example, type legalization
by splitting will chain them in parallel. This is different than
scatter which we do chain in order.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D106025
We were using auto instead of auto* in a number of places which failed the llvm-qualified-auto check.
Additionally we were using auto in some places where the type wasn't immediately obvious - the style guide rule of thumb is only to use auto from casts etc. where the type is already explicitly stated.
First, collect the register usage in each function, then apply the
maximum register usage of all functions to functions with indirect
calls.
This is more accurate than guessing the maximum register usage without
looking at the actual usage.
As before, assume that indirect calls will hit a function in the
current module.
Differential Revision: https://reviews.llvm.org/D105839