mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-11-25 04:02:41 +01:00
Commit Graph

21745 Commits

Author SHA1 Message Date
Malcolm Jestadt
c725f494c9 X86: Avoid converting EVEX to VEX when disp8 would be beneficial
Saves around 2% code size
2022-05-27 15:29:35 +03:00
Nekotekina
318b8fe374 X86: fixup matchPMADDWD_3 2021-11-18 12:01:17 +03:00
Nekotekina
1cc7bdd501 X86: improve (V)PMADDWD detection (2)
Implement "full" pattern.
2021-11-16 13:50:49 +03:00
Nekotekina
610c27aa1c X86: disable AVX512 truncate with saturation instructions
These are not very useful in RPCS3; when they are selected, pessimizations occur.
2021-11-16 13:45:26 +03:00
Nekotekina
c9fceef173 X86: fixup (V)PMADDWD detection
Fix some bugs (missing checks).
Add constant support.
2021-11-02 17:42:39 +03:00
Nekotekina
f7d625e31a X86: improve (V)PMADDWD detection
In combineMulToPMADDWD, the optimization can sometimes also be applied when
the top 17 bits are sign bits, not just zero bits.
For now, detect SRA pairs and replace them with SRL.
2021-11-02 17:42:39 +03:00
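
A minimal scalar sketch of why the relaxed condition is sound (the numSignBits helper below is a stand-in for illustration, not LLVM's ComputeNumSignBits): a 32-bit value with at least 17 sign bits is exactly an int16_t, so the signed 16x16->32 multiply that PMADDWD performs agrees with the full 32-bit multiply.

    #include <cassert>
    #include <cstdint>

    // Stand-in for ComputeNumSignBits: how many of the top bits are copies
    // of the sign bit (including the sign bit itself).
    static int numSignBits(int32_t v) {
      int n = 1;
      for (int i = 30; i >= 0 && ((v >> i) & 1) == int((uint32_t)v >> 31); --i)
        ++n;
      return n;
    }

    int main() {
      const int32_t vals[] = {0, 1, -1, 127, -128, 32767, -32768, 40000, -40000};
      for (int32_t x : vals)
        for (int32_t y : vals) {
          if (numSignBits(x) < 17 || numSignBits(y) < 17)
            continue; // the combine would not fire on these
          // >= 17 sign bits means the value fits in int16_t, so the
          // 16x16->32 product matches the original 32-bit multiply.
          assert(int32_t(int16_t(x)) * int32_t(int16_t(y)) == x * y);
        }
    }
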
Nekotekina
c36b21c023 X86: modify PreserveAll CC to save full AVX-512 state 2021-11-02 17:42:39 +03:00
Nekotekina
548daf04b5 X86: avoid vector-scalar shifts if the splat amount is directly a vector ADD/SUB/AND op.
Prefer vector-vector shifts if available (AVX2+); otherwise a shuffle plus a
slower vector-scalar shift would be generated.
Improves code generated for rotate and funnel shifts.
2021-11-02 17:42:39 +03:00
Nekotekina
d5bc359dfd X86: add patterns for X86ISD::VSHLV and X86ISD::VSRLV
Replace VSELECT instructions which zero their result when the legal SHL/SRL shift amount is exceeded.
2021-11-02 17:42:39 +03:00
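
A scalar sketch of the pattern being matched (assumed names, not the DAG code): AVX2's variable shifts already produce 0 when the per-lane amount is out of range, so the guarding VSELECT is redundant.

    #include <cassert>
    #include <cstdint>

    // Per-lane behaviour of vpsllvd: out-of-range counts yield 0.
    static uint32_t vshlv_lane(uint32_t x, uint32_t amt) {
      return amt < 32 ? x << amt : 0;
    }

    // The IR-level pattern: the shift result is unspecified for amt >= 32
    // (modelled here as garbage), and a select zeroes it in that case.
    static uint32_t guarded_shl(uint32_t x, uint32_t amt) {
      uint32_t shifted = amt < 32 ? x << amt : 0xDEADBEEFu; // "don't care"
      return amt < 32 ? shifted : 0;                        // the VSELECT
    }

    int main() {
      for (uint32_t amt = 0; amt < 64; ++amt)
        assert(guarded_shl(0x12345678u, amt) == vshlv_lane(0x12345678u, amt));
    }
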
Nekotekina
bed700114e X86: add pattern for X86ISD::VSRAV
Detect clamping of the ashr shift amount to the maximum legal value.
2021-11-02 17:42:39 +03:00
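
A scalar sketch of why the clamp maps onto X86ISD::VSRAV (a per-lane model, not the lowering code): vpsravd fills a lane with the sign bit once the count exceeds 31, which is exactly what shifting by the clamped amount produces.

    #include <cassert>
    #include <cstdint>

    // Per-lane behaviour of vpsravd: an oversized count gives the sign fill.
    static int32_t vsrav_lane(int32_t x, uint32_t amt) {
      if (amt > 31)
        return x < 0 ? -1 : 0;
      return x >> amt;
    }

    int main() {
      const int32_t vals[] = {0, 1, -1, 123456, -123456, INT32_MIN, INT32_MAX};
      for (uint32_t amt = 0; amt < 64; ++amt)
        for (int32_t x : vals) {
          uint32_t clamped = amt > 31 ? 31 : amt; // the pattern being detected
          assert(vsrav_lane(x, amt) == (x >> clamped));
        }
    }
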
Nekotekina
2ffa82223f X86: expand detectAVGPattern()
Allow all integer widths in the pattern, and allow ashr.
Handle signed and mixed cases, allowing the truncation to be replaced.
2021-11-02 17:42:39 +03:00
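
A scalar sketch of the unsigned byte case (a reference model, not detectAVGPattern itself): the widened rounding average the pattern matches is what PAVGB computes per lane, and the result always fits back into 8 bits, which is why the truncation can be dropped.

    #include <cassert>
    #include <cstdint>

    // Per-lane behaviour of pavgb: average rounded up, computed without overflow.
    static uint8_t pavgb_lane(uint8_t a, uint8_t b) {
      return uint8_t((uint16_t(a) + uint16_t(b) + 1) >> 1);
    }

    int main() {
      for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b) {
          // The IR pattern: zext, add, add 1, lshr, trunc.
          uint32_t wide = (uint32_t(a) + uint32_t(b) + 1) >> 1;
          assert(wide <= 0xFF);                       // truncation is lossless
          assert(pavgb_lane(uint8_t(a), uint8_t(b)) == uint8_t(wide));
        }
    }
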
Nekotekina
5ff8f4151c X86: optimize VSELECT for v16i8 with shl + sign bit test 2021-11-02 17:42:39 +03:00
Nekotekina
4743d020ce X86: LowerShift: new algorithm for vector-vector shifts
Emit a pair of double-width shifts when possible.
2021-11-02 17:42:39 +03:00
Nekotekina
d18817ded9 X86: Fix/workaround Small Code Model for JIT
Force RIP-relative jump tables and global values
These were causing crashes due to the use of absolute addressing.
2021-11-02 17:42:39 +03:00
Simon Pilgrim
aed4e7449f [X86] combineX86ShuffleChain - ensure we only peek through bitcasts to vectors (PR51858)
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.

(cherry picked from commit dcba99418438ec1d624ad207674234bd2e9e3394)
2021-09-20 11:22:27 -07:00
Elliot Saba
a967752a95 [X86] Don't clobber EBX in stackprobes
On X86, the stackprobe emission code chooses the `R11D` register, which
is illegal on i686.  This ends up wrapping around to `EBX`, which does
not get properly callee-saved within the stack probing prologue,
clobbering the register for the callers.

We fix this by explicitly using `EAX` as the stack probe register.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D109203

(cherry picked from commit ae8507b0df738205a6b9e3795ad34672b7499381)
2021-09-10 09:30:52 -07:00
Simon Pilgrim
45d26b8826 [X86][AVX] Extract SUBV_BROADCAST constant bits from just the lower subvector range (PR51281)
As reported on PR51281, an internal fuzz test encountered an issue when extracting constant bits from a SUBV_BROADCAST node from a constant pool source larger than the broadcasted subvector width.

getTargetConstantBitsFromNode was assuming that the Constant would be the same size as the subvector, resulting in incorrect packing of the per-element bits data.

This patch attempts to solve this by using the SUBV_BROADCAST node to determine the subvector width, and then ensuring we extract only the lowest bits from Constant of that subvector bitsize.

Differential Revision: https://reviews.llvm.org/D107158

(cherry picked from commit 18e6a03b1a15b2661259af15ae604b4c4850cd61)
2021-08-18 12:15:46 -07:00
Andrea Di Biagio
681b643c07 [X86][SchedModel] Add missing ReadAdvance for some arithmetic ops (PR51318 and PR51322).
This fixes a bug where implicit uses of EFLAGS were not marked as ReadAdvance in
the RM/MR variants of ADC/SBB (PR51318)

This also fixes the absence of ReadAdvance for the register operand of
RMW arithmetic instructions (PR51322).

Differential Revision: https://reviews.llvm.org/D107367

(cherry picked from commit 7a1a35a1d1ae2e69769505c9f39910067c53d53b)
2021-08-11 21:40:03 -07:00
Xiang1 Zhang
a6d5003afd [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT
Differential Revision: https://reviews.llvm.org/D106780
2021-07-28 08:16:59 +08:00
Xiang1 Zhang
5d447ad589 Revert "[X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT"
This reverts commit 6ff73efea94621e74642e4d7a15cc86a5fb6d411.
2021-07-28 08:12:29 +08:00
Xiang1 Zhang
409f0eedd6 [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT 2021-07-28 08:08:30 +08:00
Sanjay Patel
11aa71a71d [x86] update stale code comment; NFC
The transform was generalized with:
1ce05ad619a5
2021-07-27 16:45:52 -04:00
Tres Popp
05308b27ea Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4278afb0005923f6dff595cd372da5c.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD

This is not a perfect revert. The new function is left as other uses of
it exist now.
2021-07-27 16:55:50 +02:00
Tres Popp
d809395fde Revert "Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.""
This reverts commit d7bbb1230a94cb239aa4a8cb896c45571444675d.

There were follow up uses of a deleted method and I didn't run the
tests. Undo the revert, so I can do it properly.
2021-07-27 16:48:31 +02:00
Tres Popp
9d32182a3a Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4278afb0005923f6dff595cd372da5c.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD
2021-07-27 16:22:25 +02:00
Simon Pilgrim
96fce2fe63 [X86][AVX] Prefer vinsertf128 to vperm2f128 on AVX1 targets
Splatting the lower xmm with vinsertf128 is at least as quick as vperm2f128, and a lot faster on some AMD targets.

First step towards PR50053
2021-07-26 11:11:56 +01:00
David Sherwood
2d2e4a1b17 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
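
To make the distinction concrete, here is a small illustrative comparison (not taken from the patch) of the two reduction shapes for floats; without reassociation only the in-order form is a faithful lowering, and the two can produce different results:

    #include <cstdio>
    #include <vector>

    // Strict (in-order) reduction: one fadd per lane, serialized.
    static float orderedReduce(const std::vector<float> &v) {
      float acc = 0.0f;
      for (float x : v)
        acc += x;
      return acc;
    }

    // Tree-wise reduction: repeatedly add the upper half onto the lower half.
    // Assumes a power-of-two number of lanes, as with a legal vector type.
    static float treeReduce(std::vector<float> v) {
      while (v.size() > 1) {
        size_t half = v.size() / 2;
        for (size_t i = 0; i < half; ++i)
          v[i] += v[i + half];
        v.resize(half);
      }
      return v[0];
    }

    int main() {
      std::vector<float> v = {1e8f, 1.0f, -1e8f, 1.0f};
      std::printf("ordered: %g  tree-wise: %g\n", orderedReduce(v), treeReduce(v));
      // Prints different values, which is why reassociation must be allowed
      // before the cheaper tree-wise form can be used.
    }
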
Simon Pilgrim
a499e7c2d0 [X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.
Begin replacing individual getMemIntrinsicNode calls and setup (for X86ISD::VBROADCAST_LOAD + X86ISD::SUBV_BROADCAST_LOAD opcodes) with this getBROADCAST_LOAD helper.
2021-07-25 20:37:58 +01:00
Simon Pilgrim
4b99bcdef6 [X86][SSE] LowerRotate - perform modulo on the amount splat source directly.
If the rotation amount is a known splat, perform the modulo on the splat source, and then perform the splat. That way the amount-extension performed later by LowerScalarVariableShift can fold the splats away without any multiple-use issues.

Fixes one of the concerns raised on D104156
2021-07-25 17:30:32 +01:00
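
A tiny sketch of why the rewrite is legal (4 lanes of assumed 32-bit elements, a hand-written model rather than LowerRotate): the implicit modulo on a rotate amount commutes with the splat, so it can be applied to the splat source before splatting.

    #include <array>
    #include <cassert>
    #include <cstdint>

    int main() {
      for (uint32_t amt = 0; amt < 256; ++amt) {
        std::array<uint32_t, 4> splatThenMod, modThenSplat;
        splatThenMod.fill(amt);
        for (uint32_t &lane : splatThenMod)
          lane &= 31;                      // modulo applied after the splat
        modThenSplat.fill(amt & 31);       // modulo applied to the splat source
        assert(splatThenMod == modThenSplat);
      }
    }
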
Sanjay Patel
f3223b51e0 [x86] improve CMOV codegen by pushing add into operands, part 2
This is a minimal extension of D106607 to allow folding for
2 non-zero constants that can be materialized as immediates.

In the reduced test examples, we save 1 instruction by rolling
the constants into LEA/ADD. In the motivating test from the bullet
benchmark, we absorb both of the constant moves into add ops via
LEA magic, so we reduce by 2 instructions.

Differential Revision: https://reviews.llvm.org/D106684
2021-07-25 10:05:41 -04:00
Simon Pilgrim
2aaf1231f6 [X86][AVX] Adjust AllowBWIVPERMV3 tolerance to account for VariableCrossLaneShuffleDepth
As noticed on D105390 - we were hardwiring the depth limit for combining to VPERMI2W/VPERMI2B instructions. Not only had we made the limit too low, we hadn't accounted for slow/fast shuffles via the VariableCrossLaneShuffleDepth control
2021-07-25 14:05:11 +01:00
Craig Topper
da55d61e8f [X86] Fix a bug in TEST with immediate creation
This code tries to form a TEST from CMP+AND with an optional
truncate in between. If we looked through the truncate, we may
have extra bits in the AND mask that shouldn't participate in
the checks. Normally SimplifyDemandedBits takes care of this, but
the AND may have another user. So manually mask out any extra bits.

Fixes PR51175.

Differential Revision: https://reviews.llvm.org/D106634
2021-07-23 09:03:53 -07:00
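
A scalar sketch of the bug with hand-picked values (not the ISel code): the IR compares a truncated AND against zero, so when the AND has another user and its mask keeps bits above the truncated width, those bits must be cleared from the TEST immediate.

    #include <cassert>
    #include <cstdint>

    int main() {
      uint32_t x = 0x100;       // only a bit above the truncated width is set
      uint32_t mask = 0x1FF;    // AND mask with a bit above bit 7, multi-use

      bool ir        = (uint8_t)(x & mask) == 0;       // cmp (trunc (and x, C)), 0
      bool badTest   = (x & mask) == 0;                // test x, C        (wrong)
      bool fixedTest = (x & (mask & 0xFF)) == 0;       // test x, C & 0xFF (right)

      assert(ir == fixedTest);
      assert(ir != badTest);    // keeping the extra mask bits flips the result
    }
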
Sanjay Patel
845cd9f16b [x86] improve CMOV codegen by pushing add into operands
This is not the transform direction we want in general,
but by the time we have a CMOV, we've already tried
everything else that could be better.
The transform increases the uses of the other add operand,
but that is safe according to Alive2:
https://alive2.llvm.org/ce/z/Yn6p-A

We could probably extend this to other binops (not just add).
This is the motivating pattern discussed in:
https://llvm.org/PR51069

The test with i8 shows a missed fold because there's a trunc
sitting in front of the add. That can be handled with a small
follow-up.

Differential Revision: https://reviews.llvm.org/D106607
2021-07-23 09:39:32 -04:00
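
The identity behind the transform, written out as a scalar check (the Alive2 link above covers the general proof); selecting between the two sums is what lets the adds become LEAs feeding a plain CMOV:

    #include <cassert>
    #include <cstdint>

    // add of a CMOV result
    static int64_t before(bool cond, int64_t a, int64_t b, int64_t c) {
      return (cond ? a : b) + c;
    }

    // CMOV of two adds (each add is LEA-friendly)
    static int64_t after(bool cond, int64_t a, int64_t b, int64_t c) {
      return cond ? (a + c) : (b + c);
    }

    int main() {
      const int64_t vals[] = {-9, -1, 0, 3, 7, 100};
      for (int cond = 0; cond <= 1; ++cond)
        for (int64_t a : vals)
          for (int64_t b : vals)
            for (int64_t c : vals)
              assert(before(cond != 0, a, b, c) == after(cond != 0, a, b, c));
    }
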
Simon Pilgrim
2e0db61480 [X86][AVX] lowerV2X128Shuffle - attempt to recognise broadcastf128 subvector load
As noticed on PR50053 we were failing to recognise when a shuffle of a load was really a subvector broadcast load
2021-07-23 13:10:38 +01:00
Simon Pilgrim
c4477aac03 [CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.
2021-07-22 20:07:32 +01:00
Simon Pilgrim
815b215830 [X86] Fix SLM FP<->INT throughputs.
Noticed while trying to clean up the shift cost model for SSE4 targets using the script in D103695 - SLM double-pumps all the 128-bit vector conversion ops and only uses the FP0 pipe - numbers taken from Intel AOM + Agner.
2021-07-22 19:39:04 +01:00
Simon Pilgrim
8371f55768 [CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.
2021-07-22 18:12:49 +01:00
Eric Astor
be0514d8cd [ms] [llvm-ml] Restrict implicit RIP-relative addressing to named-variable references
ML64.EXE applies implicit RIP-relative addressing only to memory references that include a named-variable reference.

Reviewed By: mstorsjo

Differential Revision: https://reviews.llvm.org/D105372
2021-07-21 11:49:58 -04:00
Tianqing Wang
5f5c9808cd [X86] Update MachineLoopInfo in CMOV conversion.
If a CMOV is in a loop and is converted to branches, CMOV conversion wouldn't
add newly created basic blocks to loop info. Since the candidates are collected
based on loops, instructions in these basic blocks will be ignored.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D104623
2021-07-21 10:53:46 +08:00
Simon Pilgrim
9e495cac81 [X86] X86InstCombineIntrinsic.cpp - silence clang-tidy warnings about incorrect uses of auto. NFCI.
We were using auto instead of auto* in a number of places which failed the llvm-qualified-auto check.

Additionally we were using auto in some places where the type wasn't immediately obvious - the style guide's rule of thumb is to use auto only for casts etc. where the type is already explicitly stated.
2021-07-20 13:37:45 +01:00
Simon Pilgrim
5157b0c187 [X86] Fix case of IsAfterLegalize argument. NFC.
Pulled out of D106280
2021-07-19 17:15:28 +01:00
Jeremy Morse
6df259fb43 [InstrRef][X86] Drop debug instruction numbers from x87 instructions
Avoid a crash when using instruction referencing if x87 floating point
instructions are used. These instructions are significantly mutated when
they're rewritten from referring to registers, to referring to
floating-point-stack positions. As a result, their operands are re-ordered,
and (InstrRef) LiveDebugValues asserts when it sees a DBG_INSTR_REF
referring to a non-reg non-def register operand.

To fix this, drop the instruction numbers, and thus variable locations.
This patch adds a helper utility to do that.

Dropping the variable locations is sub-optimal, but DBG_VALUEs applied to
the $fp0 and similar registers are dropped on emission too. It seems we've
never done well at describing variables that live in x87 registers, at all.

Differential Revision: https://reviews.llvm.org/D105657
2021-07-19 15:08:27 +01:00
Eli Friedman
5fd061997c [X86] Remove incorrect use of known bits in shuffle simplification.
This reverts commit 2a419a0b9957ebac9e11e4b43bc9fbe42a9207df.

The result of a shufflevector must not propagate poison from any element
other than the one noted in the shuffle mask.

The regressions outside of fptoui-may-overflow.ll can probably be
recovered some other way; for example, using isGuaranteedNotToBePoison.

See discussion on https://reviews.llvm.org/D106053 for more background.

Differential Revision: https://reviews.llvm.org/D106222
2021-07-18 18:13:11 -07:00
Simon Pilgrim
1d449cdfe0 [X86][SSE] matchShuffleWithPACK - avoid poison pollution from bitcasting multiple elements together.
D106053 exposed that we had not been taking into account that bitcasting smaller elements together and then performing ComputeKnownBits on the result allows a poison element to influence neighbouring elements used in the pack. Instead we now peek through any existing bitcast to ensure that the source type already matches the source width of the pack node we're trying to match.

This has also been a chance to stop matchShuffleWithPACK creating unused nodes on the fly which could affect oneuse tests during shuffle lowering/combining.

The only regression we're seeing is due to being unable to peek through a bitcast as it's on the other side of an extract_subvector - which should go away once we finally allow shuffle combining across different vector widths (by making matchShuffleWithPACK use const SelectionDAG& we've gotten closer to this - see PR45974).
2021-07-18 14:25:28 +01:00
Simon Pilgrim
6a5b1b579f [X86][SSE] combineX86ShufflesRecursively - bail if constant folding fails due to oneuse limits.
Fixes an issue reported on D105827 where a single shuffle of a constant (with multiple uses) was caught in an infinite loop: one shuffle (UNPCKL) used an undef arg, but that then got recombined to SHUFPS because the constant value had its own undef that confused matching.
2021-07-16 19:21:46 +01:00
Guozhi Wei
7d6ba24baf [X86FixupLEAs] Try again to transform the sequence LEA/SUB to SUB/SUB
This patch transforms the sequence
    lea (reg1, reg2), reg3
    sub reg3, reg4
to two sub instructions
    sub reg1, reg4
    sub reg2, reg4

Similar optimization can also be applied to LEA/ADD sequence.

The modification to TwoAddressInstructionPass ensures the operands of the ADD
instruction have the expected order (the dest register of the LEA should be the
src register of the ADD).

Differential Revision: https://reviews.llvm.org/D104684
2021-07-16 10:16:03 -07:00
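
The arithmetic identity the transform relies on, written as a scalar check with wrap-around values (this illustrates only the algebra, not the EFLAGS or register-order handling the pass does):

    #include <cassert>
    #include <cstdint>

    int main() {
      const uint64_t vals[] = {0, 1, 0x7FFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull};
      for (uint64_t reg1 : vals)
        for (uint64_t reg2 : vals)
          for (uint64_t reg4 : vals)
            // reg4 - (reg1 + reg2), as computed by the LEA/SUB pair, equals
            // the result of the two replacement SUBs, even with wrap-around.
            assert(reg4 - (reg1 + reg2) == (reg4 - reg1) - reg2);
    }
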
Harald van Dijk
f675df37ba [X86] Fix handling of maskmovdqu in X32
The maskmovdqu instruction is an odd one: it has a 32-bit and a 64-bit
variant, the former using EDI, the latter RDI, but the use of the
register is implicit. In 64-bit mode, a 0x67 prefix can be used to get
the version using EDI, but there is no way to express this in
assembly in a single instruction; the only way is with an explicit
addr32.

This change adds support for the instruction. When generating assembly
text, that explicit addr32 will be added. When not generating assembly
text, it will be kept as a single instruction and will be emitted with
that 0x67 prefix. When parsing assembly text, it will be re-parsed as
ADDR32 followed by MASKMOVDQU64, which still results in the correct
bytes when converted to machine code.

The same applies to vmaskmovdqu as well.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103427
2021-07-15 22:56:08 +01:00
Simon Pilgrim
32c30ddee7 [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
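
A scalar model of the formula above (the cvttps2si helper below mimics the documented out-of-range result 0x80000000; it is not the SelectionDAG lowering): for inputs in [0, 2^32) the OR/AND/sign-shift combination reconstructs the correct unsigned value.

    #include <cassert>
    #include <cstdint>

    // Truncating float->i32 convert; out-of-range inputs give the
    // "integer indefinite" value 0x80000000, as CVTTPS2SI does.
    static int32_t cvttps2si(float x) {
      if (!(x >= -2147483648.0f && x < 2147483648.0f))
        return INT32_MIN;
      return (int32_t)x;
    }

    static uint32_t fp_to_uint(float x) {
      int32_t small = cvttps2si(x);
      int32_t big   = cvttps2si(x - 2147483648.0f);      // CVTTPS2SI(x - 2^31)
      return (uint32_t)(small | (big & (small >> 31)));  // sign of small selects
    }

    int main() {
      const float tests[] = {0.0f, 1.5f, 2147483520.0f,     // largest float < 2^31
                             2147483648.0f, 4294967040.0f}; // largest float < 2^32
      for (float x : tests)
        assert(fp_to_uint(x) == (uint32_t)(double)x);
    }
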
Simon Pilgrim
920f3945dc [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result (REAPPLIED)
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))

Reapplied rGe4aa6ad13216 with fix for intrinsic variants of the opcode which uses a vector return type
2021-07-13 12:31:09 +01:00
Vitaly Buka
790e45b85d Revert "[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result"
Fails here https://lab.llvm.org/buildbot/#/builders/37/builds/5267

This reverts commit e4aa6ad132164839a4a97dff0d433ea4766f77f1.
2021-07-12 22:26:54 -07:00