1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-10-19 19:12:56 +02:00
Commit Graph

16440 Commits

Author SHA1 Message Date
Craig Topper
bbe3b80de6 [X86] Reverse the operand order of invlpga in at&t syntax to match gas.
llvm-svn: 325190
2018-02-14 23:53:21 +00:00
Craig Topper
1184db4a1c [X86] Don't swap argument on BOUND instruction in at&t syntax.
The bound instruction does not have reversed operands in gas.

Fixes PR27653.

Patch by Maya Madhavan.

Differential Revision: https://reviews.llvm.org/D43243

llvm-svn: 325178
2018-02-14 21:54:58 +00:00
Simon Pilgrim
0d2a33eaff [X86][SSE] truncateVectorWithPACK - Use src type instead of dst to select between PACK*SDW/PACK*SWB
Try to keep PACK*SDW/PACK*SWB as wide as possible, this helps ComputeNumSignBits as it can only peek through bitcasts to wider types, pre-AVX2 codegen was already doing this as it could peek through bitcasts/subvectors more easily than AVX2 could through shuffles.

This shouldn't affect existing results as calls to truncateVectorWithPACK ensure we have enough sign bits to pack to the same value, but it should make it possible to use truncateVectorWithPACK chains to perform saturation in combineTruncateWithSat with a future patch.

llvm-svn: 325149
2018-02-14 18:23:58 +00:00
Paul Robinson
222ff14ae7 [DWARF] Fix incorrect prologue end line record.
The prologue-end line record must be emitted after the last
instruction that is part of the function frame setup code and before
the instruction that marks the beginning of the function body.

Patch by Carlos Alberto Enciso!

Differential Revision: https://reviews.llvm.org/D41762

llvm-svn: 325143
2018-02-14 17:35:52 +00:00
Simon Pilgrim
0438b3f961 Fix GCC -Wlogical-op-parentheses warning. NFCI.
llvm-svn: 325129
2018-02-14 15:07:36 +00:00
Lama Saba
cc71753981 [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Change-Id: Ic41aa9ade6512e0478db66e07e2fde41b4fb35f9
llvm-svn: 325128
2018-02-14 14:58:53 +00:00
Simon Pilgrim
ba2f6e6405 [X86][SSE] Relax type legality for combineTruncateWithSat PACKSS/PACKUS truncation
While the AVX512 VTRUNCS/VTRUNCUS instructions require legal types, truncateVectorWithPACK handles cases with multiples of legal types through splitting/concatenation. So we just need to ensure that the src/dst scalar types are correct and leave truncateVectorWithPACK to handle the rest of it.

llvm-svn: 325127
2018-02-14 14:14:29 +00:00
Reid Kleckner
71205eed30 [X86] Remove dead code from retpoline thunk generation
Follow-up to r325049

llvm-svn: 325085
2018-02-14 00:24:29 +00:00
Reid Kleckner
f0fe7234f4 [X86] Use EDI for retpoline when no scratch regs are left
Summary:
Instead of solving the hard problem of how to pass the callee to the indirect
jump thunk without a register, just use a CSR. At a call boundary, there's
nothing stopping us from using a CSR to hold the callee as long as we save and
restore it in the prologue.

Also, add tests for this mregparm=3 case. I wrote execution tests for
__llvm_retpoline_push, but they never got committed as lit tests, either
because I never rewrote them or because they got lost in merge conflicts.

Reviewers: chandlerc, dwmw2

Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D43214

llvm-svn: 325049
2018-02-13 20:47:49 +00:00
Craig Topper
3990a6fdd4 [X86] Add combine to shrink 64-bit ands when one input is an any_extend and the other input guarantees upper 32 bits are 0.
Summary: This gets the shift case from PR35792.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43222

llvm-svn: 325018
2018-02-13 16:25:25 +00:00
Craig Topper
198308e988 [X86] Teach EVEX->VEX pass to turn VRNDSCALE into VROUND when bits 7:4 of the immediate are 0 and the regular EVEX->VEX checks pass.
Bits 7:4 control the scale part of the operation. If the scale is 0 the behavior is equivalent to VROUND.

Fixes PR36246

llvm-svn: 324985
2018-02-13 04:19:26 +00:00
Craig Topper
08dead3212 [X86] Use getTypeAction in most places that were checking ExperimentalVectorWideningLegalization.
This will allow more flexibility in what types we legalize via widening or not. This should help with a couple lines in D41062.

llvm-svn: 324980
2018-02-13 01:49:58 +00:00
Craig Topper
58364b39ac [X86] Simplify X86DAGToDAGISel::matchBEXTRFromAnd by creating an X86ISD::BEXTR node and calling Select. Add isel patterns to recognize this node.
This removes a bunch of special case code for selecting the immediate and folding loads.

llvm-svn: 324939
2018-02-12 21:18:11 +00:00
Craig Topper
012709d17d [X86] Remove unused multiclass argument. NFC
llvm-svn: 324938
2018-02-12 21:18:09 +00:00
Simon Pilgrim
c7c90124a2 [X86] Add missing scheduling class tag for i64 absolute address moves
Expand existing SchedRW to encompass these like it did for the other memory offset movs - added comments to closing braces to keep track of def scopes.

We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).

llvm-svn: 324910
2018-02-12 17:21:28 +00:00
Simon Pilgrim
00452705e4 [X86][AVX512] Add missing scheduling class tag for KMOVB/KMOVW/KMOVD/KMOVQ moves/loads/stores.
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).

llvm-svn: 324905
2018-02-12 16:59:04 +00:00
Simon Pilgrim
2c978151de [X86][AVX512] Add missing scheduling class tag for VMOVQ/VMOVHLPS/VMOVLHPS/VMOVHPD/VMOVHPS/VMOVLPD/VMOVLPS
Tag AVX512 variants to match SSE/AVX originals.

We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).

llvm-svn: 324901
2018-02-12 16:18:36 +00:00
Simon Pilgrim
09469e72e3 [X86] Tag CET-IBT instruction scheduler classes
llvm-svn: 324898
2018-02-12 15:57:00 +00:00
Simon Pilgrim
c527dc7f40 [X86][MMX] Add missing scheduling class tag for EMMS/FEMMS
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).

AMD targets can perform these a lot quicker than WriteMicrocoded so will need an override in the models.

llvm-svn: 324897
2018-02-12 15:52:59 +00:00
Hans Wennborg
0264d64e8e Revert r324835 "[X86] Reduce Store Forward Block issues in HW"
It asserts building Chromium; see PR36346.

(This also reverts the follow-up r324836.)

> If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
> A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
> The estimated penalty for a store forward block is ~13 cycles.
>
> This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
> of a load and a store.
>
> The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
> breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

llvm-svn: 324887
2018-02-12 12:43:39 +00:00
Craig Topper
8259f5e976 [X86] Don't look for TEST instruction shrinking opportunities when the root node is a X86ISD::SUB.
I don't believe we ever create an X86ISD::SUB with a 0 constant which is what the TEST handling needs. The ternary operator at the end of this code shows up as only going one way in the llvm-cov report from the bots.

llvm-svn: 324865
2018-02-12 03:02:02 +00:00
Craig Topper
544fea01ab [X86] Remove check for X86ISD::AND with no flag users from the TEST instruction immediate shrinking code.
We turn X86ISD::AND with no flag users back to ISD::AND in PreprocessISelDAG.

llvm-svn: 324864
2018-02-12 03:02:01 +00:00
Craig Topper
08505b384a [X86] Change some compare patterns to use loadi8/loadi16/loadi32/loadi64 helper fragments.
This enables CMP8mi to fold zextloadi8i1 which in all tests allows us to avoid creating a TEST8rr that peephole can't fold.

llvm-svn: 324863
2018-02-12 02:48:42 +00:00
Craig Topper
e53e2913c0 [X86] Add KADD X86ISD opcode instead of reusing ISD::ADD.
ISD::ADD implies individual vector element addition with no carries between elements. But for a vXi1 type that would be the same as XOR. And we already turn ISD::ADD into ISD::XOR for all vXi1 types during lowering. So the ISD::ADD pattern would never be able to match anyway.

KADD is different, it adds the elements but also propagates a carry between them. This just a way of doing an add in k-register without bitcasting to the scalar domain. There's still no way to match the pattern, but at least its not obviously wrong.

llvm-svn: 324861
2018-02-12 01:33:38 +00:00
Craig Topper
f31386f462 [X86] Allow zextload/extload i1->i8 to be folded into instructions during isel
Previously we just emitted this as a MOV8rm which would likely get folded during the peephole pass anyway. This just makes it explicit earlier.

The gpr-to-mask.ll test changed because the kaddb instruction has no memory form.

llvm-svn: 324860
2018-02-12 01:33:36 +00:00
Craig Topper
bf1d141721 [X86] Remove MASK_BINOP intrinsic type. NFC
llvm-svn: 324858
2018-02-11 22:32:30 +00:00
Craig Topper
00e4bc826d [X86] Remove dead code from getMaskNode that looked for a i64 mask with a maskVT that wasn't v64i1. NFC
llvm-svn: 324857
2018-02-11 22:32:29 +00:00
Craig Topper
da1bc8da2f [X86] Remove LowerBoolVSETCC_AVX512, we get this with a target independent DAG combine now. NFC
llvm-svn: 324856
2018-02-11 22:32:27 +00:00
Simon Pilgrim
b203875de8 [X86][SSE] Use SplitBinaryOpsAndApply to recognise PSUBUS patterns before they're split on AVX1
This needs to be generalised further to support AVX512BW cases but I want to add non-uniform constants first.

llvm-svn: 324844
2018-02-11 17:29:42 +00:00
Craig Topper
598a86fcc7 [X86] Use min/max for vector ult/ugt compares if avoids a sign flip.
Summary:
Currently we only use min/max to help with ule/uge compares because it removes an invert of the result that would otherwise be needed. But we can also use it for ult/ugt compares if it will prevent the need for a sign bit flip needed to use pcmpgt at the cost of requiring an invert after the compare.

I also refactored the code so that the max/min code is self contained and does its own return instead of setting up a flag to manipulate the rest of the function's behavior.

Most of the test cases look ok with this. I did notice that we added instructions when one of the operands being sign flipped is a constant vector that we were able to constant fold the flip into.

I also noticed that sometimes the SSE min/max clobbers a register that is needed after the compare. This resulted in an extra move being inserted before the min/max to preserve the register. We could try to detect this and switch from min to max and change the compare operands to use the operand that gets reused in the compare.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42935

llvm-svn: 324842
2018-02-11 17:11:40 +00:00
Simon Pilgrim
4f6d54b20e [X86][SSE] Moved SplitBinaryOpsAndApply earlier so more methods can use it. NFCI.
llvm-svn: 324841
2018-02-11 17:01:43 +00:00
Simon Pilgrim
779929ce3a [X86][SSE] Enable SMIN/SMAX/UMIN/UMAX custom lowering for all legal types
This allows us to recognise more saturation patterns and also simplify some MINMAX codegen that was failing to combine CMPGE comparisons to a legal CMPGT.

Differential Revision: https://reviews.llvm.org/D43014

llvm-svn: 324837
2018-02-11 10:52:37 +00:00
Lama Saba
06e4700602 [X86] Reduce Store Forward Block issues in HW
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Change-Id: I620b6dc91583ad9a1444591e3ddc00dd25d81748
llvm-svn: 324835
2018-02-11 09:34:12 +00:00
Craig Topper
8a41bbcdfd [X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits and 512 bits aren't required
This patch adds a new function attribute "required-vector-width" that can be set by the frontend to indicate the maximum vector width present in the original source code. The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc. The backend will then use this information to determine if its save to make 512-bit vectors illegal when the preference is for 256-bit vectors.

For code that has no vectors in it originally and only get vectors through the loop and slp vectorizers this allows us to generate code largely similar to our AVX2 only output while still enabling AVX512 features like mask registers and gather/scatter. The loop vectorizer doesn't always obey TTI and will create oversized vectors with the expectation the backend will legalize it. In order to avoid changing the vectorizer and potentially harm our AVX2 codegen this patch tries to make the legalizer behavior similar.

This is restricted to CPUs that support AVX512F and AVX512VL so that we have good fallback options to use 128 and 256-bit vectors and still get masking.

I've qualified every place I could find in X86ISelLowering.cpp and added tests cases for many of them with 2 different values for the attribute to see the codegen differences.

We still need to do frontend work for the attribute and teach the inliner how to merge it, etc. But this gets the codegen layer ready for it.

Differential Revision: https://reviews.llvm.org/D42724

llvm-svn: 324834
2018-02-11 08:06:27 +00:00
Craig Topper
ea72946e88 [X86] Remove setOperationAction lines for promoting vXi1 SINT_TO_FP/UINT_TO_FP.
We promote these via a DAG combine now before lowering gets the chance.

Also remove the v2i1 custom handling since it will no longer be triggered.

llvm-svn: 324833
2018-02-11 07:44:33 +00:00
Craig Topper
cb3958d56e [X86] Remove some redundant qualifications from the setOperationAction blocks. NFC
These were added as part of the refactoring for prefer vector width. At the time I thought the hasAVX512 here would be replaced with "allow 512 bit vectors" so that it would read "allow 512 bit vectors OR VLX". But now the plan is to only give the option of disabling 512 bit vectors when VLX is enabled. So we don't need this qualification at all

llvm-svn: 324831
2018-02-11 03:07:19 +00:00
Craig Topper
362df7b20a [X86] Change signatures of avx512 packed fp compare intrinsics to return a vXi1 mask type to be closer to an fcmp.
Summary:
This patch changes the signature of the avx512 packed fp compare intrinsics to return a vXi1 vector and no longer take a mask as input. The casts to scalar type will now need to be explicit in the IR. The masking node will now be an explicit and in the IR.

This makes the intrinsic look much more similar to an fcmp instruction that we wish we could use for these but can't. We already use icmp instructions for integer compares.

Previously the lowering step of isel would turn the intrinsic into an X86 specific ISD node and a emit the masking nodes as well as some bitcasts. This means DAG combines can't see the vXi1 type until somewhat late, making it more difficult to combine out gpr<->mask transition sequences. By exposing the vXi1 type explicitly in the IR and initial SelectionDAG we give earlier DAG combines and even InstCombine the chance to see it and optimize it.

This should make any issues with gpr<->mask sequences the same between integer and fp. Meaning we only have to fix them once.

Reviewers: spatel, delena, RKSimon, zvi

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43137

llvm-svn: 324827
2018-02-10 23:33:55 +00:00
Simon Pilgrim
7401969779 [X86][SSE] Increase PMULLD costs to better match hardware
Until Skylake, most hardware could only issue a PMULLD op every other cycle

llvm-svn: 324823
2018-02-10 19:27:10 +00:00
Craig Topper
918de73309 [X86] Custom legalize (v2i32 (setcc (v2f32))) so that we don't end up with a (v4i1 (setcc (v4f32)))
Undef VLX, getSetCCResultType returns v2i1/v4i1 for v2f32/v4f32 so default type legalization will end up changing the setcc result type back to vXi1 if it had been extended. The resulting extend gets messed up further by type legalization and is difficult to recombine back to (v4i32 (setcc (v4f32))) after legalization.

I went ahead and enabled this for SSE2 and later since its always the result we want and this helps type legalization get there in less steps.

llvm-svn: 324822
2018-02-10 19:12:58 +00:00
Craig Topper
45fbe5faa5 [X86] Extend inputs with elements smaller than i32 to sint_to_fp/uint_to_fp before type legalization.
This prevents extends of masks being introduced during lowering where it become difficult to combine them out.

There are a few oddities in here.

We sometimes concatenate two k-registers produced by two compares, sign_extend the combined pair, then extract two halves. This worked better previously because the sign_extend wasn't created until after the fp_to_sint was split which led to a split sign_extend being created.

We probably also need to custom type legalize (v2i32 (sext v2i1)) via widening.

llvm-svn: 324820
2018-02-10 17:58:58 +00:00
Craig Topper
6403919bfe [X86] Custom legalize (v2i1 (fp_to_uint/fp_to_sint v2f64)) without AVX512VL.
Strangely the code was already present, just the setOperationAction wasn't being called without VLX.

llvm-svn: 324806
2018-02-10 08:39:31 +00:00
Craig Topper
4735eaf00e [X86] Legalize zero extends from vXi1 to vXi16/vXi32/vXi64 using a sign extend and a shift.
This avoids a constant pool load to create 1.

The int->float are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization.

llvm-svn: 324805
2018-02-10 08:06:52 +00:00
Craig Topper
849912b820 [X86] Teach combineExtSetcc to handle ZERO_EXTEND by widening the setcc and then masking. A later DAG combine will convert to a shift.
This helps to avoid a constant pool load needed to zero extend from the mask.

llvm-svn: 324804
2018-02-10 08:06:49 +00:00
Nirav Dave
8ce2bc977c [DAG] Make early exit hasPredecessorHelper return true. NFCI.
All uses conservatively assume in early exit case that it will be a
predecessor. Changing default removes checking code in all uses.

llvm-svn: 324797
2018-02-10 02:41:22 +00:00
Craig Topper
e10a87bee0 [X86] Teach combineInsertSubvector how to combine some k-register insert_subvectors and extract_subvector sequences to remove extra zeroing.wq
llvm-svn: 324791
2018-02-10 01:00:41 +00:00
Craig Topper
13cd406305 [X86] Teach lower1BitVectorShuffle to recognize shuffles that are just filling upper elements with zero. Replace with insert_subvector.
There's still some extra kshifts in one of the modified test cases here, but hopefully that's only a DAG combine away.

llvm-svn: 324782
2018-02-09 23:32:27 +00:00
Francis Visoiu Mistrih
a26c70efb8 [X86][MC] Fix assembling rip-relative addressing + immediate displacements
In the rare case where the input contains rip-relative addressing with
immediate displacements, *and* the instruction ends with an immediate,
we encode the instruction in the wrong way:

movl $12345678, 0x400(%rdi) // all good, no rip-relative addr
movl %eax, 0x400(%rip) // all good, no immediate at the end of the instruction
movl $12345678, 0x400(%rip) // fails, encodes address as 0x3fc(%rip)

Offset is a label:

movl $12345678, foo(%rip)

we want to account for the size of the immediate (in this case,
$12345678, 4 bytes).

Offset is an immediate:

movl $12345678, 0x400(%rip)

we should not account for the size of the immediate, assuming the
immediate offset is what the user wanted.

Differential Revision: https://reviews.llvm.org/D43050

llvm-svn: 324772
2018-02-09 21:47:07 +00:00
Craig Topper
f73d5c8428 [X86] Simplify some code in lowerV4X128VectorShuffle and lowerV2X128VectorShuffle
Previously we extracted two subvectors and concatenate. But the concatenate will be lowered to two insert subvectors. Then DAG combine will merge once of the inserts and one of the extracts back into the original vector. We might as well just directly use one extract and one insert.

llvm-svn: 324710
2018-02-09 05:54:36 +00:00
Craig Topper
1b91f8cb57 [X86] Teach shuffle lowering to recognize 128/256 bit insertions into a zero vector.
This regresses a couple cases in the shuffle combining test. But those cases use intrinsics that InstCombine knows how to turn into a generic shuffle earlier. This should give opportunities to fold this earlier in InstCombine or DAG combine.

llvm-svn: 324709
2018-02-09 05:54:34 +00:00
Alexander Ivchenko
d75045427e [GlobalISel][X86] Fixing failures after https://reviews.llvm.org/D37775
The patch essentially makes sure that X86CallLowering adds proper
G_COPY/G_TRUNC and G_ANYEXT/G_COPY when we are doing lowering of
arguments/returns for floating point values passed on registers.

Tests are updated accordingly

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D42287

llvm-svn: 324665
2018-02-08 22:41:47 +00:00