mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-10-19 19:12:56 +02:00
Commit Graph

16469 Commits

Author SHA1 Message Date
Simon Pilgrim
1d50b0c236 [X86][SSE] Don't coalesce v4i32 extracts
We currently coalesce v4i32 extracts from all 4 elements to 2 v2i64 extracts + shifts/sign-extends.

This seems to have been added back in the days when we tended to spill vectors and reload scalars, or ended up with repeated shuffles moving everything down to the 0'th index. I don't think either of these is likely these days, as we have better EXTRACT_VECTOR_ELT and VECTOR_SHUFFLE handling, and the existing code tends to get in the way of various vector and load combines.

Differential Revision: https://reviews.llvm.org/D42308

llvm-svn: 323541
2018-01-26 17:11:34 +00:00
Simon Pilgrim
a0702cbb69 [X86][SSE] Drop PMADDWD in lowerMul
As mentioned in D42258, we don't need this any more.

llvm-svn: 323540
2018-01-26 16:57:36 +00:00
Simon Pilgrim
6d1803055d [X86] Cleanup SDLoc arguments as mentioned on D42544
llvm-svn: 323526
2018-01-26 14:00:01 +00:00
Hiroshi Inoue
7f54536b89 [NFC] fix trivial typos in comments and documents
"in in" -> "in", "on on" -> "on" etc.

llvm-svn: 323508
2018-01-26 08:15:29 +00:00
Craig Topper
e73fcd8dbc [X86] Remove dead code from LowerBUILD_VECTOR that tried to handle i64 element type in 32-bit mode.
Type legalization would prevent any i64 operands to the build_vector from existing before we get here. The coverage bots show this code as uncovered.

llvm-svn: 323506
2018-01-26 07:30:44 +00:00
Craig Topper
8f6dce08e9 [X86] Remove code from combineBitcastvxi1 that was needed to support the previous native IR for kunpck intrinsics.
The original autoupgrade for kunpck intrinsics used a bitcasted scalar shift, or, and. This combine would turn this into a concat_vectors. Now the kunpck intrinsics are autoupgraded to a vector shuffle that will become a concat_vectors.

llvm-svn: 323504
2018-01-26 07:15:21 +00:00
Craig Topper
b59c317ba9 [X86] Remove unused intrinsic type handling. NFC
llvm-svn: 323503
2018-01-26 07:15:20 +00:00
Craig Topper
6e530b76fa [X86] Simplify condition in VSETCC. NFC
This listed all legal 128-bit integer types individually, but since we already know we have a legal type and it's an integer, we can just check is128BitVector.

llvm-svn: 323502
2018-01-26 07:15:18 +00:00
Craig Topper
55f9d42328 [X86] Remove LowerVSETCC code for handling vXi1 setcc with vXi8/vXi16 input type. NFC
These kinds of setccs are promoted by a DAG combine before they ever get to legalization.

llvm-svn: 323501
2018-01-26 07:15:17 +00:00
Craig Topper
f85eadaa6c [X86] Remove some dead code from LowerVSETCC. NFC
This code was added in r321967, but ultimately I fixed the issue in the legalizer and this code was no longer required.

llvm-svn: 323500
2018-01-26 07:15:16 +00:00
Serguei Katkov
7d96f22167 [X86] Fix killed flag handling in X86FixupLea pass
When the pass creates a MOV instruction for the
lea (%base,%index,1), %dst => mov %base,%dst; add %index,%dst
transformation, it should clear the killed flag for %base
if %base is equal to %index.

Otherwise the verifier complains about the use of a killed register in the add instruction.

Reviewers: lsaba, zvi, zansari, aaboud
Reviewed By: lsaba
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42522

llvm-svn: 323497
2018-01-26 04:49:26 +00:00
Craig Topper
4a4cc9aaaf [X86] Teach Intel syntax InstPrinter to print lock prefixes that have been parsed from the asm parser.
The asm parser puts the lock prefix in the MCInst flags so we need to check that in addition to TSFlags. This matches what the ATT printer does.

llvm-svn: 323469
2018-01-25 21:23:57 +00:00
Craig Topper
59b93a878f [X86] Combine two unnecessarily complicated ifs that had the same body. NFC
llvm-svn: 323468
2018-01-25 21:23:51 +00:00
Simon Pilgrim
d65b7cc7d9 [X86] Apply clang-format to detectUSatPattern. NFCI.
Cleanup from D42544

llvm-svn: 323439
2018-01-25 16:38:56 +00:00
Craig Topper
ce1999a1e1 [X86] Expand IMUL/MUL instregexs in Intel scheduler models. Add load latency to some of them in SkylakeClient model.
The regular expressions and the imul names caused some instructions to be matched by multiple regexs, creating unpredictable results.

This changes them all to use explicit instrs instead.

While doing this I also found that some instructions in Skylake were missing load latency so I fixed that too.

llvm-svn: 323406
2018-01-25 06:57:42 +00:00
Craig Topper
3a3e63c149 [X86] Expand IMUL/MUL instregexs in Znver1 scheduler to show what's actually implemented.
The IMUL instruction names mixed with the prefix matching of the instregex led to some strange matches. The worst was that several memory instructions were using the register-form latency.

I don't know what the right answer is, so I've left TODOs and will try to work with the AMD folks to get this cleaned up.

llvm-svn: 323405
2018-01-25 06:57:39 +00:00
Craig Topper
fae7861fa0 [X86] Name the MMX phaddd instruction with 3 Ds instead of just 2. NFC
llvm-svn: 323403
2018-01-25 04:45:32 +00:00
Craig Topper
cc7c2fbcc1 [X86] Remove 64/128/256 from MMX/SSE/AVX instruction names for overall consistency. NFC
MMX instructions all start with MMX_ so the 64 isn't needed for disambiguation.
SSE/AVX1 instructions are assumed 128-bit so we don't need to say 128.
AVX2 instructions should use a Y to indicate 256-bits.

llvm-svn: 323402
2018-01-25 04:45:30 +00:00
Craig Topper
0a6f713634 [X86] Remove unnecessary '_alt' and '_Int' from scheduler model regular expressions.
These were treated as optional suffixes, but the regular expressions are already prefix matches so this is unnecessary. It breaks the binary search optimization in tablegen due to the top level question mark.

llvm-svn: 323401
2018-01-25 04:45:28 +00:00
Simon Pilgrim
966ca434a4 [X86][SSE] Aggressively use PMADDWD for v4i32 multiplies with 17 or more leading zeros
As discussed in D41484, PMADDWD for 'zero extended' vXi32 is nearly always a better option than PMULLD:
On SNB it will result in code that isn't any faster, but not any slower so we may as well keep it.
On KNL it only has half the throughput, so I've disabled it on there - ideally there'd be a better way than this.
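
As a worked example of that leading-zero bound (my illustration, not part of the patch): with at least 17 leading zeros each 32-bit operand fits in 15 bits, so the signed 16x16->32 multiplies that PMADDWD performs cannot overflow, and the horizontal add contributes a zero term, leaving the exact product.

```
#include <cassert>
#include <cstdint>

int main() {
  uint32_t a = 0x5FFF, b = 0x7ABC; // both < 2^15, i.e. >= 17 leading zeros
  // One PMADDWD lane: the low 16-bit halves multiplied as signed values,
  // horizontally added with the (zero) high-half product.
  int32_t lane = (int32_t)(int16_t)(a & 0xFFFF) * (int16_t)(b & 0xFFFF) +
                 (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
  assert((uint32_t)lane == a * b); // exact 32-bit product
  return 0;
}
```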

Differential Revision: https://reviews.llvm.org/D42258

llvm-svn: 323367
2018-01-24 19:20:02 +00:00
Craig Topper
838dbb5ab9 [X86] Fix some inconsistencies in the itineraries and Sched for (V)PEXTRW/(V)PINSRW
The weirdest was that PEXTRWrr was tagged as a memory operation.

llvm-svn: 323353
2018-01-24 17:58:57 +00:00
Craig Topper
34467739ab [X86] Adjust names of PINSRW/PEXTRW instructions between MMX/SSE/AVX/AVX512 for consistency and to maybe enable more regular expression compaction in the scheduler models. NFCI
llvm-svn: 323352
2018-01-24 17:58:51 +00:00
Craig Topper
650a89f354 [X86] Remove '(_REV)?' from a bunch of scheduler regular expressions. NFC
The regexs are treated as a prefix match already so the checking for optional text at the end provides no value. Instead it prevents the binary search optimization in tablegen from kicking in due to the top level question mark.

llvm-svn: 323351
2018-01-24 17:58:42 +00:00
Simon Pilgrim
300651f958 [X86][SSE] Avoid calls to combineX86ShufflesRecursively that can't combine to target shuffles (PR32037)
Don't bother making recursive calls to combineX86ShufflesRecursively if we have more shuffle source operands than will be combined together with the remaining recursive depth.

See https://bugs.llvm.org/show_bug.cgi?id=32037#c26 and https://bugs.llvm.org/show_bug.cgi?id=32037#c27 for the reduction in compile times from this patch.

Differential Revision: https://reviews.llvm.org/D42378

llvm-svn: 323320
2018-01-24 11:41:09 +00:00
Craig Topper
6fe5253cf8 [X86] Move 'Y' to correct place in FMA4 regular expression in Znver1 scheduler model.
I think these instructions used to be named differently and the regular expression reflected that. I guess we must have correct itinerary information that made this not matter for the scheduler test?

llvm-svn: 323305
2018-01-24 05:32:51 +00:00
Craig Topper
dba9cc357a [X86] Rename 256-bit VFRCZ instructions to have the Y before the rr/rm to match other instructions. NFC
llvm-svn: 323304
2018-01-24 05:14:39 +00:00
Craig Topper
3e881e0a09 [X86] Remove redundant regular expression from the Znver1 scheduler model. NFC
llvm-svn: 323303
2018-01-24 05:14:33 +00:00
Craig Topper
06e7908feb [X86] Use ISD::SIGN_EXTEND instead of X86ISD::VSEXT for mask to xmm/ymm/zmm conversion
There are a couple tricky things with this patch.

I had to add an override of isVectorLoadExtDesirable to stop DAG combine from combining sign_extend with loads after legalization since we legalize sextload using a load+sign_extend. Overriding this hook actually prevents a lot of sextloads from being created in the first place.

I also had to add isel patterns because DAG combine blindly combines sign_extend+truncate to a smaller sign_extend which defeats what legalization was trying to do.

Differential Revision: https://reviews.llvm.org/D42407

llvm-svn: 323301
2018-01-24 04:51:17 +00:00
Zvi Rackover
dd726895b3 X86: Update isVectorShiftByScalarCheap with cases covered by AVX512BW
Summary:
AVX512BW adds support for variable shift amount for 16-bit element
vectors.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: RKSimon

Subscribers: rengolin, tschuett, llvm-commits

Differential Revision: https://reviews.llvm.org/D42437

llvm-svn: 323292
2018-01-24 01:36:40 +00:00
Craig Topper
a0af84799e [X86] Merge some regular expressions in Zen scheduler model and remove 2 unused classes.
I don't know if the unused classes were intended to be used, or whether the VEX version is really different from the legacy SSE version. Agner's tables don't show any differences. I'm just cleaning up assuming the current behavior is correct.

llvm-svn: 323263
2018-01-23 21:37:56 +00:00
Craig Topper
19ec2f868b [X86] Remove 'Int_' from instregexs in Zen scheduler model.
No instructions have Int_ at the beginning. It's always at the end now, so it should be picked up as a prefix match.

llvm-svn: 323262
2018-01-23 21:37:54 +00:00
Craig Topper
9c69554132 [X86] Move 'Int_' to the end of the name of the VCOMISS/VUCOMISS instructions to get them picked up by the scheduler model regexs.
All other intrinsic instructions put the _Int on the end. This makes these instructions consistent and gets the prefix instregexs in the scheduler models to pick them up.

llvm-svn: 323261
2018-01-23 21:37:51 +00:00
Simon Pilgrim
a95413137c [X86][AVX] LowerBUILD_VECTORAsVariablePermute - add support for VPERMILPV to v2i64/v2f64
Minor refactor to make it possible for LowerBUILD_VECTORAsVariablePermute to be used with a wider variety of shuffle ops and types.

I'd have liked to add v4i32/v4f32 support as well but we don't see v4i32 index extractions at the moment (which is why I created D42308)

After this I intend to begin adding scaling support for PSHUFB (v8i16, v4i32, v2i64) and VPERMPS (v4f64, v4i64).

Differential Revision: https://reviews.llvm.org/D42431

llvm-svn: 323260
2018-01-23 21:33:24 +00:00
Simon Pilgrim
f32ef5b2b9 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - extract subvector from oversized index vectors
llvm-svn: 323223
2018-01-23 17:02:15 +00:00
Craig Topper
367173313a [X86] Rewrite vXi1 element insertion by using a vXi1 scalar_to_vector and inserting into a vXi1 vector.
The existing code was already doing something very similar to subvector insertion so this allows us to remove the nearly duplicate code.

This patch is a little larger than it should be due to differences in the DQI handling between the two today.

llvm-svn: 323212
2018-01-23 15:56:36 +00:00
Simon Pilgrim
7c297560ac [X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the source vector is not larger than the destination
We might be able to support this in the future with VPERMV3, OR(PSHUFB, PSHUFB) etc.

llvm-svn: 323210
2018-01-23 15:51:03 +00:00
Simon Pilgrim
1c3fb2e215 Use EVT::changeVectorElementTypeToInteger() to convert index type to integer
llvm-svn: 323207
2018-01-23 15:30:07 +00:00
Simon Pilgrim
38cee7f608 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the index vector has the correct number of elements
llvm-svn: 323206
2018-01-23 15:13:37 +00:00
Craig Topper
759b479987 [X86] Legalize v32i1 without BWI via splitting to v16i1 rather than the default of promoting to v32i8.
Summary:
For the most part it's better to keep v32i1 as a mask type of a narrower width than trying to promote it to a ymm register.

I had to add some overrides to the methods that get the types for the calling convention so that we still use v32i8 for argument/return purposes.

There are still some regressions in here. I definitely saw some around shuffles. I think we should probably move vXi1 shuffle handling from lowering to a DAG combine, where the extend and truncate we have to emit could be better combined.

I think we also need a DAG combine to remove trunc from (extract_vector_elt (trunc))

Overall this removes something like 13000 CHECK lines from lit tests.

Reviewers: zvi, RKSimon, delena, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42031

llvm-svn: 323201
2018-01-23 14:25:39 +00:00
Craig Topper
407f25d133 [X86] Add missing MOVSX/MOVZX instructions to load folding tables.
I'm not sure there's any way to generate these folding cases, especially the movzx ones, since even the register form is never emitted by codegen.

I'm just adding them to remove the difference with the autogenerated version of the folding table.

llvm-svn: 323200
2018-01-23 14:09:22 +00:00
Simon Pilgrim
f78c62fe01 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - fix PSHUFB source/index operand ordering
As detailed in rL317463, PSHUFB (like most variable shuffle instructions) uses Op[0] for the source vector and Op[1] for the shuffle index vector; VPERMV works in reverse, which is probably where the confusion comes from.

Differential Revision: https://reviews.llvm.org/D42380

llvm-svn: 323190
2018-01-23 11:39:06 +00:00
Craig Topper
4281c34f62 [X86] Don't reorder (srl (and X, C1), C2) if (and X, C1) can be matched as a movzx
Summary:
If we can match as a zero extend there's no need to flip the order to get an encoding benefit, as movzx is 3 bytes with independent source/dest registers. The shortest 'and' we could make is also 3 bytes unless we get lucky in the register allocator and it's on AL/AX/EAX which have a 2 byte encoding.

This patch was more impressive before r322957 went in. It removed some of the same Ands that got deleted by that patch.

Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42313

llvm-svn: 323175
2018-01-23 05:45:52 +00:00
Craig Topper
ce008671ab [X86] Remove 'NOREX' comment from the printing of _NOREX instructions.
Some of the NOREX instructions are used in 32-bit mode making this printing confusing. It also doesn't provide a lot of value since you can see the h-register being used by the instruction.

llvm-svn: 323174
2018-01-23 05:37:00 +00:00
Craig Topper
daf5eacf49 [X86] Various vXi1 insertion improvements.
Add missing patterns for inserting v1i1 into a zero vector. Use insert_subvector to zero upper bits before inserting an element into a vXi1 vector. Replace the kshift-based isel pattern with an insert_subvector-based pattern now that the code that caused the pattern has been fixed to emit insert_subvector.

llvm-svn: 323173
2018-01-23 05:36:53 +00:00
Chandler Carruth
5c3f34f10b Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", one of the two halves of Spectre.
Summary:
First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description; please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processor's
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.
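
A minimal sketch of such a thunk, written here as GNU inline assembly for x86-64 (my illustration of the sequence described above; it assumes the target arrives in r11, as with the r11 thunk variant, and is not meant to be executed as-is):

```
extern "C" void retpoline_r11_sketch() {
  asm volatile("call 2f\n\t" // pushes the address of 1:, so the return
                             // predictor now points at the trap below
               "1:\n\t"
               "pause\n\t"   // speculative execution spins harmlessly here
               "lfence\n\t"
               "jmp 1b\n\t"
               "2:\n\t"
               "mov %%r11, (%%rsp)\n\t" // smash the return address
               "ret"                    // architecturally branches to *%r11
               ::: "memory");
}
```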

On 64-bit x86 ABIs, this is especially easy to do in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

We also support a target feature that disables emission of the retpoline
thunk by the compiler to allow for custom thunks if users want them.
These are particularly useful in environments like kernels that
routinely do hot-patching on boot and want to hot-patch their thunk to
different code sequences. They can write this custom thunk and use
`-mretpoline-external-thunk` *in addition* to `-mretpoline`. In this
case, on x86-64 the thunk name must be:
```
  __llvm_external_retpoline_r11
```
or on 32-bit:
```
  __llvm_external_retpoline_eax
  __llvm_external_retpoline_ecx
  __llvm_external_retpoline_edx
  __llvm_external_retpoline_push
```
And the target of the retpoline is passed in the named register, or in
the case of the `push` suffix on the top of the stack via a `pushl`
instruction.

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch `-mretpoline` is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with `-mretpoline` and additionally link the dynamic
executable and all shared libraries with LLD and pass `-z retpolineplt`
(or use similar functionality from some other linker). We strongly
recommend also using `-z now` as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually applying transformations similar to `-mretpoline` to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch-, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer

Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D41723

llvm-svn: 323155
2018-01-22 22:05:25 +00:00
Marina Yatsina
a62d1432a7 Fix bug in commit 323096 exposed by test in test-suite-verify-machineinstrs-x86_64h-O3
Change-Id: I0a4b10d0d6c8de606d989c567ec07944ae283a87
llvm-svn: 323126
2018-01-22 15:31:05 +00:00
Simon Pilgrim
327ffd50f0 [X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding (Reapplied)
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.

Reapplied after rL322279 was reverted at rL322335 due to PR35918, underlying issue was fixed at rL322644.

llvm-svn: 323104
2018-01-22 12:05:17 +00:00
Marina Yatsina
3cb5f18fb0 Break false dependencies for POPCNT, LZCNT, TZCNT
Add POPCNT, LZCNT, TZCNT to the list of instructions that have a false dependency on their destination register.
Add a test to make sure BreakFalseDeps breaks the dependencies for these instructions.
Update affected tests.

This fixes bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869

This is the final of multiple patches that fix this bugzilla.
Most of the patches are aimed at refactoring the existing code.

Reviews of the refactoring done to enable this change:
https://reviews.llvm.org/D40330
https://reviews.llvm.org/D40331
https://reviews.llvm.org/D40332
https://reviews.llvm.org/D40333

Differential Revision: https://reviews.llvm.org/D40334

Change-Id: If95cbf1a3f5c7dccff8f1b22ecb397542147303d
llvm-svn: 323096
2018-01-22 10:07:01 +00:00
Marina Yatsina
398a5b7b31 Separate LoopTraversal, ReachingDefAnalysis and BreakFalseDeps into their own files.
This is one of multiple patches that fix bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869
Most of the patches are aimed at refactoring the existing code.

Additional relevant reviews:
https://reviews.llvm.org/D40330
https://reviews.llvm.org/D40331
https://reviews.llvm.org/D40332
https://reviews.llvm.org/D40334

Differential Revision: https://reviews.llvm.org/D40333

Change-Id: Ie5f8eb34d98cfdfae23a3072eb69b5794f0e2d56
llvm-svn: 323095
2018-01-22 10:06:50 +00:00
Marina Yatsina
ccb430969b Rename ExecutionDepsFix files to ExecutionDomainFix
This is one of multiple patches that fix bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869
Most of the patches are aimed at refactoring the existing code.

Additional relevant reviews:
https://reviews.llvm.org/D40330
https://reviews.llvm.org/D40331
https://reviews.llvm.org/D40333
https://reviews.llvm.org/D40334

Differential Revision: https://reviews.llvm.org/D40332

Change-Id: I6a048cca7fdafbfc42fb1bac94343e483befded8
llvm-svn: 323094
2018-01-22 10:06:33 +00:00
Marina Yatsina
994e1aad23 Separate ExecutionDepsFix into 4 parts:
1. ReachingDefsAnalysis - Identifies, for each instruction, the “closest” reaching def of a given register (a toy sketch follows this list). Used by BreakFalseDeps (for clearance calculation) and ExecutionDomainFix (for arbitrating conflicting domains).
2. ExecutionDomainFix - Changes the variant of the instructions in order to minimize domain crossings.
3. BreakFalseDeps - Breaks false dependencies.
4. LoopTraversal - Creates a traversal order of the basic blocks that is optimal for loops (introduced in rL293571). Both ExecutionDomainFix and ReachingDefsAnalysis use this to determine the order they will traverse the basic blocks.
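
A toy, self-contained sketch of the "closest reaching def" idea (names and structure are my illustration, not the LLVM implementation):

```
#include <cstdio>
#include <vector>

struct Inst {
  std::vector<int> Uses; // register units read
  std::vector<int> Defs; // register units written
};

int main() {
  // Straight-line block: I0 defs r0; I1 uses r0, defs r1; I2 redefs r0;
  // I3 uses r0 and r1.
  std::vector<Inst> Block = {{{}, {0}}, {{0}, {1}}, {{}, {0}}, {{0, 1}, {}}};
  std::vector<int> LastDef(4, -1); // most recent def index per reg unit
  for (int i = 0; i < (int)Block.size(); ++i) {
    for (int U : Block[i].Uses) // "clearance" = distance to the reaching def
      std::printf("I%d: reaching def of r%d is I%d (clearance %d)\n", i, U,
                  LastDef[U], i - LastDef[U]);
    for (int D : Block[i].Defs)
      LastDef[D] = i;
  }
  return 0;
}
```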

This also included the following changes to ExecutionDepsFix's original logic:
1. BreakFalseDeps and ReachingDefsAnalysis logic is no longer restricted to a register class.
2. ReachingDefsAnalysis tracks liveness of reg units instead of reg indices into a given reg class.

Additional changes in affected files:
1. X86 and ARM targets now inherit from ExecutionDomainFix instead of ExecutionDepsFix. BreakFalseDeps was also added to the passes they activate.
2. Comments and references to ExecutionDepsFix replaced with ExecutionDomainFix and BreakFalseDeps, as appropriate.

Additional refactoring changes will follow.

This commit is (almost) NFC.
The only functional change is that now BreakFalseDeps will break dependencies for all register classes.
Since no additional instructions were added to the list of instructions that have false dependencies, there is no actual change yet.
In a future commit several instructions (and tests) will be added.

This is the first of multiple patches that fix bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869
Most of the patches are aimed at refactoring the existing code.

Additional relevant reviews:
https://reviews.llvm.org/D40331
https://reviews.llvm.org/D40332
https://reviews.llvm.org/D40333
https://reviews.llvm.org/D40334

Differential Revision: https://reviews.llvm.org/D40330

Change-Id: Icaeb75e014eff96a8f721377783f9a3e6c679275
llvm-svn: 323087
2018-01-22 10:05:23 +00:00
Craig Topper
293879ad3d [X86] Add an override of targetShrinkDemandedConstant to limit the damage that shrinkdemandedbits can do to zext_in_reg operations
Summary:
This patch adds an implementation of targetShrinkDemandedConstant that tries to keep shrinkdemandedbits from removing bits that would otherwise have been recognized as a movzx.

We still need a follow-up patch to stop moving ands across srl if the and could be represented as a movzx before the shift but not after. I think this should help with some of the cases that D42088 ended up removing during isel.
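
To illustrate the heuristic (my sketch, not the LLVM code): only the exact byte/word/dword masks correspond to a zero-extending move, so shrinking one of them trades a movzx for a real 'and'.

```
#include <cassert>
#include <cstdint>

static bool isMovzxMask(uint64_t M) {
  return M == 0xFF || M == 0xFFFF || M == 0xFFFFFFFF;
}

int main() {
  assert(isMovzxMask(0xFF));  // (x & 0xFF) can be emitted as movzbl
  assert(!isMovzxMask(0x7F)); // the shrunk mask needs an actual 'and'
  return 0;
}
```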

Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42265

llvm-svn: 323048
2018-01-20 18:50:09 +00:00
Simon Pilgrim
6320b01ed3 [X86][SSE] Check for out of bounds PEXTR/PINSR indices during faux shuffle combining.
llvm-svn: 323045
2018-01-20 17:16:01 +00:00
Craig Topper
eb91b245e3 [X86] Teach X86 codegen to use vector width preference to avoid promoting to 512-bit types when VLX is enabled and the preference is for a smaller size.
This change applies to places where we would turn 128/256-bit code into 512-bit in order to get a wider element type through sext/zext. Any 512-bit types that already existed in the IR/DAG will be left that way.

The width preference has no effect on codegen behavior when the target does not have AVX512 enabled. So AVX/AVX2 codegen cannot be limited via this mechanism yet.

If the preference is lower than 256 we may still use a 256-bit type to do the operation. Constraining to 128 bits makes it much more difficult to support some operations. For many of these cases we need to change element width while keeping element count constant, which is most easily done by switching between 256-bit and 128-bit types.

The preference is only obeyed when AVX512 and VLX are available. This means the preference is not obeyed for KNL, but is obeyed for SKX, Cannonlake, and Icelake. For KNL, the only way to do masked operations is on 512-bit registers so we would have to completely disable masking to obey the preference. We would also lose support for gather, scatter, ctlz, vXi64 multiplies, etc. This may change in the future, but this simplifies the initial implementation.

Differential Revision: https://reviews.llvm.org/D41895

llvm-svn: 323016
2018-01-20 00:26:12 +00:00
Craig Topper
bca85d8144 [X86] Add support for passing 'prefer-vector-width' function attribute into X86Subtarget and exposing via X86's getRegisterWidth TTI interface.
This will cause the vectorizers to do some limiting of the vector widths they create. This is not a strict limit; there are cases I know of where the loop vectorizer will still generate larger vectors.

I've written this in such a way that the interface will only return a properly supported width (0/128/256/512) even if the attribute says something funny like 384 or 10.
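
A minimal sketch of that clamping (the function name is hypothetical, and rounding down to the next supported width is my assumption about the behavior):

```
#include <cassert>

unsigned clampRegisterWidth(unsigned Preferred) {
  if (Preferred >= 512) return 512;
  if (Preferred >= 256) return 256;
  if (Preferred >= 128) return 128;
  return 0;
}

int main() {
  assert(clampRegisterWidth(384) == 256); // "funny" values degrade safely
  assert(clampRegisterWidth(10) == 0);
  return 0;
}
```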

This has been split from D41895 with the remainder in a follow up commit.

llvm-svn: 323015
2018-01-20 00:26:08 +00:00
Daniel Neilson
f59acc15ad Remove alignment argument from memcpy/memmove/memset in favour of alignment attributes (Step 1)
Summary:
 This is a resurrection of work first proposed and discussed in Aug 2015:
   http://lists.llvm.org/pipermail/llvm-dev/2015-August/089384.html
and initially landed (but then backed out) in Nov 2015:
   http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html

 The @llvm.memcpy/memmove/memset intrinsics currently have an explicit argument
which is required to be a constant integer. It represents the alignment of the
dest (and source), and so must be the minimum of the actual alignment of the
two.

 This change is the first in a series that allows source and dest to each
have their own alignments by using the alignment attribute on their arguments.

 In this change we:
1) Remove the alignment argument.
2) Add alignment attributes to the source & dest arguments. We, temporarily,
   require that the alignments for source & dest be equal.

 For example, code which used to read:
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %src, i32 100, i32 4, i1 false)
will now read
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %dest, i8* align 4 %src, i32 100, i1 false)

 Downstream users may have to update their lit tests that check for
@llvm.memcpy/memmove/memset call/declaration patterns. The following extended sed script
may help with updating the majority of your tests, but it does not catch all possible
patterns so some manual checking and updating will be required.

s~declare void @llvm\.mem(set|cpy|move)\.p([^(]*)\((.*), i32, i1\)~declare void @llvm.mem\1.p\2(\3, i1)~g
s~call void @llvm\.memset\.p([^(]*)i8\(i8([^*]*)\* (.*), i8 (.*), i8 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.memset.p\1i8(i8\2* \3, i8 \4, i8 \5, i1 \6)~g
s~call void @llvm\.memset\.p([^(]*)i16\(i8([^*]*)\* (.*), i8 (.*), i16 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.memset.p\1i16(i8\2* \3, i8 \4, i16 \5, i1 \6)~g
s~call void @llvm\.memset\.p([^(]*)i32\(i8([^*]*)\* (.*), i8 (.*), i32 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.memset.p\1i32(i8\2* \3, i8 \4, i32 \5, i1 \6)~g
s~call void @llvm\.memset\.p([^(]*)i64\(i8([^*]*)\* (.*), i8 (.*), i64 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.memset.p\1i64(i8\2* \3, i8 \4, i64 \5, i1 \6)~g
s~call void @llvm\.memset\.p([^(]*)i128\(i8([^*]*)\* (.*), i8 (.*), i128 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.memset.p\1i128(i8\2* \3, i8 \4, i128 \5, i1 \6)~g
s~call void @llvm\.memset\.p([^(]*)i8\(i8([^*]*)\* (.*), i8 (.*), i8 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.memset.p\1i8(i8\2* align \6 \3, i8 \4, i8 \5, i1 \7)~g
s~call void @llvm\.memset\.p([^(]*)i16\(i8([^*]*)\* (.*), i8 (.*), i16 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.memset.p\1i16(i8\2* align \6 \3, i8 \4, i16 \5, i1 \7)~g
s~call void @llvm\.memset\.p([^(]*)i32\(i8([^*]*)\* (.*), i8 (.*), i32 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.memset.p\1i32(i8\2* align \6 \3, i8 \4, i32 \5, i1 \7)~g
s~call void @llvm\.memset\.p([^(]*)i64\(i8([^*]*)\* (.*), i8 (.*), i64 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.memset.p\1i64(i8\2* align \6 \3, i8 \4, i64 \5, i1 \7)~g
s~call void @llvm\.memset\.p([^(]*)i128\(i8([^*]*)\* (.*), i8 (.*), i128 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.memset.p\1i128(i8\2* align \6 \3, i8 \4, i128 \5, i1 \7)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i8\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i8 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.mem\1.p\2i8(i8\3* \4, i8\5* \6, i8 \7, i1 \8)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i16\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i16 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.mem\1.p\2i16(i8\3* \4, i8\5* \6, i16 \7, i1 \8)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i32\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i32 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.mem\1.p\2i32(i8\3* \4, i8\5* \6, i32 \7, i1 \8)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i64\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i64 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.mem\1.p\2i64(i8\3* \4, i8\5* \6, i64 \7, i1 \8)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i128\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i128 (.*), i32 [01], i1 ([^)]*)\)~call void @llvm.mem\1.p\2i128(i8\3* \4, i8\5* \6, i128 \7, i1 \8)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i8\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i8 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.mem\1.p\2i8(i8\3* align \8 \4, i8\5* align \8 \6, i8 \7, i1 \9)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i16\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i16 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.mem\1.p\2i16(i8\3* align \8 \4, i8\5* align \8 \6, i16 \7, i1 \9)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i32\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i32 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.mem\1.p\2i32(i8\3* align \8 \4, i8\5* align \8 \6, i32 \7, i1 \9)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i64\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i64 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.mem\1.p\2i64(i8\3* align \8 \4, i8\5* align \8 \6, i64 \7, i1 \9)~g
s~call void @llvm\.mem(cpy|move)\.p([^(]*)i128\(i8([^*]*)\* (.*), i8([^*]*)\* (.*), i128 (.*), i32 ([0-9]*), i1 ([^)]*)\)~call void @llvm.mem\1.p\2i128(i8\3* align \8 \4, i8\5* align \8 \6, i128 \7, i1 \9)~g

 The remaining changes in the series will:
Step 2) Expand the IRBuilder API to allow creation of memcpy/memmove with differing
   source and dest alignments (a hedged sketch follows this list).
Step 3) Update Clang to use the new IRBuilder API.
Step 4) Update Polly to use the new IRBuilder API.
Step 5) Update LLVM passes that create memcpy/memmove calls to use the new IRBuilder API,
        and those that use MemIntrinsicInst::[get|set]Alignment() to use
        getDestAlignment() and getSourceAlignment() instead.
Step 6) Remove the single-alignment IRBuilder API for memcpy/memmove, and the
        MemIntrinsicInst::[get|set]Alignment() methods.
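
A hedged sketch of what the Step 2 call site might look like (this two-alignment CreateMemCpy signature is an assumption about the follow-up API, which has not landed in this commit):

```
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

void emitAlignedCopy(IRBuilder<> &B, Value *Dst, Value *Src, uint64_t Size) {
  // Separate source and dest alignments replace the old shared argument.
  B.CreateMemCpy(Dst, /*DstAlign=*/4, Src, /*SrcAlign=*/8, Size);
}
```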

Reviewers: pete, hfinkel, lhames, reames, bollu

Reviewed By: reames

Subscribers: niosHD, reames, jholewinski, qcolombet, jfb, sanjoy, arsenm, dschuff, dylanmckay, mehdi_amini, sdardis, nemanjai, david2050, nhaehnle, javed.absar, sbc100, jgravelle-google, eraman, aheejin, kbarton, JDevlieghere, asb, rbar, johnrusso, simoncook, jordy.potman.lists, apazos, sabuasal, llvm-commits

Differential Revision: https://reviews.llvm.org/D41675

llvm-svn: 322965
2018-01-19 17:13:12 +00:00
Sanjay Patel
e68c91d039 [x86] shrink 'and' immediate values by setting the high bits (PR35907)
Try to reverse the constant-shrinking that happens in SimplifyDemandedBits()
for 'and' masks when it results in a smaller sign-extended immediate.

We are also able to detect dead 'and' ops here (the mask is all ones). In
that case, we replace and return without selecting the 'and'.

Other targets might want to share some of this logic by enabling this under a
target hook, but I didn't see diffs for simple cases with PowerPC or AArch64,
so they may already have some specialized logic for this kind of thing or have
different needs.

This should solve PR35907:
https://bugs.llvm.org/show_bug.cgi?id=35907
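
A worked example of the reversal (my illustration): the shrunk mask 0x00000000FFFFFFF0 needs a 32-bit immediate, while restoring the undemanded high bits yields -16, which encodes as a sign-extended 8-bit immediate.

```
#include <cassert>
#include <cstdint>

static bool fitsInSExt8(int64_t I) { return I >= -128 && I <= 127; }

int main() {
  int64_t Shrunk = 0x00000000FFFFFFF0LL; // needs a 4-byte immediate
  int64_t Widened = (int64_t)((uint64_t)Shrunk | 0xFFFFFFFF00000000ULL);
  assert(!fitsInSExt8(Shrunk));
  assert(fitsInSExt8(Widened) && Widened == -16); // 1-byte immediate
  return 0;
}
```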

Differential Revision: https://reviews.llvm.org/D42088

llvm-svn: 322957
2018-01-19 16:37:25 +00:00
Nirav Dave
1bceaa6e21 [X86] Extend load-op-store fusion merge to ADC/SBB.
Summary: Add handling of EFLAG input to X86 Load-op-store fusion checking.

Reviewers: craig.topper, RKSimon

Subscribers: llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D42128

llvm-svn: 322952
2018-01-19 15:37:57 +00:00
Craig Topper
1b2403c0cf [X86] Make better use of instregex for cmovcc/setcc/jcc instructions in the Intel scheduler models.
Combine all the separate condition codes into a singular expression when possible.

llvm-svn: 322924
2018-01-19 05:47:32 +00:00
Craig Topper
1c2f80aef6 [X86] Add intrinsic support for the RDPID instruction
This adds a new intrinsic to support the rdpid instruction. The implementation is a bit weird because the intrinsic is defined as always returning 32 bits, but the assembler support thinks the instruction produces a 64-bit register in 64-bit mode. But really it zeros the upper 32 bits. So I had to add separate patterns where 64-bit mode uses an extract_subreg.

Differential Revision: https://reviews.llvm.org/D42205

llvm-svn: 322910
2018-01-18 23:52:31 +00:00
Craig Topper
672594bd71 [X86] Use vmovdqu64/vmovdqa64 for unmasked integer vector stores for consistency with loads.
Previously we used 64 for vXi64 stores and 32 for everything else. This change uses 64 for everything, just like we do for loads.

llvm-svn: 322820
2018-01-18 07:44:09 +00:00
Craig Topper
5585e2235d [X86] Remove isel patterns for using unmasked vmovdqa32/vmovdqu32 for integer vector loads.
These patterns were just looking for a vXi64 bitcasted to vXi32, but there is no advantage to using vmovdqa32 over vmovdqa64.

llvm-svn: 322819
2018-01-18 07:44:06 +00:00
Reid Kleckner
8873e1db2a [CodeGen] Hoist common AsmPrinter code out of X86, ARM, and AArch64
Every known PE COFF target emits /EXPORT: linker flags into a .drectve
section. The AsmPrinter should handle this.

While we're at it, use global_values() and emit each export flag with
its own .ascii directive. This should make the .s file output more
readable.

llvm-svn: 322788
2018-01-17 23:55:23 +00:00
Simon Pilgrim
af727ae93a [X86][BTVER2] Reduce instregex usage (PR35955)
Most are just replaced with instrs lists, but a few regexps have been further generalized to match more instructions with a single pattern.

llvm-svn: 322734
2018-01-17 19:12:48 +00:00
Craig Topper
1c79d356f8 [X86] Teach LowerBUILD_VECTOR to recognize pair-wise splats of 32-bit elements and use a 64-bit broadcast
If we are splatting pairs of 32-bit elements, we can use a 64-bit broadcast to get the job done.

We could probably do this with other sizes too, for example four 16-bit elements. Or we could broadcast pairs of 16-bit elements using a 32-bit element broadcast. But I've left that as a future improvement.

I've also restricted this to AVX2 only because we can only broadcast loads under AVX.
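
The equivalence being exploited, checked in scalar form (my illustration): a vector repeating the pair (a, b) has the same bytes as a broadcast of the packed 64-bit value.

```
#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  uint32_t a = 0x11111111, b = 0x22222222;
  uint32_t v32[8] = {a, b, a, b, a, b, a, b}; // pair-wise 32-bit splat
  uint64_t pair = ((uint64_t)b << 32) | a;    // little-endian lane packing
  uint64_t v64[4] = {pair, pair, pair, pair}; // 64-bit broadcast
  assert(std::memcmp(v32, v64, sizeof(v32)) == 0);
  return 0;
}
```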

Differential Revision: https://reviews.llvm.org/D42086

llvm-svn: 322730
2018-01-17 18:58:22 +00:00
Craig Topper
85cb5c17f6 [X86] When legalizing (v64i1 select i8, v64i1, v64i1) make sure not to introduce bitcasts to i64 in 32-bit mode
We legalize selects of masks with scalar conditions using a bitcast to an integer type. But if we are in 32-bit mode we can't convert v64i1 to i64, so instead we split the v64i1 into two v32i1 halves and concat them back together. Each half will then be legalized by bitcasting to i32, which is fine.
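
In scalar terms (my illustration): the i64 round trip is unavailable in 32-bit mode, but the two i32 halves are legal and concatenating them loses nothing.

```
#include <cassert>
#include <cstdint>

int main() {
  uint64_t Mask = 0x0123456789ABCDEFULL;         // stands in for a v64i1 value
  uint32_t Lo = (uint32_t)Mask;                  // legal i32 half
  uint32_t Hi = (uint32_t)(Mask >> 32);          // legal i32 half
  uint64_t Rejoined = ((uint64_t)Hi << 32) | Lo; // concat the halves
  assert(Rejoined == Mask);
  return 0;
}
```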

The test case is a little indirect. If we have the v64i1 select in IR it will get legalized by legalize vector ops which has a run of type legalization after it. That type legalization run is able to fix this i64 bitcast. So in order to avoid that we need a build_vector of a splat which legalize vector ops will ignore. Legalize DAG will then turn that into a select via LowerBUILD_VECTORvXi1. And the select will get legalized. In this case there is no type legalizer run to cleanup the bitcast.

This fixes pr35972.

llvm-svn: 322724
2018-01-17 18:46:01 +00:00
Benjamin Kramer
5dc3847399 [X86] Don't mutate shuffle arguments after early-out for AVX512
The match* functions have the annoying behavior of modifying their inputs.
Save and restore the inputs, just in case the early out for AVX512 is
hit. This is still not great and it's only a matter of time before this kind of
bug happens again, but I couldn't come up with a better pattern without
rewriting significant chunks of this code. Fixes PR35977.

llvm-svn: 322644
2018-01-17 13:01:06 +00:00
Benjamin Kramer
00cc7e3e32 [X86] Constify DebugLoc parameters. No functionality change.
llvm-svn: 322643
2018-01-17 13:00:58 +00:00
Andrew V. Tischenko
1fb380cedb Allow usage of X86-prefixes as separate instrs.
Differential Revision: https://reviews.llvm.org/D42102

llvm-svn: 322623
2018-01-17 10:12:06 +00:00
Craig Topper
94bc93e190 [X86] In LowerBUILD_VECTOR, rename ExtVT to EltVT so it makes sense.
llvm-svn: 322616
2018-01-17 03:58:21 +00:00
Craig Topper
aa7c3ff71e [X86] Remove duplicate lines from scheduler models. NFC
llvm-svn: 322615
2018-01-17 03:50:21 +00:00
Simon Pilgrim
15dfa1bfc1 [X86][BTVER2] Fix scheduling of VCMPSD/VCMPSS instructions
For some reason they don't have a trailing i like the packed equivalents.

llvm-svn: 322600
2018-01-16 22:15:41 +00:00
Simon Pilgrim
bbd306f411 [X86][BTVER2] Use instrs instead of instregex for low match counts (PR35955)
llvm-svn: 322598
2018-01-16 22:08:43 +00:00
Simon Pilgrim
6ddd65608e [X86][BTVER2] Use instrs instead of instregex for single use matches (PR35955)
llvm-svn: 322597
2018-01-16 21:44:48 +00:00
Simon Pilgrim
3c09cf2a76 [X86][MMX] Accept UNDEF upper bits for MOVD GR32->MMX
llvm-svn: 322574
2018-01-16 17:01:31 +00:00
Simon Pilgrim
d94666a84c [X86][MMX] Improve MMX constant generation
Extend the MMX zero code to take any constant with zero'd upper 32-bits

llvm-svn: 322553
2018-01-16 14:21:28 +00:00
Craig Topper
73a154cef6 [X86] Revisit the fix I made years ago to make 'xchgl %eax, %eax' not encode using the 0x90 encoding in 64-bit mode.
Prior to this we had a separate instruction and register class that excluded eax to prevent matching the instruction that would encode with 0x90.

This patch changes this to just use an InstAlias to force xchgl %eax, %eax to use XCHG32rr instruction in 64-bit mode. This gets rid of the separate instruction and register class.

llvm-svn: 322532
2018-01-16 06:07:16 +00:00
Craig Topper
c065faaccf [X86] Make 'xchgq %rax, %rax' an alias for the 0x90 nop encoding to match gas.
Previously we encoded it as 0x48 0x90.

llvm-svn: 322531
2018-01-16 06:07:14 +00:00
Simon Pilgrim
8a7237fd4f Avoid Wparentheses warning.
llvm-svn: 322526
2018-01-15 22:40:06 +00:00
Simon Pilgrim
a8d108687b [X86][MMX] Add support for MMX zero vector creation
As mentioned on PR35869 (and came up recently on D41517), we don't create an MMX zero register via PXOR but instead perform a spill to stack from an XMM zero register.

This patch adds support for direct MMX zero vector creation and should make it easier to add better constant vector creation in the future as well.

Differential Revision: https://reviews.llvm.org/D41908

llvm-svn: 322525
2018-01-15 22:32:40 +00:00
Simon Pilgrim
9a1d5eeb3e [X86][SSE] Add custom execution domain fixing for BLENDPD/BLENDPS/PBLENDD/PBLENDW (PR34873)
Add support for custom execution domain fixing and implement support for BLENDPD/BLENDPS/PBLENDD/PBLENDW.

Differential Revision: https://reviews.llvm.org/D42042

llvm-svn: 322524
2018-01-15 22:18:45 +00:00
Craig Topper
4a0f80b0a4 [X86] Use MVT::getVectorVT instead of EVT::getVectorVT when splitting 256/512 bit build_vectors. NFC
We must be creating a legal type here which means it can be an MVT.

llvm-svn: 322512
2018-01-15 20:33:53 +00:00
Craig Topper
a5860c1159 [X86] Generalize some code in LowerBUILD_VECTOR. NFC
llvm-svn: 322511
2018-01-15 20:33:52 +00:00
Craig Topper
a103183a87 [X86] Remove unnecessary if statement from LowerBUILD_VECTOR. NFCI
We were checking for 128, 256, or 512 bit vectors, but those are the only types that can get here.

llvm-svn: 322510
2018-01-15 20:33:50 +00:00
Simon Pilgrim
91f050f700 [X86] Fix typos in WriteVMOVNTDQSt and WriteVMOVNTPYSt pattern names. NFCI.
llvm-svn: 322495
2018-01-15 17:55:21 +00:00
Clement Courbet
e370a26012 [X86] Add missing predicates for VRNDSCALES{D,S}{m,r}
Summary: This is similar to https://reviews.llvm.org/D41983.

Reviewers: gchatelet

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42069

llvm-svn: 322486
2018-01-15 14:24:07 +00:00
Andrew V. Tischenko
99ef3709ba Update BTVER2 sched numbers for some AVX instructions (xmm version).
Differential Revision: https://reviews.llvm.org/D40067

llvm-svn: 322485
2018-01-15 14:21:11 +00:00
Clement Courbet
ededcd269c [X86]Add missing predicates for VMOVDQUYrm,VMOVDQUYmr.
Summary:
Due to missing parentheses.

This is similar to https://reviews.llvm.org/D41983.

Reviewers: gchatelet

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42062

llvm-svn: 322483
2018-01-15 13:37:05 +00:00
Clement Courbet
53fea2956b [X86] Fix missing predicates HasAVX512 Predicates in avx512_sqrt_scalar.
Summary:
For example, VSQRTSDZr and VSQRTSSZr were missing the predicate.
Also fix brace indentation for consistency.

Reviewers: craig.topper, RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41983

llvm-svn: 322478
2018-01-15 12:05:33 +00:00
Simon Pilgrim
6556cecd6b [X86][SSE] Support combining MOVLHPS undef inputs
llvm-svn: 322459
2018-01-14 18:50:34 +00:00
Craig Topper
7095915282 [X86] Use ISD::TRUNCATE instead of X86ISD::VTRUNC when input and output types have the same number of elements.
llvm-svn: 322455
2018-01-14 08:11:36 +00:00
Craig Topper
f1f9a8c7f7 [X86] Add X86ISD::VTRUNC to computeKnownBitsForTargetNode.
We have to take special care to avoid the cases where the result of the truncate would be padded with zero elements.

Ideally we'd just use ISD::TRUNCATE for these cases instead.

llvm-svn: 322454
2018-01-14 08:11:33 +00:00
Craig Topper
767d0f3bfe [X86] Improve legalization of vXi16/vXi8 selects.
Extend vXi1 conditions of vXi8/vXi16 selects even before type legalization gets a chance to split wide vectors. Previously we would only extend 128 and 256 bit vectors. But if we start with a 512 bit vector or wider that needs to be split we wouldn't extend until after the split had taken place. By extending early we improve the results of type legalization.

Don't widen condition of 128/256 bit vXi16/vXi8 selects when we have BWI but not VLX. We can still use a mask register by widening the select to 512-bits instead. This is similar to what we do for compares already.

llvm-svn: 322450
2018-01-14 02:05:51 +00:00
Zvi Rackover
441282dd8c X86: Add pattern matching for PMADDWD
In addition to the existing match as part of a loop-reduction, add a
straightforward pattern match for DAG-contained patterns.

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D41811

llvm-svn: 322446
2018-01-13 17:42:19 +00:00
Craig Topper
297e87e001 [X86] Add DAG combine to promote vXi1 result of a vXi8/vXi16 setcc when we have AVX512 but not BWI.
This avoids having the result type stick around until lowering where we have to extend the setcc and insert a truncate. If we get the types converted early we can do more to optimize it.

llvm-svn: 322432
2018-01-13 06:24:46 +00:00
Craig Topper
aa0ccaf5cd [X86] Remove unused isel pattern for zero extend from v16i1/v8i1 to v16i32/v8i64.
We have custom lowering on vzext that produces a vselect and a build vector. So zext never gets to isel.

llvm-svn: 322381
2018-01-12 17:34:09 +00:00
Craig Topper
46f6d0692f [X86] Don't allow lods/stos/scas/cmps/movs to be parsed without a suffix and only memory operand in at&t syntax.
Without a register with a size being mentioned, the instruction is ambiguous in at&t syntax. With Intel syntax the memory operand carries a size that can be used to disambiguate.

llvm-svn: 322356
2018-01-12 06:48:26 +00:00
Craig Topper
67c7573d58 [X86] Don't require suffix on 'clr' mnemonic in intel syntax
llvm-svn: 322355
2018-01-12 06:48:24 +00:00
Craig Topper
2e17caf317 [X86] Add 'l' and 'q' suffixes to the tbm instruction mnemonics.
While the suffix isn't required to disambiguate the instructions, it is required for parsing when the suffix is specified, in order to match the GNU assembler.

llvm-svn: 322354
2018-01-12 06:21:36 +00:00
Craig Topper
6509d82de1 [X86] Disable sldtq parsing in 64-bit mode.
llvm-svn: 322353
2018-01-12 05:38:15 +00:00
Craig Topper
2a983cee68 [X86] Disable movsq/stosq/scasq/cmpsq/lodsq parsing in 64-bit mode.
llvm-svn: 322352
2018-01-12 05:38:14 +00:00
David L. Jones
3c7677b5c6 Revert r322279 due to Skylake miscompile.
Summary:
This revision causes Skylake (and apparently, only Skylake) codegen to fail in
certain cases. Details: https://bugs.llvm.org/show_bug.cgi?id=35918

Subscribers: sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D41972

llvm-svn: 322335
2018-01-12 00:17:38 +00:00
Craig Topper
06f7bf5c4f [X86] Legalize 128/256 gathers/scatters on KNL by using widening rather than sign extending the index.
We can just widen the vectors with undef and zero extend the mask.

llvm-svn: 322308
2018-01-11 19:38:30 +00:00
Zvi Rackover
94be74ee9c X86: Refactor type-splitting to target-legal size vector to a helper function
Summary: This is a preparatory step for D41811: refactoring the code that breaks vector operands of a binary operation into legal types.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41925

llvm-svn: 322296
2018-01-11 17:29:47 +00:00
Simon Pilgrim
f195f0dd08 [X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.

llvm-svn: 322279
2018-01-11 14:25:18 +00:00
Zvi Rackover
2024c53178 X86: Fix LowerBUILD_VECTORAsVariablePermute for case Src is smaller than Indices
Summary:
As RKSimon suggested in pr35820, in the case that Src is smaller in
bit-size than Indices, we need to widen Src to avoid a type mismatch.

Fixes pr35820

Reviewers: RKSimon, craig.topper

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41865

llvm-svn: 322272
2018-01-11 12:26:52 +00:00
Andrew V. Tischenko
020c50b67a Implementation of X86Operand::print.
Differential Revision: https://reviews.llvm.org/D41610

llvm-svn: 322267
2018-01-11 10:31:01 +00:00
Craig Topper
0b2f3f6e7a [X86] Fix unused variable in release builds.
llvm-svn: 322262
2018-01-11 07:19:29 +00:00
Craig Topper
119ce7b99a [X86] Optimize v2i32/v2f32 scatters.
If the index is v2i64 we can use the scatter instruction that has v4i32/v4f32 data register, v2i64 index, and v2i1 mask. Similar was already done for gather.

Implement custom widening for v2i32 data to remove the code that reverses type legalization during lowering.

llvm-svn: 322254
2018-01-11 06:31:28 +00:00
Craig Topper
ee1a3c5643 [X86] Move HasNOPL to a subtarget feature bit. Plumb MCSubtargetInfo through the MCAsmBackend constructor
After D41349, we can now get an MCSubtargetInfo into the MCAsmBackend constructor. This allows us to get NOPL from a subtarget feature rather than a CPU name blacklist.

Differential Revision: https://reviews.llvm.org/D41721

llvm-svn: 322227
2018-01-10 22:07:16 +00:00
Craig Topper
8d3d87a0cc [SelectionDAG][X86] Explicitly store the scale in the gather/scatter ISD nodes
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.

Most of this patch is just making sure we copy the scale around everywhere.
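
For reference, the addressing computation the explicit scale feeds, in scalar form (my illustration): lane i of a gather loads from Base + Index[i] * Scale.

```
#include <cassert>
#include <cstdint>

int main() {
  int32_t Memory[16];
  for (int i = 0; i < 16; ++i) Memory[i] = 100 + i;
  const int64_t Scale = sizeof(int32_t); // now explicit, not inferred from Base
  int64_t Index[4] = {0, 3, 5, 7};
  int32_t Gathered[4];
  char *Base = (char *)Memory;
  for (int i = 0; i < 4; ++i)
    Gathered[i] = *(int32_t *)(Base + Index[i] * Scale);
  assert(Gathered[1] == 103 && Gathered[3] == 107);
  return 0;
}
```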

Differential Revision: https://reviews.llvm.org/D40055

llvm-svn: 322210
2018-01-10 19:16:05 +00:00
Simon Pilgrim
0166d0b510 [X86][MMX] Pull out common MMX VT test. NFCI.
llvm-svn: 322195
2018-01-10 15:32:19 +00:00
Alexey Bataev
c3619b2825 [COST]Fix PR35865: Fix cost model evaluation for shuffle on X86.
Summary:
If the vector type is transformed to a non-vector single type, the compiler
may crash trying to get vector information about the non-vector type.

Reviewers: RKSimon, spatel, mkuper, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41862

llvm-svn: 322106
2018-01-09 19:08:22 +00:00
Craig Topper
3d19c1e4f2 [X86] Add a DAG combine to combine (sext (setcc)) with VLX
Normally target independent DAG combine would do this combine based on getSetCCResultType, but with VLX getSetCCResultType returns a vXi1 type preventing the DAG combining from kicking in.

But doing this combine can allow us to remove the explicit sign extend that would otherwise be emitted.

This patch adds a target specific DAG combine to combine the sext+setcc when the result type is the same size as the input to the setcc. I've restricted this to FP compares and things that can be represented with PCMPEQ and PCMPGT since we don't have full integer compare support on the older ISAs.

Differential Revision: https://reviews.llvm.org/D41850

llvm-svn: 322101
2018-01-09 18:14:22 +00:00
Oren Ben Simhon
b468feac4d Instrument Control Flow For Indirect Branch Tracking
CET (Control-Flow Enforcement Technology) introduces a new mechanism called IBT (Indirect Branch Tracking).
According to IBT, each indirect branch should land on a dedicated ENDBR (End Branch) instruction.
The new pass adds ENDBR instructions for every indirect jmp/call (including jumps using jump tables / switches).
For more information, please see the following:
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Differential Revision: https://reviews.llvm.org/D40482

Change-Id: Icb754489faf483a95248f96982a4e8b1009eb709
llvm-svn: 322062
2018-01-09 08:51:18 +00:00
Craig Topper
922dfb8775 [X86] Allow more cmpps/pd immediate encodings to be commuted during isel.
The code that checks the immediate wasn't masking to the lower 3 bits the way the code in X86InstrInfo.cpp used by the peephole pass does.

llvm-svn: 322060
2018-01-09 07:09:34 +00:00
Craig Topper
63aae39c34 [X86] Remove llvm.x86.avx512.cvt*2mask.* intrinsics and autoupgrade to (icmp slt X, 0)
I had to drop fast-isel-abort from a test because we can't fast isel some of the mask stuff. When we used intrinsics we implicitly fell back to SelectionDAG for the intrinsic call without triggering the abort error. But with native IR that doesn't happen the same way.

llvm-svn: 322050
2018-01-09 00:50:47 +00:00
Craig Topper
c5dbf5a582 [X86] Remove unnecessary isel pattern that is a combination of two other patterns.
The pattern was this

 def : Pat<(i32 (zext (i8 (bitconvert (v8i1 VK8:$src))))),
           (MOVZX32rr8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK8:$src, GR32)), sub_8bit))>, Requires<[NoDQI]>;

but if you just let (i32 (zext X)) match by itself you'll get MOVZX32rr8. And if you let (i8 (bitconvert (v8i1 VK8:$src))) match by itself you'll get (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK8:$src, GR32)), sub_8bit).

So we can just let isel do the two patterns naturally.

llvm-svn: 322049
2018-01-09 00:50:42 +00:00
Jessica Paquette
3075bb4222 [MachineOutliner] AArch64: Handle instrs that use SP and will never need fixups
This commit does two things. First, it adds a collection of flags that can
be passed along to the target to encode information for the outliner about
the MBB that an instruction lives in.

Second, it adds some of those flags to the AArch64 outliner in order to add
more stack instructions to the list of legal instructions that are handled
by the outliner. The two flags added check if

- There are calls in the MachineBasicBlock containing the instruction
- The link register is available in the entire block

If the link register is available and there are no calls, then a stack
instruction can always be outlined without fixups, regardless of what it is,
since in this case, the outliner will never modify the stack to create a
call or outlined frame.

The motivation for doing this was checking which instructions are most often
missed by the outliner. Instructions like, say

%sp<def> = ADDXri %sp, 32, 0; flags: FrameDestroy

are very common, but cannot be outlined in the case that the outliner might
modify the stack. This commit allows us to outline instructions like this.

llvm-svn: 322048
2018-01-09 00:26:18 +00:00
Francis Visoiu Mistrih
8198e11651 [X86] Remove side-effects from determineCalleeSaves
(Target)FrameLowering::determineCalleeSaves can be called multiple
times. I don't think it should have side effects such as creating stack
objects and setting global MachineFunctionInfo state, as it does
today (in other back-ends as well).

This moves the creation of stack objects from determineCalleeSaves to
assignCalleeSavedSpillSlots.

Differential Revision: https://reviews.llvm.org/D41703

llvm-svn: 321987
2018-01-08 10:46:05 +00:00
Craig Topper
573f223715 [X86] Replace CVT2MASK ISD opcode with PCMPGTM compared to zero.
CVT2MASK is just checking the sign bit, which can be represented with a comparison against zero.
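
A minimal scalar sketch of the equivalence:

  #include <cassert>
  #include <cstdint>
  #include <initializer_list>

  // The sign bit of a lane equals a signed less-than-zero compare, so
  // cvt*2mask(x) can be rewritten as pcmpgtm(0, x), i.e. (icmp slt x, 0).
  int main() {
    for (int32_t x : {-7, -1, 0, 1, 42})
      assert(((static_cast<uint32_t>(x) >> 31) != 0) == (x < 0));
  }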

llvm-svn: 321985
2018-01-08 06:53:54 +00:00
Craig Topper
2d72cefc7c [X86] Add patterns to allow 512-bit BWI compare instructions to be used for 128/256-bit compares when VLX is not available.
llvm-svn: 321984
2018-01-08 06:53:52 +00:00
Craig Topper
5116b57234 [X86] Simplify some code in lower1BitVectorShuffle by relying on getNode's ability to constant fold vector SIGN_EXTEND.
llvm-svn: 321979
2018-01-07 23:56:37 +00:00
Craig Topper
df5dbc87a9 [X86] Add VSHUFF32X4 and similar instructions to load folding tables.
llvm-svn: 321978
2018-01-07 23:30:20 +00:00
Craig Topper
4d10a572ab [X86] Revert accidental change to CMakeLists.txt in r321952
I had removed the qualifiers around the autogenerated folding table so I could compare with the manual table, but didn't intend to commit the change.

llvm-svn: 321971
2018-01-07 21:03:43 +00:00
Craig Topper
4ebaa3ea4d [X86] Remove unneeded code from combineGatherScatter that used to delte SIGN_EXTEND_INREG nodes created during legalization of v2i1/v4i1 masks on KNL.
v2i1/v4i1 are now legal on KNL so no sign_extend_inreg is generated.

llvm-svn: 321968
2018-01-07 18:34:08 +00:00
Craig Topper
971ece4f26 [X86] Make v2i1 and v4i1 legal types without VLX
Summary:
There are a few oddities that occur because v1i1, v8i1, v16i1 are legal without v2i1 and v4i1 being legal when we don't have VLX, particularly during legalization of v2i32/v4i32/v2i64/v4i64 masked gather/scatter/load/store. We end up promoting the mask argument to these during type legalization and then have to widen the promoted type to v8iX/v16iX and truncate it to get the element size back down to v8i1/v16i1 to use a 512-bit operation. Since we need to fill the upper bits of the mask, we have to fill with 0s at the promoted type.

It would be better if we could just have the v2i1/v4i1 types as legal so they don't undergo any promotion. Then we can just widen with 0s directly in a k register. There are no real v4i1/v2i1 instructions anyway; everything is done on a larger register.

This also fixes an issue where we couldn't implement a masked vextractf32x4 from zmm to xmm properly.

We now have to support widening more compares to 512-bit to get a mask result out, so new tablegen patterns got added.

I had to hack the legalizer's widening of a setcc operand a bit so it didn't try to create a setcc returning v4i32, extract from it, then try to promote it using a sign extend to v2i1. Now we create the setcc with v4i1 if the original setcc's result type is v2i1, then extract that and don't sign extend it at all.

There's definitely room for improvement with some follow up patches.

Reviewers: RKSimon, zvi, guyblank

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41560

llvm-svn: 321967
2018-01-07 18:20:37 +00:00
Craig Topper
0141386f6a [X86] Add the 16 and 8-bit CRC32 instructions to the load folding tables.
llvm-svn: 321958
2018-01-07 06:48:20 +00:00
Craig Topper
769fb85b49 [X86] Correct the load folding flags for xmm fp->mmx conversion instructions.
The instructions that load 64 bits or an xmm register should be TB_NO_REVERSE to avoid the load being widened during unfold. The instructions that load 128 bits need to ensure 128-bit alignment.

llvm-svn: 321956
2018-01-07 06:24:30 +00:00
Craig Topper
fbb7173edc [X86] Add TB_NO_REVERSE to some scalar intrinsic instructions in the load folding table.
llvm-svn: 321955
2018-01-07 06:24:29 +00:00
Craig Topper
3dce50d8f0 [X86] Don't put any EVEX_B instructions in the tablegen generated load folding tables.
EVEX_B means different things for memory and register forms. The instructions should not be considered equivalent.

llvm-svn: 321954
2018-01-07 06:24:28 +00:00
Craig Topper
dfe08ed09d [X86] Add 128 and 256-bit VPOPCNTD/Q instructions to load folding tables.
llvm-svn: 321953
2018-01-07 06:24:27 +00:00
Craig Topper
f03febaf72 [X86] Add some 8 and 16-bit instructions to the load folding tables.
llvm-svn: 321952
2018-01-07 06:24:25 +00:00
Craig Topper
b7cd82f9cb [X86] Add EVEX vcvtph2ps to the load folding tables.
llvm-svn: 321951
2018-01-07 06:24:24 +00:00
Craig Topper
b7cde4edac [X86] Remove cvtps2ph xmm->xmm from store folding tables. Add the evex versions of cvtps2ph to the store folding tables.
The memory form of the xmm->xmm version only writes 64 bits. If we use it in the folding tables and it gets used for a stack spill, only half the slot will be written. Then a reload may read all 128 bits, which will pull in garbage. But without the spill the upper bits of the register would have been zero, so by not folding we preserve the zeros.
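
A small sketch of the hazard (illustrative C++ only, not the actual spill
code):

  #include <cassert>
  #include <cstdint>
  #include <cstring>

  int main() {
    // The memory form writes only 8 of the 16 bytes of a spill slot.
    alignas(16) uint8_t slot[16];
    std::memset(slot, 0xCD, sizeof(slot)); // stale stack contents
    uint64_t low = 0;                      // the 64 bits actually stored
    std::memcpy(slot, &low, sizeof(low));
    // A full 16-byte reload picks up the stale upper half, where the
    // register's upper bits would have been zero without the spill.
    assert(slot[8] == 0xCD);
  }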

llvm-svn: 321950
2018-01-07 06:24:23 +00:00
Craig Topper
c8b420137b [X86] Add CMP8ri8 to load folding tables.
llvm-svn: 321949
2018-01-07 06:24:21 +00:00
Craig Topper
06d343d64a [X86] Remove assembler predicates from all AVX512 related feature flags.
We don't do fine grained feature control like this on features prior to AVX512.

We do still have checks in place in the assembly parser itself that prevent %zmm references or %xmm16-31 from being parsed without at least -mattr=avx512f. Same for rounding control and mask operands. That will prevent the table matcher from matching any instructions that need those features, and that's probably good enough.

llvm-svn: 321947
2018-01-06 21:45:30 +00:00
Craig Topper
6bda84e46d [X86] Remove memory forms of EVEX encoded vcvttss2si/vcvttsd2si from asm matcher table.
This is also needed to fix PR35837.

llvm-svn: 321946
2018-01-06 21:27:25 +00:00
Craig Topper
72ad66d6f2 [X86] Add load folding pattern to EVEX vcvttss2si/vcvtsd2si.
llvm-svn: 321945
2018-01-06 21:02:26 +00:00
Craig Topper
e6c2945af8 [X86] Remove an unnecessary VCVTTSD2SIrrb/VCVTSS2SIrrb instruction with no isel pattern that only existed for the assembler. Use VCVTTSD2SIrrb_Int instead.
For consistency use the _Int version of VCVTTSD2SIrr_Int and VCVTTSD2SIrm_Int for the assembler as well.

llvm-svn: 321944
2018-01-06 21:02:22 +00:00
Craig Topper
725d7b811b [X86] Remove memory forms of EVEX encoded vcvtsd2si/vcvtss2si from the assembler matcher table
We should always prefer the VEX encoded version of these instructions. There is no advantage to the EVEX version.

Fixes PR35837.

llvm-svn: 321939
2018-01-06 19:20:33 +00:00
Sanjay Patel
eaf67121dd [x86, MemCmpExpansion] allow 2 pairs of loads per block (PR33325)
This is the last step needed to fix PR33325:
https://bugs.llvm.org/show_bug.cgi?id=33325

We're trading branches and compares for loads and logic ops.
This makes the code smaller and hopefully faster in most cases.

The 24-byte test shows an interesting construct: we load the trailing scalar 
elements into vector registers and generate the same pcmpeq+movmsk code that 
we expected for a pair of full vector elements (see the 32- and 64-byte tests).
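
A rough scalar sketch of the expansion shape for a 16-byte compare
(hypothetical helper in C++, not the pass output):

  #include <cstdint>
  #include <cstring>

  // memcmp(a, b, 16) == 0 expanded as two pairs of 8-byte loads in one
  // block, combined with xor/or instead of a branch per chunk.
  bool equal16(const void *a, const void *b) {
    uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, a, 8);
    std::memcpy(&a1, static_cast<const char *>(a) + 8, 8);
    std::memcpy(&b0, b, 8);
    std::memcpy(&b1, static_cast<const char *>(b) + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
  }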

Differential Revision: https://reviews.llvm.org/D41714

llvm-svn: 321934
2018-01-06 16:16:04 +00:00
Craig Topper
3200a63add [X86] Rename the EVEX encoded GFNI instructions to start with a 'V'. NFC
This makes the names consistent with the mnemonics like every other instruction.

llvm-svn: 321931
2018-01-06 07:18:08 +00:00
Craig Topper
f8e5fe3847 [X86] When parsing rounding mode operands, provide a proper end location so we don't crash when trying to print an error message using it.
llvm-svn: 321930
2018-01-06 06:41:07 +00:00
Craig Topper
c07cea6d51 [X86] Call lowerShuffleAsRepeatedMaskAndLanePermute from lowerV4I64VectorShuffle.
llvm-svn: 321929
2018-01-06 06:08:04 +00:00
Craig Topper
75b8a76137 [X86] Add vcvtsd2sil/vcvtsd2siq etc. InstAliases to the EVEX-encoded instructions.
This matches their VEX equivalents.

llvm-svn: 321912
2018-01-05 23:13:54 +00:00
Craig Topper
3e111f3f21 [X86] Add InstAliases for 'vmovd' with GR64 registers to select EVEX encoded instructions as well.
Without this we allow "vmovd %rax, %xmm0", but not "vmovd %rax, %xmm16"

This exists to stay compatible with a silly bug where really old versions of the GNU assembler required movd instead of movq on these instructions. This compatibility hack then crept forward to the avx versions too, but we didn't propagate it to avx512.

llvm-svn: 321903
2018-01-05 21:57:23 +00:00
Craig Topper
8df61fd902 [X86] Stop printing moves between VR64 and GR64 with 'movd' mnemonic. Use 'movq' instead.
This behavior existed to work with an old version of the gnu assembler on MacOS that only accepted this form. Newer versions of GNU assembler and the current LLVM derived version of the assembler on MacOS support movq as well.

llvm-svn: 321898
2018-01-05 20:55:12 +00:00
Craig Topper
d97e3dfb3c [X86] Correct the execution domain for AVX1 VBROADCASTF128 to be FP instead of integer.
llvm-svn: 321821
2018-01-04 20:56:21 +00:00
Craig Topper
2171c76a73 [X86] Remove 'else' after 'return' I forgot to cleanup before committing D41691.
llvm-svn: 321755
2018-01-03 19:15:43 +00:00
Craig Topper
4e2a31fc55 [X86] Remove useless custom inserter for 64-bit TAILJMP and TCRETURN opcodes
This custom inserter was added in r124272, at which time it added a bunch of Defs for Win64. In r150708, those defs were removed, leaving only the "return BB". So I think this means the custom inserter is a NOP these days.

This patch removes the remaining code and stops tagging the instructions for custom insertion.

Differential Revision: https://reviews.llvm.org/D41671

llvm-svn: 321747
2018-01-03 18:20:36 +00:00
Craig Topper
9708d5ef85 [X86] Use ANY_EXTEND instead of SIGN_EXTEND in lowerMasksToReg
Currently we use SIGN_EXTEND in lowerMasksToReg as part of calling convention setup, but we don't require a specific value for the upper bits.

This patch changes it to ANY_EXTEND which will be lowered as SIGN_EXTEND if it ends up sticking around.

llvm-svn: 321746
2018-01-03 18:11:01 +00:00
Hans Wennborg
2f8417e804 Remove left-over debug printout from r321692
Besides the unsightly print-out, it was causing some buildbots to fail,
e.g. http://lab.llvm.org:8011/builders/clang-x86-windows-msvc2015/builds/9311

llvm-svn: 321711
2018-01-03 14:48:19 +00:00
Alex Bradbury
07f78926fb Thread MCSubtargetInfo through Target::createMCAsmBackend
Currently it's not possible to access MCSubtargetInfo from a TgtMCAsmBackend. 
D20830 threaded an MCSubtargetInfo reference through 
MCAsmBackend::relaxInstruction, but this isn't the only function that would 
benefit from access. This patch removes the Triple and CPUString arguments 
from createMCAsmBackend and replaces them with MCSubtargetInfo.

This patch just changes the interface without making any intentional 
functional changes. Once in, several cleanups are possible:
* Get rid of the awkward MCSubtargetInfo handling in ARMAsmBackend
* Support 16-bit instructions when valid in MipsAsmBackend::writeNopData
* Get rid of the CPU string parsing in X86AsmBackend and just use a SubtargetFeature for HasNopl
* Emit 16-bit nops in RISCVAsmBackend::writeNopData if the compressed instruction set extension is enabled (see D41221)

This change initially exposed PR35686, which has since been resolved in r321026.

Differential Revision: https://reviews.llvm.org/D41349

llvm-svn: 321692
2018-01-03 08:53:05 +00:00
Andrew Kaylor
656d44c5da Handle the case of live 16-bit subregisters in X86FixupBWInsts
Differential Revision: https://reviews.llvm.org/D40524

Change-Id: Ie3a405b28503ceae999f5f3ba07a68fa733a2400
llvm-svn: 321674
2018-01-02 21:04:38 +00:00
Sanjay Patel
23950707ec [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons (PR33325)
This is an extension of D31156 with the goal that we'll allow memcmp() == 0 expansion 
for x86 to use 2 pairs of loads per block.

The memcmp expansion pass (formerly part of CGP) will generate this kind of pattern 
with oversized integer compares, so we want to transform these into x86-specific vector
nodes before legalization splits things into scalar chunks.
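
A scalar emulation of the pcmpeq+movmsk idiom the expansion targets
(illustrative C++ only; the function name is made up):

  #include <cstdint>

  // Compare 16 bytes lane-wise and collect one bit per lane, as
  // PCMPEQB+PMOVMSKB do; the blocks are equal iff the mask is 0xFFFF.
  uint32_t equalityMask(const uint8_t a[16], const uint8_t b[16]) {
    uint32_t mask = 0;
    for (int i = 0; i < 16; ++i)
      mask |= static_cast<uint32_t>(a[i] == b[i]) << i;
    return mask;
  }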

See PR33325 for more details:
https://bugs.llvm.org/show_bug.cgi?id=33325

Differential Revision: https://reviews.llvm.org/D41618

llvm-svn: 321656
2018-01-02 16:38:29 +00:00
Simon Pilgrim
3437b0b5f1 Strip trailing whitespace. NFCI
llvm-svn: 321644
2018-01-02 12:41:29 +00:00
Craig Topper
9b2064424f [X86] Promote vXi1 fp_to_uint/fp_to_sint to vXi32 to avoid scalarization.
llvm-svn: 321632
2018-01-01 21:12:18 +00:00
Craig Topper
22e956b139 [X86] Replace custom lowering of vXi1 SINT_TO_FP/UINT_TO_FP with promotion.
The custom lowering was just doing the same thing promotion would do.

llvm-svn: 321630
2018-01-01 20:08:43 +00:00
Craig Topper
8504e9c49b [SelectionDAG][X86][AArch64] Require targets to specify the promotion type when using setOperationAction Promote for INT_TO_FP and FP_TO_INT
Currently the promotion for these ignores the normal getTypeToPromoteTo and instead just tries to double the element width. This is because the default behavior of getTypeToPromoteTo just adds 1 to the SimpleVT, which has the effect of increasing the element count while keeping the scalar size the same.

If multiple steps are required to get to a legal operation type, int_to_fp will be promoted multiple times. And fp_to_int will keep trying wider types in a loop until it finds one that works.

getTypeToPromoteTo does have the ability to query a promotion map to get the type and not do the increasing behavior. It seems better to just let the target specify the promotion type in the map explicitly instead of letting the legalizer iterate via widening.

FWIW, I think for any other vector operations that need to be promoted, we have to specify the type explicitly because the default behavior of getTypeToPromoteTo isn't useful for vectors. The other kinds of promotion already require that either the element count or the total vector width stays constant, but neither happens by incrementing the SimpleVT enum.

Differential Revision: https://reviews.llvm.org/D40664

llvm-svn: 321629
2018-01-01 19:21:35 +00:00
Craig Topper
29a4ad935c [X86] In LowerTruncateVecI1, don't add SHL if the input is known to be all sign bits.
If the input is all sign bits then the LSB through the MSB are all the same, so we don't need to move the LSB to the MSB.
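
A minimal sketch of the reasoning:

  #include <cassert>
  #include <cstdint>
  #include <initializer_list>

  int main() {
    // An all-sign-bits i32 is either 0 or -1, so the LSB already
    // equals the MSB and the "shl by 31" that would move it up into
    // the sign position is redundant.
    for (int32_t x : {0, -1})
      assert(((x & 1) != 0) == (x < 0));
  }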

llvm-svn: 321617
2018-01-01 04:52:58 +00:00
Craig Topper
241d0e2997 [X86] Add missing NoVLX predicate around some patterns that use zmm registers to implement 128/256-bit operations without VLX.
llvm-svn: 321613
2018-01-01 01:11:32 +00:00
Craig Topper
dfd7604020 [X86] Add patterns for using zmm registers for v8i32/v8f32 vselect with the false input being zero.
We can use zmm move with zero masking for this. We already had patterns for using a masked move, but we didn't check for the zero masking case separately.
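
A scalar model of the zero-masking move semantics (illustrative C++ only):

  #include <cstdint>

  // vselect(mask, src, 0) is exactly a zero-masked move: lane i gets
  // src[i] when mask bit i is set and 0 otherwise.
  void maskedMoveZero(const int32_t src[8], uint8_t mask, int32_t dst[8]) {
    for (int i = 0; i < 8; ++i)
      dst[i] = ((mask >> i) & 1) ? src[i] : 0;
  }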

llvm-svn: 321612
2018-01-01 01:11:29 +00:00
Craig Topper
df0b012274 [X86] Use CONCAT_VECTORS instead of INSERT_SUBVECTOR for padding v4i1/v2i1 vector to v8i1 pre-legalize.
The CONCAT_VECTORS will be lowered to INSERT_SUBVECTOR later. In the modified cases this seems to be enough to trick a later DAG combine into running in a different order that allows the ANDs to be removed.

I'll admit this is a bit of a hack that happens to work, but using CONCAT_VECTORS is more consistent with other legalization code anyway.

llvm-svn: 321611
2017-12-31 19:17:52 +00:00
Simon Pilgrim
3a5f36f0a8 [X86][AVX2] Combine extract(broadcast(scalar_value)) --> scalar_value
As it has a scalar source we don't treat it as a target shuffle so needs special handling.

llvm-svn: 321610
2017-12-31 18:59:30 +00:00
Simon Pilgrim
4ec7be09ea [X86][SSE] Don't vectorize splat buildvector of binops (PR30780)
Don't combine buildvector(binop(),binop(),binop(),binop()) -> binop(buildvector(), buildvector()) if it's a splat - keep the binop scalar and just splat the result to avoid large vector constants.

llvm-svn: 321607
2017-12-31 17:07:47 +00:00
Craig Topper
98a2e44b5b [X86] Add a DAG combine to widen (i4 (bitcast (v4i1))) before type legalization sees the i4 and changes to load/store.
Same for v2i1 and i2.
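
A sketch of the bitcast semantics being preserved, assuming the usual
bit-to-lane mapping (illustrative C++ only):

  #include <cassert>
  #include <cstdint>

  int main() {
    // bitcast between i4 and v4i1: bit i of the integer is lane i of
    // the mask. Widening to i8/v8i1 keeps this as bit math instead of
    // forcing a load/store round trip through memory.
    uint8_t bits = 0b1010;
    bool lanes[4];
    for (int i = 0; i < 4; ++i)
      lanes[i] = (bits >> i) & 1;
    assert(!lanes[0] && lanes[1] && !lanes[2] && lanes[3]);
  }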

llvm-svn: 321602
2017-12-31 09:50:38 +00:00
Craig Topper
e8a65f8f62 [X86] Add a DAG combine to fix (v4i1 (bitcast (i4))) before type legalization sees the i4 and changes to load/store.
Same for i2 and v2i1.

llvm-svn: 321601
2017-12-31 08:25:50 +00:00
Craig Topper
56d68cd419 [X86] Prevent combining (v8i1 (bitconvert (i8 load)))->(v8i1 load) if we don't have DQI.
We end up using an i8 load via an isel pattern from v8i1 anyway. This just makes it more explicit. This seems to improve codegen in some cases and I'd like to kill off some of the load patterns.

llvm-svn: 321598
2017-12-31 07:38:41 +00:00
Craig Topper
b75697d9f1 [X86] Remove patterns for load/store of vXi1 with bitcasts to/from integer.
This is better handled by a DAG combine if it's not already being done. No lit tests fail from the removal of these patterns.

llvm-svn: 321597
2017-12-31 07:38:36 +00:00
Craig Topper
764a105490 [X86] Remove AND32ri8 from pattern for v1i1 load.
I don't think anything would actually expect the other bits to be zero.

llvm-svn: 321596
2017-12-31 07:38:33 +00:00
Craig Topper
8129ef9738 [X86] Fix a crash when returning a <1 x i1> value.
llvm-svn: 321595
2017-12-31 07:38:30 +00:00
Craig Topper
b994460b61 [X86] Cleanup store splitting in LowerTruncatingStore
Use getMemBasePlusOffset and calculate proper pointer info and alignment for the second store.

llvm-svn: 321594
2017-12-31 07:38:26 +00:00
Craig Topper
4797a1d955 [X86] Remove isel patterns for kshifts with types that don't support kshift natively.
We should only be creating natively supported kshifts now.

llvm-svn: 321577
2017-12-30 06:45:46 +00:00
Craig Topper
a7ef7519d1 [X86] Custom legalize vXi1 extract_subvector with KSHIFTR.
This allows us to remove some isel patterns.

This is mostly NFC, but we now use KSHIFTB instead of KSHIFTW with DQI.

llvm-svn: 321576
2017-12-30 06:45:43 +00:00
Simon Pilgrim
5e8c5c35c2 [X86][SSE] Match PSHUFLW/PSHUFHW + PSHUFD vXi16 shuffle patterns (PR34686)
As noted in PR34686, we are relying on a PSHUFD+PSHUFLW+PSHUFHW shuffle chain for most general vXi16 unary shuffles.

This patch checks for simpler PSHUFLW+PSHUFD and PSHUFHW+PSHUFD cases beforehand, building on some existing code that just handled splat shuffles.

By doing so we also prevent premature use of PSHUFB shuffles which can be slower and require the creation/loading of constant shuffle masks.

We now have the 'fast-variable-shuffle' option for hardware that prefers combining 2 or more shuffles to VPSHUFB etc.

Differential Revision: https://reviews.llvm.org/D38318

llvm-svn: 321553
2017-12-29 14:41:50 +00:00
Andrew V. Tischenko
086e74e04e Fix incorrect operand sizes for some MMX instructions: punpcklwd, punpcklbw and punpckldq.
Differential Revision: https://reviews.llvm.org/D41595

llvm-svn: 321549
2017-12-29 08:31:01 +00:00
Craig Topper
750df90652 [X86] When lowering extending loads from v2i1/v4i1, if we have VLX, use a narrower extend.
Previously we used an extend from v8i1 to v8i32/v8i64. Then extracted to the final width. But if we have VLX we should extract first. This way we don't end up with an overly large extend.

This allows us to use vcmpeq to make all ones for the sign extend when DQI isn't available. Otherwise we get a VPTERNLOG.

If we make v2i1/v4i1 legal like proposed in D41560, we could always do this and rely on the lowering of the extend to widen when necessary.

llvm-svn: 321538
2017-12-28 19:46:11 +00:00
Craig Topper
bc8b79278a [X86] Use ISD::CONCAT_VECTORS when splitting 256-bit loads in combineLoad.
llvm-svn: 321537
2017-12-28 19:46:06 +00:00
Craig Topper
312f0d549b [X86] Fix inconsistencies in different places where we split loads/stores.
-Use MinAlign instead of std::min (see the sketch after this list).
-Use SelectionDAG::getMemBasePlusOffset.
-Apply offset to the pointer info for the second load/store created.
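
A small sketch of why MinAlign is the right helper for the second half's
alignment (the formula below mirrors llvm::MinAlign's definition):

  #include <algorithm>
  #include <cassert>
  #include <cstdint>

  // The alignment of Base+Offset is the lowest set bit of
  // (Align | Offset), not the smaller of the two values.
  uint64_t minAlign(uint64_t A, uint64_t B) {
    return (A | B) & (1 + ~(A | B));
  }

  int main() {
    // A 16-byte-aligned base plus a 24-byte offset is only 8-byte
    // aligned; std::min would wrongly report 16.
    assert(minAlign(16, 24) == 8);
    assert(std::min<uint64_t>(16, 24) == 16);
  }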

llvm-svn: 321536
2017-12-28 19:46:03 +00:00
Craig Topper
92532721d6 [X86] Emit ISD::TRUNCATE instead of X86ISD::VTRUNC from LowerZERO_EXTEND_Mask/LowerSIGN_EXTEND_Mask.
The truncate will be lowered to X86ISD::VTRUNC later.

llvm-svn: 321534
2017-12-28 19:45:58 +00:00
Craig Topper
245e81c4ca [X86] Remove unnecessary patterns for sign extending vXi1 without VLX.
The custom lowering already widens the result type to 512-bits if VLX isn't supported.

llvm-svn: 321533
2017-12-28 19:45:55 +00:00
Reid Kleckner
879fd24edc [WinEH] Don't emit state stores or EH thunks for available_externally functions
The exception handler thunk needs to reference the LSDA of the parent
function, which won't be emitted if it's available_externally.

Fixes PR35736. ThinLTO ends up producing available_externally functions
that use _CxxFrameHandler3.

llvm-svn: 321532
2017-12-28 18:41:31 +00:00
Simon Pilgrim
1957cc1a74 [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
If there are 17 or more leading zeros in the v4i32 elements, then we can use PMADDWD for the integer multiply when PMULLD is unavailable or slow.

The 17 bits need to be zero because PMADDWD performs a v8i16 signed-mul-extend + pairwise-add - the upper 16 bits must be zero so we're adding a zero pair, and the 17th bit must be zero so the low i16 isn't incorrectly sign extended.
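
A scalar model of one PMADDWD lane under that precondition (a sketch in
C++, not the lowering code):

  #include <cassert>
  #include <cstdint>

  // Each i32 lane is treated as two i16s that are sign-extended,
  // multiplied, and pairwise-added. With 17+ leading zeros the high
  // i16 is 0 and the low i16 is a non-negative signed value, so the
  // madd reduces to the exact product.
  int32_t pmaddwdLane(int32_t a, int32_t b) {
    int16_t alo = int16_t(a), ahi = int16_t(a >> 16);
    int16_t blo = int16_t(b), bhi = int16_t(b >> 16);
    return int32_t(alo) * blo + int32_t(ahi) * bhi;
  }

  int main() {
    int32_t a = 0x5FFF, b = 0x7ABC; // both fit in 15 bits
    assert(pmaddwdLane(a, b) == a * b);
  }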

Differential Revision: https://reviews.llvm.org/D41484

llvm-svn: 321516
2017-12-28 10:05:49 +00:00
Craig Topper
0cde7689d4 [X86] Add CLWB to icelake.
Per Table 1-1 in the October 2017 edition of Intel® Architecture Instruction Set Extensions and Future Features

llvm-svn: 321501
2017-12-27 22:04:04 +00:00
Craig Topper
b66530c999 [X86] Reimplement r321437 using custom lowering instead of as a DAG combine.
My original implementation ran as a DAG combine post type legalization, but it turns out we don't run that DAG combine step if type legalization didn't change anything. Attempts to make the combine run before type legalization as well hit other issues.

So just do it in LowerMUL where we can catch more cases.

llvm-svn: 321496
2017-12-27 19:09:40 +00:00
Benjamin Kramer
3f0563fa89 [X86] Fix vmul combine for AVX1 targets.
v8i32 is legal on AVX1, but AVX1 doesn't have pmuludq for it.

llvm-svn: 321490
2017-12-27 13:31:50 +00:00
Craig Topper
dfd8ac6a6a [X86] Return SDValue(N, 0) instead of an SDValue() after a successful combine.
Returning SDValue() means nothing changed, SDValue(N,0) means there was a change but the worklist management was taken care of.

I don't know if this has a real effect other than making sure the combine counter in the DAG combiner gets updated, but it is the correct thing to do.

llvm-svn: 321463
2017-12-26 22:22:58 +00:00
Andrew V. Tischenko
113f912b68 It's a fix for Bug 35741 - can't use comments after x86 prefixes.
Differential Revision: https://reviews.llvm.org/D41579

llvm-svn: 321459
2017-12-26 18:29:52 +00:00
Craig Topper
8032e9b258 [X86] Pass itins.rr/itins.rm through properly for some instructions.
llvm-svn: 321452
2017-12-26 05:43:05 +00:00
Craig Topper
a225ee1dea [X86] Use SSE_INTMUL_ITINS_P for the AVX-512 MUL instructions to match their SSE/AVX counterparts.
llvm-svn: 321451
2017-12-26 05:43:04 +00:00
Craig Topper
ccab687ac6 [X86] Fix typo in assert message.
llvm-svn: 321450
2017-12-26 05:43:02 +00:00
Craig Topper
242e8ec211 [X86] Add a DAG combines to turn vXi64 muls into VPMULDQ/VPMULUDQ if the upper bits are all sign bits or zeros.
Normally we catch this during lowering, but vXi64 mul is considered legal when we have AVX512DQ.

This DAG combine allows us to avoid PMULLQ with AVX512DQ if we can prove it's unnecessary. PMULLQ is 3 uops that take 4 cycles each, while pmuldq/pmuludq is only one 4-cycle uop.
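
A scalar model of why the cheaper instruction suffices when the upper
bits are zero (illustrative C++ only):

  #include <cassert>
  #include <cstdint>

  // PMULUDQ multiplies the low 32 bits of each 64-bit lane as unsigned
  // values and produces the full 64-bit product. When the upper 32 bits
  // of both operands are known zero, this equals the full vXi64
  // multiply, so the 3-uop PMULLQ can be avoided.
  uint64_t pmuludqLane(uint64_t a, uint64_t b) {
    return uint64_t(uint32_t(a)) * uint64_t(uint32_t(b));
  }

  int main() {
    uint64_t a = 0x12345678, b = 0x9ABCDEF0; // upper halves zero
    assert(pmuludqLane(a, b) == a * b);
  }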

llvm-svn: 321437
2017-12-25 06:47:10 +00:00
Craig Topper
18c5d6d384 [X86] Make some helper methods static functions instead. NFC
llvm-svn: 321433
2017-12-25 00:54:53 +00:00
Craig Topper
08af9fe055 [X86] Use SelectionDAG::getFPExtendOrRound to simplify some code.
llvm-svn: 321432
2017-12-25 00:54:51 +00:00
Simon Pilgrim
7bfc03133a [X86][X87] Mark pseudo memory fold instructions as load/sideeffects (PR21160, PR34080, PR34454).
Match regular x87 memory fold instructions with load/sideeffects tags, to prevent the schedulers from re-ordering them across the fnstcw/fldcw sequences for truncating stores while they are still pseudo during the stack conversion pass.

llvm-svn: 321424
2017-12-24 12:20:21 +00:00
Craig Topper
7136c64e49 [X86] Fix (v2f64 (s/uint_to_fp (v2i1))) to avoid scalarization without AVX512DQ.
Previously we extended v2i1 to v2f64 and then tried to use cvtuqq2pd/cvtqq2pd, but that only works with avx512dq. So we ended up scalarizing it. Now we widen to v4i1 first and extend to v4i32.

llvm-svn: 321420
2017-12-24 06:51:36 +00:00
Craig Topper
7679c60646 [X86] Add assembler predicates to BITALG/VBMI2/VNNI features to be consistent with the other AVX512 ISAs.
llvm-svn: 321416
2017-12-24 02:05:17 +00:00
Craig Topper
3628945622 [X86] Teach WidenMaskArithmetic to handle any constant buildvector on the RHS, not just all zeros/ones.
llvm-svn: 321415
2017-12-24 01:03:31 +00:00
Craig Topper
a03424730d [X86] Remove type restrictions from WidenMaskArithmetic.
This can help AVX-512 code where mask types are legal, allowing us to remove extends and truncates to/from mask types.

llvm-svn: 321408
2017-12-23 18:53:05 +00:00