1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-10-22 20:43:44 +02:00
Commit Graph

3136 Commits

Author SHA1 Message Date
Yi Kong
4d91bb0380 [ARM] Add Kryo to available targets
Summary:
Host CPU detection now supports Kryo, so we need to recognize it in ARM
target.

Reviewers: mcrosier, t.p.northover, rengolin, echristo, srhines

Reviewed By: t.p.northover, echristo

Subscribers: aemerson

Differential Revision: https://reviews.llvm.org/D31775

llvm-svn: 299674
2017-04-06 18:10:08 +00:00
Sanjay Patel
bb37f0efa2 [DAGCombiner] add and use TLI hook to convert and-of-seteq / or-of-setne to bitwise logic+setcc (PR32401)
This is a generic combine enabled via target hook to reduce icmp logic as discussed in:
https://bugs.llvm.org/show_bug.cgi?id=32401

It's likely that other targets will want to enable this hook for scalar transforms, 
and there are probably other patterns that can use bitwise logic to reduce comparisons.

Note that we are missing an IR canonicalization for these patterns, and we will probably
prefer the pair-of-compares form in IR (shorter, more likely to fold).

Differential Revision: https://reviews.llvm.org/D31483

llvm-svn: 299542
2017-04-05 14:09:39 +00:00
Sanjay Patel
04904857ac add/move codegen tests for and/or of setcc; NFC
llvm-svn: 299396
2017-04-03 22:45:46 +00:00
Sanjay Patel
5936cb0926 [CodeGen] clean up and add tests for scalar and-of-setcc; NFC
https://bugs.llvm.org/show_bug.cgi?id=32401

llvm-svn: 299034
2017-03-29 21:58:52 +00:00
Javed Absar
3e2ba99436 Improve machine schedulers for in-order processors
This patch enables schedulers to specify instructions that 
cannot be issued with any other instructions.
It also fixes BeginGroup/EndGroup.

Reviewed by: Andrew Trick
Differential Revision: https://reviews.llvm.org/D30744

llvm-svn: 298885
2017-03-27 20:46:37 +00:00
Eli Friedman
a062aefd6c [ARM] Fix mixup between Lo and Hi in SMLALBB formation.
llvm-svn: 298752
2017-03-25 00:13:24 +00:00
Pirama Arumuga Nainar
fdf9c3959d [ARM] Fix computeKnownBits for ARMISD::CMOV
Summary:
The true and false operands for the CMOV are operands 0 and 1.
ARMISelLowering.cpp::computeKnownBits was looking at operands 1 and 2
instead.  This can cause CMOV instructions to be incorrectly folded into
BFI if value set by the CMOV is another CMOV, whose known bits are
computed incorrectly.

This patch fixes the issue and adds a test case.

Reviewers: kristof.beyls, jmolloy

Subscribers: llvm-commits, aemerson, srhines, rengolin

Differential Revision: https://reviews.llvm.org/D31265

llvm-svn: 298624
2017-03-23 16:47:47 +00:00
Volkan Keles
bc9b38e9f1 [GlobalISel] Fix shufflevector tests
clang-lld-x86_64-2stage fails because of the order
of the instructions. `CHECK-DAG` directives should
fix the problem.

llvm-svn: 298367
2017-03-21 13:12:59 +00:00
Volkan Keles
4d84f9bcb3 [GlobalISel] Translate shufflevector
Reviewers: qcolombet, aditya_nandakumar, t.p.northover, javed.absar, ab, dsanders

Reviewed By: javed.absar

Subscribers: dberris, rovka, llvm-commits, kristof.beyls

Differential Revision: https://reviews.llvm.org/D30962

llvm-svn: 298347
2017-03-21 08:44:13 +00:00
Vadzim Dambrouski
c2df2c846d [ARM] Fix PR32130: Handle promotion of zero sized constants.
The special case of zero sized values was previously not handled correctly.
This patch handles this by not promoting if the size is zero.

Patch by Tim Neumann.

Differential Revision: https://reviews.llvm.org/D31116

llvm-svn: 298320
2017-03-20 22:59:57 +00:00
Diana Picus
bb031998bf [GlobalISel] Use the correct calling conv for calls
This commit adds a parameter that lets us pass in the calling convention
of the call to CallLowering::lowerCall. This allows us to handle
situations where the calling convetion of the callee is different from
that of the caller.

Differential Revision: https://reviews.llvm.org/D31039

llvm-svn: 298254
2017-03-20 14:40:18 +00:00
Ahmed Bougacha
0d77054e90 [GlobalISel] Don't select trivially dead instructions.
Folding instructions when selecting can cause them to become dead.
Don't select these dead instructions (if they don't have other side
effects, and don't define physical registers).

Preserve existing tests by adding COPYs.

In some tests, the G_CONSTANT vregs never get constrained to a class:
the only use of the vreg was folded into another instruction, so the
G_CONSTANT, now dead, never gets selected.

llvm-svn: 298224
2017-03-19 16:13:00 +00:00
Eli Friedman
c0e6994e99 [SelectionDAG] Remove redundant stores more aggressively.
Handle TokenFactors more aggressively in
SDValue::reachesChainWithoutSideEffects.  This isn't really a
very effective change anymore because of other changes to
chain handling, but it's a cheap check, and the expanded
comments are still useful.

It might be possible to loosen the hasOneUse() requirement with a
deeper analysis, but a naive implementation of that check would be
expensive.

Differential Revision: https://reviews.llvm.org/D29845

llvm-svn: 298156
2017-03-17 22:15:50 +00:00
Eli Friedman
82514bcbc5 [ARM] Use alias analysis in ARMPreAllocLoadStoreOpt.
This allows the optimization to rearrange loads and stores more
aggressively. This doesn't really affect performance, but it helps
codesize.

Differential Revision: https://reviews.llvm.org/D30839

llvm-svn: 298021
2017-03-17 00:34:26 +00:00
Adrian Prantl
018bf36c06 PR32288: More efficient encoding for DWARF expr subregister access.
Citing http://bugs.llvm.org/show_bug.cgi?id=32288

  The DWARF generated by LLVM includes this location:

  0x55 0x93 0x04 DW_OP_reg5 DW_OP_piece(4) When GCC's DWARF is simply
  0x55 (DW_OP_reg5) without the DW_OP_piece. I believe it's reasonable
  to assume the DWARF consumer knows which part of a register
  logically holds the value (low bytes, high bytes, how many bytes,
  etc) for a primitive value like an integer.

This patch gets rid of the redundant DW_OP_piece when a subregister is
at offset 0. It also adds previously missing subregister masking when
a subregister is followed by another operation.

(This reapplies r297960 with two additional testcase updates).

rdar://problem/31069390
https://reviews.llvm.org/D31010

llvm-svn: 297965
2017-03-16 17:14:56 +00:00
Jonas Paulsson
cbcaf13b31 [SelectionDAG] Optimize VSELECT->SETCC of incompatible or illegal types.
Don't scalarize VSELECT->SETCC when operands/results needs to be widened,
or when the type of the SETCC operands are different from those of the VSELECT.

(VSELECT SETCC) and (VSELECT (AND/OR/XOR (SETCC,SETCC))) are handled.

The previous splitting of VSELECT->SETCC in DAGCombiner::visitVSELECT() is
no longer needed and has been removed.

Updated tests:

test/CodeGen/ARM/vuzp.ll
test/CodeGen/NVPTX/f16x2-instructions.ll
test/CodeGen/X86/2011-10-19-widen_vselect.ll
test/CodeGen/X86/2011-10-21-widen-cmp.ll
test/CodeGen/X86/psubus.ll
test/CodeGen/X86/vselect-pcmp.ll

Review: Eli Friedman, Simon Pilgrim
https://reviews.llvm.org/D29489

llvm-svn: 297930
2017-03-16 07:17:12 +00:00
Tim Northover
f6b21811fd ARM: avoid clobbering register in v6 jump-table expansion.
If we got unlucky with register allocation and actual constpool placement, we
could end up producing a tTBB_JT with an index that's already been clobbered.

Technically, we might be able to fix this situation up with a MOV, but I think
the constant islands pass is complex enough without having to deal with more
weird edge-cases.

llvm-svn: 297871
2017-03-15 18:38:13 +00:00
Sam Parker
1c17803a4b [ARM] Enable SMLAL[B|T] isel
Enable the selection of the 64-bit signed multiply accumulate
instructions which operate on 16-bit operands. These are enabled for
ARMv5TE onwards for ARM and for V6T2 and other DSP enabled Thumb
architectures.

Differential Revision: https://reviews.llvm.org/D30044

llvm-svn: 297809
2017-03-15 08:27:11 +00:00
Sam Parker
1cba526c4c [ARM] Move SMULW[B|T] isel to DAG Combine
Create nodes for smulwb and smulwt and move their selection from
DAGToDAG to DAG combine. smlawb and smlawt can then be selected
using tablegen. Added some helper functions to detect shift patterns
as well as a wrapper around SimplifyDemandBits. Added a couple of
extra tests.

Differential Revision: https://reviews.llvm.org/D30708

llvm-svn: 297716
2017-03-14 09:13:22 +00:00
Nirav Dave
889cd22a6a In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting with compiler time improvements

    Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.

    * Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 297695
2017-03-14 00:34:14 +00:00
Diana Picus
48656d6f1e [ARM] GlobalISel: Support SP in regbankselect
We used to hit an unreachable in getRegBankFromRegClass when dealing with the
stack pointer. This commit adds support for the GPRsp reg class.

llvm-svn: 297621
2017-03-13 14:28:34 +00:00
Eli Friedman
c2710dc724 [DAGCombine] Simplify ISD::AND in GetDemandedBits.
This helps in cases involving bitfields where an AND is exposed by
legalization.

Differential Revision: https://reviews.llvm.org/D30472

llvm-svn: 297249
2017-03-08 00:56:35 +00:00
Arnold Schwaighofer
2d32893f55 SjLjEHPrepare: Fix the pass for swifterror arguments
We cannot leave the identity copies 'select true, arg, undef' that this pass
inserts for arguments to simplify handling of values on swifterror arguments.

swifterror arguments have restrictions on their uses.

rdar://30839288

llvm-svn: 297197
2017-03-07 20:29:02 +00:00
Ranjeet Singh
1952e5932f [ARM] Reapply r296865 "[ARM] fpscr read/write intrinsics not aware of each other""
The original patch r296865 was reverted as it broke the chromium builds for
Android https://bugs.llvm.org/show_bug.cgi?id=32134, this patch reapplies
r296865 with a fix to make sure it doesn't cause the build regression.

The problem was that intrinsic selection on int_arm_get_fpscr was failing in
ISel this was because the code to manually select this intrinsic still thought
it was the version with no side-effects (INTRINSIC_WO_CHAIN) which is wrong as
it doesn't semantically match the definition in the tablegen code which says it
does have side-effects, I've fixed this by updating the intrinsic type to
INTRINSIC_W_CHAIN (has side-effects). I've also added a test for this based on
Hans original reproducer.

Differential Revision: https://reviews.llvm.org/D30645

llvm-svn: 297137
2017-03-07 11:17:53 +00:00
Artyom Skrobov
9e4b511e37 In Thumb1, materialize a move between low registers as a movs, if CPSR isn't live.
Summary: Previously, it had always been materialized as a push/pop sequence.

Reviewers: labrinea, jroelofs

Reviewed By: jroelofs

Subscribers: llvm-commits, rengolin

Differential Revision: https://reviews.llvm.org/D30648

llvm-svn: 297134
2017-03-07 09:38:16 +00:00
Tim Northover
948ccd8eb0 GlobalISel: restrict G_EXTRACT instruction to just one operand.
A bit more painful than G_INSERT because it was more widely used, but this
should simplify the handling of extract operations in most locations.

llvm-svn: 297100
2017-03-06 23:50:28 +00:00
Hans Wennborg
c1102a606f Revert r296865 "[ARM] fpscr read/write intrinsics not aware of each other"
It caused PR32134: "Cannot select: intrinsic %llvm.arm.get.fpscr".

llvm-svn: 296926
2017-03-03 23:19:31 +00:00
Ranjeet Singh
e9e99b10c8 [ARM] fpscr read/write intrinsics not aware of each other
The intrinsics __builtin_arm_get_fpscr and __builtin_arm_set_fpscr read and
write to the fpscr (Floating-Point Status and Control Register) register.

A bug exists in the __builtin_arm_get_fpscr intrinsic definition in llvm which
treats this intrinsic as a IntroNoMem which means it's not a memory access and
doesn't have any other side-effects. Having this property on this intrinsic
means that various optimizations can be done on this such as common
sub-expression elimination with other reads. This can cause issues if there has
been write to this register, e.g.

void foo(int *p) {
     p[0] = __builtin_arm_get_fpscr();
     __builtin_arm_set_fpscr(1);
     p[1] = __builtin_arm_get_fpscr();
}

in the above example the second read is currently CSE'd into the first read,
this is because llvm isn't aware that the write done by __builtin_arm_set_fpscr
effects the same register that __builtin_arm_get_fpscr reads from, to fix this
problem I've removed the property IntrNoMem so that __builtin_arm_get_fpscr is
treated as a memory access.

Differential Revision: https://reviews.llvm.org/D30542

llvm-svn: 296865
2017-03-03 11:40:07 +00:00
Chandler Carruth
7b066daf55 [SDAG] Revert r296476 (and r296486, r296668, r296690).
This patch causes compile times for some patterns to explode. I have
a (large, unreduced) test case that slows down by more than 20x and
several test cases slow down by 2x. I'm sending some of the test cases
directly to Nirav and following up with more details in the review log,
but this should unblock anyone else hitting this.

llvm-svn: 296862
2017-03-03 10:02:25 +00:00
Eli Friedman
5ccdb47717 [ARM] Fix insert point for store rescheduling.
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation
as LastOp.

This patch fixes some cases where we would move stores to the wrong
insert point.

Re-commit with a fix to increment NumMove in the right place.

Differential Revision: https://reviews.llvm.org/D30124

llvm-svn: 296815
2017-03-02 21:39:39 +00:00
Sanjay Patel
ae7d68a14f [DAGCombiner] avoid assertion when folding binops with opaque constants
This bug was introduced with:
https://reviews.llvm.org/rL296699

There may be a way to loosen the restriction, but for now just bail out
on any opaque constant.

The tests show that opacity is target-specific. This goes back to cost
calculations in ConstantHoisting based on TTI->getIntImmCost().

llvm-svn: 296768
2017-03-02 17:18:56 +00:00
Eli Friedman
5d8cb8245c Revert r296708; causing test failures on ARM hosts.
Original commit message:

[ARM] Fix insert point for store rescheduling.
    
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation as
LastOp.
    
This patch fixes some cases where we would sink stores for no reason.

llvm-svn: 296718
2017-03-02 00:08:50 +00:00
Eli Friedman
d150ed85fb [ARM] Fix insert point for store rescheduling.
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation as
LastOp.
    
This patch fixes some cases where we would sink stores for no reason.
    
Differential Revision: https://reviews.llvm.org/D30124

llvm-svn: 296708
2017-03-01 23:20:29 +00:00
Eli Friedman
c3db369888 [ARM] Check correct instructions for load/store rescheduling.
This code starts from the high end of the sorted vector of offsets, and
works backwards: it tries to find contiguous offsets, process them, then
pops them from the end of the vector. Most of the code agrees with this
order of processing, but one loop doesn't: it instead processes elements
from the low end of the vector (which are nodes with unrelated offsets).
Fix that loop to process the correct elements.
    
This has a few implications. One, we don't incorrectly return early when
processing multiple groups of offsets in the same block (which allows
rescheduling prera-ldst-insertpt.mir). Two, we pick the correct insert
point for loads, so they're correctly sorted (which affects the
scheduling of vldm-liveness.ll). I think it might also impact some of
the heuristics slightly.
    
Differential Revision: https://reviews.llvm.org/D30368

llvm-svn: 296701
2017-03-01 22:56:20 +00:00
Sanjay Patel
b91c17c096 [DAGCombiner] fold binops with constant into select-of-constants
This is part of the ongoing attempt to improve select codegen for all targets and select 
canonicalization in IR (see D24480 for more background). The transform is a subset of what
is done in InstCombine's FoldOpIntoSelect().

I first noticed a regression in the x86 avx512-insert-extract.ll tests with a patch that 
hopes to convert more selects to basic math ops. This appears to be a general missing DAG
transform though, so I added tests for all standard binops in rL296621 
(PowerPC was chosen semi-randomly; it has scripted FileCheck support, but so do ARM and x86).

The poor output for "sel_constants_shl_constant" is tracked with:
https://bugs.llvm.org/show_bug.cgi?id=32105

Differential Revision: https://reviews.llvm.org/D30502

llvm-svn: 296699
2017-03-01 22:51:31 +00:00
Reid Kleckner
599ca91378 Elide argument copies during instruction selection
Summary:
Avoids tons of prologue boilerplate when arguments are passed in memory
and left in memory. This can happen in a debug build or in a release
build when an argument alloca is escaped.  This will dramatically affect
the code size of x86 debug builds, because X86 fast isel doesn't handle
arguments passed in memory at all. It only handles the x86_64 case of up
to 6 basic register parameters.

This is implemented by analyzing the entry block before ISel to identify
copy elision candidates. A copy elision candidate is an argument that is
used to fully initialize an alloca before any other possibly escaping
uses of that alloca. If an argument is a copy elision candidate, we set
a flag on the InputArg. If the the target generates loads from a fixed
stack object that matches the size and alignment requirements of the
alloca, the SelectionDAG builder will delete the stack object created
for the alloca and replace it with the fixed stack object. The load is
left behind to satisfy any remaining uses of the argument value. The
store is now dead and is therefore elided. The fixed stack object is
also marked as mutable, as it may now be modified by the user, and it
would be invalid to rematerialize the initial load from it.

Supersedes D28388

Fixes PR26328

Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans

Subscribers: igorb, llvm-commits

Differential Revision: https://reviews.llvm.org/D29668

llvm-svn: 296683
2017-03-01 21:42:00 +00:00
Artur Pilipenko
3d8bddc9b0 [DAGCombiner] Support {a|s}ext, {a|z|s}ext load nodes in load combine
Resubmit r295336 after the bug with non-zero offset patterns on BE targets is fixed (r296336).

Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patters.

Reviewed By: filcab

Differential Revision: https://reviews.llvm.org/D29591

llvm-svn: 296651
2017-03-01 18:12:29 +00:00
Diana Picus
e1c659ac82 [ARM] GlobalISel: Lower call params that need extensions
Lower i1, i8 and i16 call parameters by extending them before storing them on
the stack. Also make sure we encode the correct, extended size in the
corresponding memory operand, and that we compute the correct stack size in the
end.

The latter is a bit more complicated because we used to compute the stack size
in the getStackAddress method, based on the Size and Offset of the parameters.
However, if the last parameter is sign extended, we'd be using the wrong,
non-extended size, and we'd end up with a smaller stack than we need to hold the
extended value. Instead of hacking this up based on the value of Size in
getStackAddress, we move our stack size handling logic to assignArg, where we
have access to the CCState which knows everything we could possibly want to know
about the stack. This way we don't need to duplicate any knowledge or resort to
any ugly hacks.

On this same occasion, update the IRTranslator test to check the sizes of the
stores everywhere, not just for sign extended paramteres.

llvm-svn: 296631
2017-03-01 15:35:14 +00:00
Nirav Dave
e24ecaa975 In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.

    * Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 296476
2017-02-28 14:24:15 +00:00
Diana Picus
c958dc921b [ARM] GlobalISel: Lower i32 and fp call parameters on the stack
Lower i32, float and double parameters that need to live on the stack. This
boils down to creating some G_GEPs starting from the stack pointer and storing
the values there. During the process we also keep track of the stack size and
use the final value in the ADJCALLSTACKDOWN/UP instructions.

We currently assert for smaller types, since they usually require extensions.
They will be handled in a separate patch.

llvm-svn: 296473
2017-02-28 14:17:53 +00:00
Diana Picus
9f47d8e6fa [ARM] GlobalISel: Select 32-bit G_CONSTANT
Put it into a register by means of a MOVi.

llvm-svn: 296471
2017-02-28 13:05:42 +00:00
Diana Picus
74ef680dc6 [ARM] GlobalISel: Add mapping for G_CONSTANT
Like G_FRAME_INDEX, G_CONSTANT has one register operand and one non-register
operand.

llvm-svn: 296469
2017-02-28 12:13:58 +00:00
Diana Picus
2640394619 [ARM] GlobalISel: Legalize 32-bit constants
llvm-svn: 296468
2017-02-28 11:33:46 +00:00
Diana Picus
7cb8f6733d [ARM] GlobalISel: Select G_GEP
At this point, G_GEP is just an add, so we treat it exactly like a G_ADD.

llvm-svn: 296462
2017-02-28 10:14:38 +00:00
Diana Picus
80a5ee84d6 [ARM] GlobalISel: Add reg bank mapping for G_GEP
This should be the same as the mapping for G_ADD etc.

llvm-svn: 296455
2017-02-28 09:35:10 +00:00
Diana Picus
95763abcda [ARM] GlobalISel: Legalize G_GEP with 32-bit offsets
At the moment we're only interested in GEPs for putting call parameters on the
stack, so we'll stick to 32-bit offsets.

llvm-svn: 296452
2017-02-28 09:02:42 +00:00
Michael Kuperstein
bbb5beaf34 [CGP] Split some critical edges coming out of indirect branches
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.

This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.

Differential Revision: https://reviews.llvm.org/D29916

llvm-svn: 296416
2017-02-28 00:11:34 +00:00
Sanjay Patel
5b3ea3a127 [ARM] don't transform an add(ext Cond), C to select unless there's a setcc of the condition
The transform in question claims to be doing:

// fold (add (select cc, 0, c), x) -> (select cc, x, (add, x, c))

...starting in PerformADDCombineWithOperands(), but it wasn't actually checking for a setcc node
for the sext/zext patterns.

This is exactly the opposite of a transform I'd like to add to DAGCombiner's foldSelectOfConstants(),
so I was seeing infinite loops with my draft of a patch applied.

The changes in select_const.ll look positive (less instructions). The change in arm-and-tst-peephole.ll
is unrelated. We're changing the input IR in that test to preserve the intent of the test, but that's 
not affected by this code change.

Differential Revision:
https://reviews.llvm.org/D30355

llvm-svn: 296389
2017-02-27 21:30:54 +00:00
Artur Pilipenko
79793a245a [DAGCombine] Fix for a load combine bug with non-zero offset patterns on BE targets
This pattern is essentially a i16 load from p+1 address:

  %p1.i16 = bitcast i8* %p to i16*
  %p2.i8 = getelementptr i8, i8* %p, i64 2
  %v1 = load i16, i16* %p1.i16
  %v2.i8 = load i8, i8* %p2.i8
  %v2 = zext i8 %v2.i8 to i16
  %v1.shl = shl i16 %v1, 8
  %res = or i16 %v1.shl, %v2

Current implementation would identify %v1 load as the first byte load and would mistakenly emit a i16 load from %p1.i16 address. This patch adds a check that the first byte is loaded from a non-zero offset of the first load address. This way this address can be used as the base address for the combined value. Otherwise just give up combining.

llvm-svn: 296336
2017-02-27 13:04:23 +00:00
Amaury Sechet
0f0c173f03 Do full codegen for various tests. NFC
llvm-svn: 296305
2017-02-27 01:15:57 +00:00
Daniel Jasper
25b14cabc6 Revert "[CGP] Split some critical edges coming out of indirect branches"
This reverts commit r296149 as it leads to crashes when compiling for
PPC.

llvm-svn: 296295
2017-02-26 11:09:12 +00:00
Nirav Dave
e1556d9e43 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r296252 until 256-bit operations are more efficiently generated in X86.

llvm-svn: 296279
2017-02-26 01:27:32 +00:00
Artyom Skrobov
e038fcb62b The automatic CHECK: to CHECK-LABEL: conversion, back in 2013,
had missed most labels in this test because they didn't end
with a colon.

llvm-svn: 296254
2017-02-25 15:17:16 +00:00
Nirav Dave
0d7bce1241 In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.

    * Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 296252
2017-02-25 11:43:58 +00:00
Sanjay Patel
de9324159f [ARM] add tests for alternate forms of select-of-constants; NFC
llvm-svn: 296178
2017-02-24 21:36:34 +00:00
Sanjay Patel
2027771bad [ARM] auto-generate complete checks; NFC
The affected test may change with a patch I'm looking at for DAGCombiner,
so I want to make sure it's not a regression.

llvm-svn: 296175
2017-02-24 21:19:09 +00:00
Michael Kuperstein
c87d81a5a0 [CGP] Split some critical edges coming out of indirect branches
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.

This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.

Differential Revision: https://reviews.llvm.org/D29916

llvm-svn: 296149
2017-02-24 18:41:32 +00:00
Sanjay Patel
a3c6b40b1a [DAGCombiner] add missing folds for scalar select of {-1,0,1}
The motivation for filling out these select-of-constants cases goes back to D24480, 
where we discussed removing an IR fold from add(zext) --> select. And that goes back to:
https://reviews.llvm.org/rL75531
https://reviews.llvm.org/rL159230

The idea is that we should always canonicalize patterns like this to a select-of-constants 
in IR because that's the smallest IR and the best for value tracking. Note that we currently 
do the opposite in some cases (like the cases in *this* patch). Ie, the proposed folds in 
this patch already exist in InstCombine today:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSelect.cpp#L1151

As this patch shows, most targets generate better machine code for simple ext/add/not ops 
rather than a select of constants. So the follow-up steps to make this less of a patchwork 
of special-case folds and missing IR canonicalization:

1. Have DAGCombiner convert any select of constants into ext/add/not ops.
2  Have InstCombine canonicalize in the other direction (create more selects).

Differential Revision: https://reviews.llvm.org/D30180

llvm-svn: 296137
2017-02-24 17:17:33 +00:00
Diana Picus
8c539c18a9 [ARM] GlobalISel: Select G_STORE
Same as selecting G_LOAD.

llvm-svn: 296122
2017-02-24 14:01:27 +00:00
Diana Picus
1758d3cace Minor test fix
The test was using a size of 8 for loading/storing pointers. It should be 4.

llvm-svn: 296120
2017-02-24 13:27:55 +00:00
Diana Picus
003ac86cc4 [ARM] GlobalISel: Add reg bank mappings for stores
Same as the ones for loads.

llvm-svn: 296115
2017-02-24 13:07:25 +00:00
Diana Picus
1182c715bd [ARM] GlobalISel: Legalize stores
Allow the same types that we allow for loads.

llvm-svn: 296108
2017-02-24 11:28:24 +00:00
Diana Picus
fa51ff1661 Revert "[ARM] GlobalISel: Legalize stores"
This reverts commit r296103 because the test broke on one of the bots. Sorry!

llvm-svn: 296104
2017-02-24 10:35:39 +00:00
Diana Picus
a2fddd6d10 [ARM] GlobalISel: Legalize stores
Allow the same types that we allow for loads.

llvm-svn: 296103
2017-02-24 10:19:23 +00:00
Eli Friedman
ae591c71b4 Add some testcases for bitfields with illegal widths.
clang will generate IR like this for input using packed bitfields;
very simple semantically, but it's a bit tricky to actually
generate good code.

llvm-svn: 296080
2017-02-24 03:04:11 +00:00
Michael Kuperstein
256f5bb263 Revert r269060 to pacify bots.
llvm-svn: 296064
2017-02-24 01:22:19 +00:00
Michael Kuperstein
dbcc6fef4c [CGP] Split some critical edges coming out of indirect branches
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.

This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.

Differential Revision: https://reviews.llvm.org/D29916

llvm-svn: 296060
2017-02-24 00:56:21 +00:00
Tim Northover
26bd3a9606 ARM: make sure FastISel bails on f64 operations for Cortex-M4.
FastISel wasn't checking the isFPOnlySP subtarget feature before emitting
double-precision operations, so it got completely invalid CodeGen for doubles
on Cortex-M4F.

The normal ISel testing wasn't spectacular either so I added a second RUN line
to improve that while I was in the area.

llvm-svn: 296031
2017-02-23 22:35:00 +00:00
Diana Picus
cc63e86ff2 [ARM] GlobalISel: Lower call returns
Introduce a common ValueHandler for call returns and formal arguments, and
inherit two different versions for handling the differences (at the moment the
only difference is the way physical registers are marked as used).

llvm-svn: 295973
2017-02-23 14:18:41 +00:00
Diana Picus
587106e326 [ARM] GlobalISel: Lower call parameters in regs
Add support for lowering calls with parameters than can fit into regs.  Use the
same ValueHandler that we used for function returns, but rename it to match its
new, extended purpose.

llvm-svn: 295971
2017-02-23 13:25:43 +00:00
Kristof Beyls
9ca2feaac5 Fix assertion failure in ARMConstantIslandPass.
The ARMConstantIslandPass didn't have support for handling accesses to
constant island objects through ARM::t2LDRBpci instructions. This adds
support for that.

This fixes PR31997.

llvm-svn: 295964
2017-02-23 12:24:55 +00:00
Bill Seurer
2628d17171 [DAGCombiner] revert r295336
r295336 causes a bootstrapped clang to fail for many compilations on
powerpc BE.  See 
http://lab.llvm.org:8011/builders/clang-ppc64be-linux-multistage/builds/2315
for example.

Reverting as per the developer's request.

llvm-svn: 295849
2017-02-22 16:27:33 +00:00
Javed Absar
a1d1885fcc [ARM] Classification Improvements to ARM Sched-Models. NFCI.
This patch adds missing sched classes for Thumb2 instructions.
This has been missing so far, and as a consequence, machine
scheduler models for individual sub-targets have tended to
be larger than they needed to be. These patches should help
write schedulers better and faster in the future
for ARM sub-targets.

Reviewer: Diana Picus
Differential Revision: https://reviews.llvm.org/D29953

llvm-svn: 295811
2017-02-22 07:22:57 +00:00
Evgeniy Stepanov
c010343bc2 Fix PR31896.
Address of an alias of a global with offset is incorrectly lowered as an address of the global (i.e. ignoring offset).

llvm-svn: 295762
2017-02-21 20:17:34 +00:00
Diana Picus
eff0bb64c9 [ARM] GlobalISel: Lower calls to void() functions
For now, we hardcode a BLX instruction, and generate an ADJCALLSTACKDOWN/UP pair
with amount 0.

llvm-svn: 295716
2017-02-21 11:33:59 +00:00
Sanne Wouda
0552a7b4d5 [ARM] Add a div regression test for Cortex-M23
Summary:
This file was missed in the commit for Cortex-M23 and Cortex-M33
support.  See https://reviews.llvm.org/D29073?id=85814 .

Reviewers: rengolin, javed.absar, samparker

Reviewed By: samparker

Subscribers: llvm-commits, aemerson

Differential Revision: https://reviews.llvm.org/D30162

llvm-svn: 295655
2017-02-20 12:05:07 +00:00
Tim Northover
c7b0714476 GlobalISel: verify that generic loads & stores have a mem operand.
The mem operand is used by GlobalISel to convey atomic constraints so dropping
it is invalid.

llvm-svn: 295476
2017-02-17 18:50:15 +00:00
Sanjay Patel
dad1c87604 [ARM] add tests for select-of-constants; NFC
llvm-svn: 295459
2017-02-17 16:34:13 +00:00
Diana Picus
e302afcbd6 [ARM] GlobalISel: Use Subtarget in Legalizer
Start using the Subtarget to make decisions about what's legal. In particular,
we only mark floating point operations as legal if we have VFP2, which is
something we should've done from the very start.

llvm-svn: 295439
2017-02-17 11:25:17 +00:00
Diana Picus
58c63023ca [ARM] GlobalISel: Add end-to-end tests for double
Test some really basic functionality through the whole GlobalISel pipeline.

llvm-svn: 295438
2017-02-17 11:25:11 +00:00
Artur Pilipenko
17b23dc26c [DAGCombiner] Support {a|s}ext, {a|z|s}ext load nodes in load combine
Resubmit -r295314 with PowerPC and AMDGPU tests updated.

Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patters.

Reviewed By: filcab

Differential Revision: https://reviews.llvm.org/D29591

llvm-svn: 295336
2017-02-16 17:07:27 +00:00
Diana Picus
10445ac029 [ARM] GlobalISel: Select floating point loads
llvm-svn: 295321
2017-02-16 14:10:50 +00:00
Artur Pilipenko
310ec69c66 Rever -r295314 "[DAGCombiner] Support {a|s}ext, {a|z|s}ext load nodes in load combine"
This change causes some of AMDGPU and PowerPC tests to fail.

llvm-svn: 295316
2017-02-16 13:04:46 +00:00
Artur Pilipenko
409c061e49 [DAGCombiner] Support {a|s}ext, {a|z|s}ext load nodes in load combine
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patters.

Reviewed By: filcab

Differential Revision: https://reviews.llvm.org/D29591

llvm-svn: 295314
2017-02-16 12:53:26 +00:00
Diana Picus
e647a7a202 [ARM] GlobalISel: Select G_SEQUENCE and G_EXTRACT
Since they're only used for passing around double precision floating point
values into the general purpose registers, we'll lower them to VMOVDRR and
VMOVRRD.

llvm-svn: 295310
2017-02-16 12:19:57 +00:00
Diana Picus
ad8a798a52 [ARM] GlobalISel: Select double G_FADD and copies
Just use VADDD if available, bail out if not.

llvm-svn: 295309
2017-02-16 12:19:52 +00:00
Diana Picus
e6063ce05b [ARM] GlobalISel: Assert that we don't use the FPR bank if we don't have VFP
llvm-svn: 295308
2017-02-16 11:25:09 +00:00
Diana Picus
bf4b3c8bca [ARM] GlobalISel: Add reg bank mappings for G_SEQUENCE and G_EXTRACT
Support G_SEQUENCE and G_EXTRACT as needed for passing double precision floating
point values in the soft-fp float mode.

llvm-svn: 295306
2017-02-16 11:00:31 +00:00
Diana Picus
2eba84f7a1 [ARM] GlobalISel: Make the FPR bank 64-bit wide
Also add mappings for single and double precision FP, and use them for G_FADD
and G_LOAD.

llvm-svn: 295302
2017-02-16 10:12:49 +00:00
Diana Picus
dc8724964f [ARM] GlobalISel: Legalize 64-bit G_FADD and G_LOAD
For now we just mark them as legal all the time and let the other passes bail
out if they can't handle it. In the future, we'll want to move more of the
brains into the legalizer.

llvm-svn: 295300
2017-02-16 09:09:49 +00:00
Diana Picus
070c98b2bf [ARM] GlobalISel: Lower double precision FP args
For the hard float calling convention, we just use the D registers.

For the soft-fp calling convention, we use the R registers and move values
to/from the D registers by means of G_SEQUENCE/G_EXTRACT. While doing so, we
make sure to honor the endianness of the target, since the CCAssignFn doesn't do
that for us.

For pure soft float targets, we still bail out because we don't support the
libcalls yet.

llvm-svn: 295295
2017-02-16 07:53:07 +00:00
Kyle Butt
96c1e7e4f0 Codegen: Make chains from trellis-shaped CFGs
Lay out trellis-shaped CFGs optimally.
A trellis of the shape below:

  A     B
  |\   /|
  | \ / |
  |  X  |
  | / \ |
  |/   \|
  C     D

would be laid out A; B->C ; D by the current layout algorithm. Now we identify
trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an
increasing number of predecessors. A trellis is a a group of 2 or more
predecessor blocks that all have the same successors.

because of this we can tail duplicate to extend existing trellises.

As an example consider the following CFG:

    B   D   F   H
   / \ / \ / \ / \
  A---C---E---G---Ret

Where A,C,E,G are all small (Currently 2 instructions).

The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret.

The current code will copy C into B, E into D and G into F and yield the layout
A,C,B(C),E,D(E),F(G),G,H,ret

define void @straight_test(i32 %tag) {
entry:
  br label %test1
test1: ; A
  %tagbit1 = and i32 %tag, 1
  %tagbit1eq0 = icmp eq i32 %tagbit1, 0
  br i1 %tagbit1eq0, label %test2, label %optional1
optional1: ; B
  call void @a()
  br label %test2
test2: ; C
  %tagbit2 = and i32 %tag, 2
  %tagbit2eq0 = icmp eq i32 %tagbit2, 0
  br i1 %tagbit2eq0, label %test3, label %optional2
optional2: ; D
  call void @b()
  br label %test3
test3: ; E
  %tagbit3 = and i32 %tag, 4
  %tagbit3eq0 = icmp eq i32 %tagbit3, 0
  br i1 %tagbit3eq0, label %test4, label %optional3
optional3: ; F
  call void @c()
  br label %test4
test4: ; G
  %tagbit4 = and i32 %tag, 8
  %tagbit4eq0 = icmp eq i32 %tagbit4, 0
  br i1 %tagbit4eq0, label %exit, label %optional4
optional4: ; H
  call void @d()
  br label %exit
exit:
  ret void
}

here is the layout after D27742:
straight_test:                          # @straight_test
; ... Prologue elided
; BB#0:                                 # %entry ; A (merged with test1)
; ... More prologue elided
	mr 30, 3
	andi. 3, 30, 1
	bc 12, 1, .LBB0_2
; BB#1:                                 # %test2 ; C
	rlwinm. 3, 30, 0, 30, 30
	beq	 0, .LBB0_3
	b .LBB0_4
.LBB0_2:                                # %optional1 ; B (copy of C)
	bl a
	nop
	rlwinm. 3, 30, 0, 30, 30
	bne	 0, .LBB0_4
.LBB0_3:                                # %test3 ; E
	rlwinm. 3, 30, 0, 29, 29
	beq	 0, .LBB0_5
	b .LBB0_6
.LBB0_4:                                # %optional2 ; D (copy of E)
	bl b
	nop
	rlwinm. 3, 30, 0, 29, 29
	bne	 0, .LBB0_6
.LBB0_5:                                # %test4 ; G
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
	b .LBB0_7
.LBB0_6:                                # %optional3 ; F (copy of G)
	bl c
	nop
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
	bl d
	nop
.LBB0_8:                                # %exit ; Ret
	ld 30, 96(1)                    # 8-byte Folded Reload
	addi 1, 1, 112
	ld 0, 16(1)
	mtlr 0
	blr

The tail-duplication has produced some benefit, but it has also produced a
trellis which is not laid out optimally. With this patch, we improve the layouts
of such trellises, and decrease the cost calculation for tail-duplication
accordingly.

This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have
back edges, which is a negative, but it has a bigger compensating
positive, which is that it handles the case where there are long strings
of skipped blocks much better than the original layout. Both layouts
handle runs of executed blocks equally well. Branch prediction also
improves if there is any correlation between subsequent optional blocks.

Here is the resulting concrete layout:

straight_test:                          # @straight_test
; BB#0:                                 # %entry ; A (merged with test1)
	mr 30, 3
	andi. 3, 30, 1
	bc 12, 1, .LBB0_4
; BB#1:                                 # %test2 ; C
	rlwinm. 3, 30, 0, 30, 30
	bne	 0, .LBB0_5
.LBB0_2:                                # %test3 ; E
	rlwinm. 3, 30, 0, 29, 29
	bne	 0, .LBB0_6
.LBB0_3:                                # %test4 ; G
	rlwinm. 3, 30, 0, 28, 28
	bne	 0, .LBB0_7
	b .LBB0_8
.LBB0_4:                                # %optional1 ; B (Copy of C)
	bl a
	nop
	rlwinm. 3, 30, 0, 30, 30
	beq	 0, .LBB0_2
.LBB0_5:                                # %optional2 ; D (Copy of E)
	bl b
	nop
	rlwinm. 3, 30, 0, 29, 29
	beq	 0, .LBB0_3
.LBB0_6:                                # %optional3 ; F (Copy of G)
	bl c
	nop
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
	bl d
	nop
.LBB0_8:                                # %exit

Differential Revision: https://reviews.llvm.org/D28522

llvm-svn: 295223
2017-02-15 19:49:14 +00:00
Reid Kleckner
96b6dea648 [BranchFolding] Tail common all identical unreachable blocks
Summary:
Blocks ending in unreachable are typically cold because they end the
program or throw an exception, so merging them with other identical
blocks is usually profitable because it reduces the size of cold code.
MachineBlockPlacement generally does not arrange to fall through to such
blocks, so commoning these blocks will not introduce additional
unconditional branches.

Reviewers: hans, iteratee, haicheng

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D29153

llvm-svn: 295105
2017-02-14 21:02:24 +00:00
Arnold Schwaighofer
51dc444272 swiftcc: Don't emit tail calls from callers with swifterror parameters
Backends don't support this yet. They would have to move to the swifterror
register before the tail call to make sure it is live-in to the call.

rdar://30495920

llvm-svn: 294982
2017-02-13 19:58:28 +00:00
James Molloy
29f53382f7 [ARM] Fix crash caused by r294945
I'd missed a creator of FCMP nodes - duplicateCmp().

Kindly and promptly reported by Gabor Ballabas, due to his CSiBE test suite.

llvm-svn: 294968
2017-02-13 17:18:00 +00:00
Sanne Wouda
4ef730d369 [CodeGen] fix alignment of JUMPTABLE_INSTS on v8M.base
Summary:
The attached test case fails with "fatal error: error in backend:
misaligned pc-relative fixup value" as the jump table is misaligned.
The EmitAlignment existed already for ARM and Thumb-1 code, but was
missing for Thumb-2.

The test checks that the fatal error disappears when generating an obj
file, as well as checking the align directive is there when producing an
asm file.


Reviewers: rengolin, grosbach, t.p.northover, jmolloy, SjoerdMeijer, samparker

Reviewed By: samparker

Subscribers: samparker, aemerson, llvm-commits

Differential Revision: https://reviews.llvm.org/D29650

llvm-svn: 294950
2017-02-13 14:07:45 +00:00
James Molloy
a7843cefbb [ARM] Use VCMP, not VCMPE, for floating point equality comparisons
When generating a floating point comparison we currently unconditionally
generate VCMPE. This has the sideeffect of setting the cumulative Invalid
bit in FPSCR if any of the operands are QNaN.

It is expected that use of a relational predicate on a QNaN value should
raise Invalid. Quoting from the C standard:

  The relational and equality operators support the usual mathematical
  relationships between numeric values. For any ordered pair of numeric
  values exactly one of relationships the less, greater, equal and is true.
  Relational operators may raise the floating-point exception when argument
  values are NaNs.

The standard doesn't explicitly state the expectation for equality operators,
but the implication and obvious expectation is that equality operators
should not raise Invalid on a QNaN input, as those predicates are wholly
defined on unordered inputs (to return not equal).

Therefore, add a new operand to ARMISD::FPCMP and FPCMPZ indicating if
QNaN should raise Invalid, and pipe that through to TableGen.

llvm-svn: 294945
2017-02-13 12:32:47 +00:00
John Brawn
b7fc3ce6ba [ARM] Fix incorrect mask bits in MSR encoding for write_register intrinsic
In the encoding of system registers in the M-class MSR instruction the mask bits
should be 2 for registers that don't take a _<bits> qualifier (the instruction
is unpredictable otherwise), and should also be 2 if the register takes a
_<bits> qualifier but it's not present as no _<bits> is an alias for _nzcvq.

Differential Revision: https://reviews.llvm.org/D29828

llvm-svn: 294762
2017-02-10 17:41:08 +00:00
George Burgess IV
b4e5bc94a3 [ARM] Add support for armv7ve triple in llvm (PR31358).
Gcc supports target armv7ve which is armv7-a with virtualization
extensions. This change adds support for this in llvm for gcc
compatibility.

Also remove redundant FeatureHWDiv, FeatureHWDivARM for a few models as
this is specified automatically by FeatureVirtualization.

Patch by Manoj Gupta.

Differential Revision: https://reviews.llvm.org/D29472

llvm-svn: 294661
2017-02-09 23:29:14 +00:00
Artur Pilipenko
6283107573 Add DAGCombiner load combine tests for partially available values
If some of the trailing or leading bytes of a load combine pattern are zeroes we can combine the pattern to a load + zext and shift. Currently we don't support it, so the tests check the current codegen without load combine. This change will make the patch to support this kind of combine a bit more clear.

llvm-svn: 294591
2017-02-09 15:13:40 +00:00
Diana Picus
cd0eb4d32f [ARM] GlobalISel: Lower single precision FP args
Both for aapcscc and aapcs_vfpcc. We currently filter out soft float targets
because we don't support libcalls yet.

llvm-svn: 294584
2017-02-09 13:09:59 +00:00
Artur Pilipenko
f15639bed9 [DAGCombiner] Support non-zero offset in load combine
Enable folding patterns which load the value from non-zero offset:

  i8 *a = ...
  i32 val = a[4] | (a[5] << 8) | (a[6] << 16) | (a[7] << 24)
=>
  i32 val = *((i32*)(a+4))

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D29394

llvm-svn: 294582
2017-02-09 12:06:01 +00:00
Arnold Schwaighofer
bd5edcffec SwiftCC: swifterror register cannot be as the base register
Functions that have a dynamic alloca require a base register which is defined to
be X19 on AArch64 and r6 on ARM.  We have defined the swifterror register to be
the same register. Use a different callee save register for swifterror instead:

 X21 on AArch64
 R8 on ARM

rdar://30433803

llvm-svn: 294551
2017-02-09 01:52:17 +00:00
Arnold Schwaighofer
133aefc634 [ARM/AArch ISel] SwiftCC: First parameters that are marked swiftself are not 'this returns'
We mark X0 as preserved by a call that passes the returned parameter.

 x0 = ...
 fun(x0) // no implicit def of x0

This no longer is valid if we pass the parameter in a different register then
the returned value as is the case with a swiftself parameter (passed in x20).

x20 = ...
fun(x20) // there should be an implict def of x8

rdar://30425845

llvm-svn: 294527
2017-02-08 22:30:47 +00:00
Diana Picus
7af55bd9c2 Fix test to work on swift/cyclone too
I forgot to remove the neonfp target feature from the test, which means we'd
have trouble selecting VADDS on targets that have neonfp enabled by default.

llvm-svn: 294451
2017-02-08 14:23:30 +00:00
Diana Picus
3daf58db54 [ARM] GlobalISel: Add FPR reg bank
Add a register bank for floating point values and select simple instructions
using them (add, copies from GPR).

This assumes that the hardware can cope with a single precision add (VADDS)
instruction, so the legalizer will treat G_FADD as legal and the instruction
selector will refuse to select if the hardware doesn't support it. In the future
we'll want to be more careful about this, and legalize to libcalls if we have to
use soft float.

llvm-svn: 294442
2017-02-08 13:23:04 +00:00
Artur Pilipenko
f92edffe20 Add DAGCombiner load combine tests for {a|s}ext, {a|z|s}ext load nodes
Currently we don't support these nodes, so the tests check the current codegen without load combine. This change makes the review of the change to support these nodes more clear.

Separated from https://reviews.llvm.org/D29591 review.

llvm-svn: 294305
2017-02-07 14:09:37 +00:00
Christof Douma
d2bda2ce24 [ARM] Make RWPI use movw/movt when available
When constructing global address literals while targeting the RWPI
relocation model. LLVM currently only uses literal pools. If MOVW/MOVT
instructions are available we can use these instead. Beside being more
efficient it allows -arm-execute-only to work with
-relocation-model=RWPI as well.

When we generate MOVW/MOVT for global addresses when targeting the RWPI
relocation model, we need to use base relative relocations. This patch
does the needed plumbing in MC to generate these for MOVW/MOVT.

Differential Revision: https://reviews.llvm.org/D29487

Change-Id: I446786e43a6f5aa9b6a5bb2cd216d60d41c7755d
llvm-svn: 294298
2017-02-07 13:07:12 +00:00
Artur Pilipenko
45810c9c2e [DAGCombiner] Support bswap as a part of load combine patterns
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D29397

llvm-svn: 294201
2017-02-06 17:48:08 +00:00
Artur Pilipenko
02f0bfdf45 Add DAGCombiner load combine tests with non-zero offset
This is separated from https://reviews.llvm.org/D29394 review.

llvm-svn: 294185
2017-02-06 14:15:31 +00:00
Matthias Braun
95a43d873a MachineCopyPropagation: Respect implicit operands of COPY
The code missed to check implicit operands of COPY instructions for
defs/uses.

Differential Revision: https://reviews.llvm.org/D29522

llvm-svn: 294088
2017-02-04 02:27:20 +00:00
Sanne Wouda
af912f2b15 [ARM] Change TCReturn to tBL if tailcall optimization fails.
Summary:
The tail call optimisation is performed before register allocation, so
at that point we don't know if LR is being spilt or not. If LR was spilt
to the stack, then we cannot do a tail call optimisation. That would
involve popping back into LR which is not possible in Thumb1 code.

Reviewers: rengolin, jmolloy, rovka, olista01

Reviewed By: olista01

Subscribers: llvm-commits, aemerson

Differential Revision: https://reviews.llvm.org/D29020

llvm-svn: 294000
2017-02-03 11:15:53 +00:00
Sanne Wouda
ded29c4b12 [LLC] Add an inline assembly diagnostics handler.
Summary:
llc would hit a fatal error for errors in inline assembly. The
diagnostics message is now printed.

Reviewers: rengolin, MatzeB, javed.absar, anemet

Reviewed By: anemet

Subscribers: jyknight, nemanjai, llvm-commits

Differential Revision: https://reviews.llvm.org/D29408

llvm-svn: 293999
2017-02-03 11:14:39 +00:00
Javed Absar
59c3e0d946 [ARM] Classification Improvements to ARM Sched-Model. NFCI.
This is the second in the series of patches to enable adding
of machine sched-models for ARM processors easier and compact.
This patch focuses on integer instructions and adds missing
sched definitions.

Reviewers: rovka, rengolin
Differential Revision: https://reviews.llvm.org/D29127

llvm-svn: 293935
2017-02-02 21:08:12 +00:00
Nirav Dave
63300d8c5e Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r293893 which is miscompiling lua on ARM and
bootstrapping for x86-windows.

llvm-svn: 293915
2017-02-02 18:24:55 +00:00
Nirav Dave
d4909b474b In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting after fixing X86 inc/dec chain bug.

    * Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 293893
2017-02-02 14:39:42 +00:00
Diana Picus
296902f902 [ARM] GlobalISel: Lower pointer args and returns
It is important to change the ArgInfo's type from pointer to integer, otherwise
the CC assign function won't know what to do. Instead of hacking it up, we use
ComputeValueVTs and introduce some of the helpers that we will need later on for
lowering more complex types.

llvm-svn: 293889
2017-02-02 14:01:00 +00:00
Diana Picus
c9fa60e375 [ARM] GlobalISel: Legalize loading pointers
Make it legal to load pointer values. Also check that pointers are assigned
to the GPR reg bank by default.

llvm-svn: 293886
2017-02-02 13:20:49 +00:00
Diana Picus
3e72318194 [ARM] GlobalISel: Test default banks for load results. NFC.
Check that all scalars are loaded into the GPR by default.

llvm-svn: 293883
2017-02-02 13:00:24 +00:00
Javed Absar
4fcf0cfd20 [ARM] Enable Cortex-M23 and Cortex-M33 support.
Add both cores to the target parser and TableGen. Test that eabi
attributes are set correctly for both cores. Additionally, test the
absence and presence of MOVT in Cortex-M23 and Cortex-M33, respectively.

Committed on behalf of Sanne Wouda.
Reviewers : rengolin, olista01.

Differential Revision: https://reviews.llvm.org/D29073

llvm-svn: 293761
2017-02-01 11:55:03 +00:00
Kyle Butt
9386601a22 CodeGen: Allow small copyable blocks to "break" the CFG.
When choosing the best successor for a block, ordinarily we would have preferred
a block that preserves the CFG unless there is a strong probability the other
direction. For small blocks that can be duplicated we now skip that requirement
as well, subject to some simple frequency calculations.

Differential Revision: https://reviews.llvm.org/D28583

llvm-svn: 293716
2017-01-31 23:48:32 +00:00
Sam Parker
12536f21d1 [ARM] Avoid using ARM instructions in Thumb mode
The Requires class overrides the target requirements of an instruction,
rather than adding to them, so all ARM instructions need to include the
IsARM predicate when they have overwitten requirements.

This caused the swp and swpb instructions to be allowed in thumb mode
assembly, and the ARM encoding of CDP to be selected in codegen (which
is different for conditional instructions).

Differential Revision: https://reviews.llvm.org/D29283

llvm-svn: 293634
2017-01-31 14:35:01 +00:00
Matt Arsenault
c4ccc9b791 DAG: Constant fold fp16_to_fp/fp16_to_fp
This fixes emitting conversions of constants on targets
without legal f16 that need to use these for legalization.

llvm-svn: 293499
2017-01-30 16:57:41 +00:00
Saleem Abdulrasool
f4a03e0903 ARM: support -mlong-calls with AEABI TLS on ELF
Support lowering AEABI TLS access (__aeabi_read_tp) with long calls.
This requires adjusting the call sequence to use an indirect call to get
full addressability.

Resolves PR31769!

llvm-svn: 293433
2017-01-29 16:46:22 +00:00
Matthew Simpson
401964cda6 [ARM/AArch64] Relocate and update InterleavedAccessPass tests (NFC)
The interleaved access pass is an IR-to-IR transformation that runs before code
generation. It matches interleaved memory operations to target-specific
intrinsics (that are later lowered to load and store multiple instructions on
ARM/AArch64). We place tests for similar passes (e.g., GlobalMergePass) under
test/Transforms. This patch moves the InterleavedAccessPass tests out of
test/CodeGen and into target-specific directories under
test/Transforms/InterleavedAccess.

Although the pass is an IR pass, many of the existing tests were llc tests
rather opt tests. For example, the tests would check for ldN/stN instructions
generated by llc rather than the intrinsic calls the pass actually inserts.
Thus, this patch updates all tests to be opt tests that check for the inserted
intrinsics. We already have separate CodeGen tests that ensure we lower the
interleaved access intrinsics to their corresponding ldN/stN instructions. In
addition to migrating the tests to opt, this patch also performs some minor
clean-up (to ensure consistent naming, etc.).

Differential Revision: https://reviews.llvm.org/D29184

llvm-svn: 293309
2017-01-27 17:33:16 +00:00
Saleem Abdulrasool
da5a339f73 ARM: fix vectorized division on WoA
The Windows on ARM target uses custom division for normal division as
the backend needs to insert division-by-zero checks.  However, it is
designed to only handle non-vectorized division.  ARM has custom
lowering for vectorized division as that can avoid loading registers
with the values and invoke a division routine for each one, preferring
to lower using NEON instructions.  Fall back to the custom lowering for
the NEON instructions if we encounter a vectorized division.

Resolves PR31778!

llvm-svn: 293259
2017-01-27 03:41:53 +00:00
Nirav Dave
2a565d7a4e Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r293184 which is failing in LTO builds

llvm-svn: 293188
2017-01-26 16:46:13 +00:00
Nirav Dave
c7f26fe4ae In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
* Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 293184
2017-01-26 16:02:24 +00:00
Diana Picus
b8228ddf1c [ARM] GlobalISel: Load i1, i8 and i16 args from stack
Add support for loading i1, i8 and i16 arguments from the stack, with or without
the ABI extension flags.

When the ABI extension flags are present, we load a 4-byte value, otherwise we
preserve the size of the load and let the instruction selector replace it with a
LDRB/LDRH. This generates the same thing as DAGISel.

Differential Revision: https://reviews.llvm.org/D27803

llvm-svn: 293163
2017-01-26 09:20:47 +00:00
Tim Northover
e4011c071b SDag: fix how initial loads are formed when splitting vector ops.
Later code expects the vector loads produced to be directly
concatenable, which means we shouldn't pad anything except the last load
produced with UNDEF.

llvm-svn: 293088
2017-01-25 20:58:26 +00:00
Artur Pilipenko
0e6418640a [DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2.
The previous patch (https://reviews.llvm.org/rL289538) got reverted because of a bug. Chandler also requested some changes to the algorithm.
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161212/413479.html

This is an updated patch. The key difference is that collectBitProviders (renamed to calculateByteProvider) now collects the origin of one byte, not the whole value. It simplifies the implementation and allows to stop the traversal earlier if we know that the result won't be used.

From the original commit:

Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.

Assuming little endian target:
  i8 *a = ...
  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
  i32 val = *((i32)a)

  i8 *a = ...
  i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
  i32 val = BSWAP(*((i32)a))

This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.

Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
  i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

The general scheme is to match OR expressions by recursively calculating the origin of individual bytes which constitute the resulting OR value. If all the OR bytes come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.

Reviewed By: RKSimon, filcab, chandlerc 

Differential Revision: https://reviews.llvm.org/D27861

llvm-svn: 293036
2017-01-25 08:53:31 +00:00
Diana Picus
3e1c91b8c7 [ARM] GlobalISel: Support i1 add and ABI extensions
Add support for:
* i1 add
* i1 function arguments, if passed through registers
* i1 returns, with ABI signext/zeroext

Differential Revision: https://reviews.llvm.org/D27706

llvm-svn: 293035
2017-01-25 08:47:40 +00:00
Diana Picus
216544ff2f [ARM] GlobalISel: Support i8/i16 ABI extensions
At the moment, this means supporting the signext/zeroext attribute on the return
type of the function. For function arguments, signext/zeroext should be handled
by the caller, so there's nothing for us to do until we start lowering calls.

Note that this does not include support for other extensions (i8 to i16), those
will be added later.

Differential Revision: https://reviews.llvm.org/D27705

llvm-svn: 293034
2017-01-25 08:10:40 +00:00
Javed Absar
97ef1a63f1 [ARM] Classification Improvements to ARM Sched-Models. NFCI.
This is a series of patches to enable adding of machine sched
models for ARM processors easier and compact. They define new
sched-readwrites for groups of ARM instructions. This has been
missing so far, and as a consequence, machine scheduler models
for individual sub-targets have tended to be larger than they
needed to be. 

The current patch focuses on floating-point instructions.

Reviewers: Diana Picus (rovka), Renato Golin (rengolin)

Differential Revision: https://reviews.llvm.org/D28194

llvm-svn: 292825
2017-01-23 20:20:39 +00:00
Sjoerd Meijer
b8f5ff7071 [Thumb] Add support for tMUL in the compare instruction peephole optimizer.
We also want to optimise tests like this: return a*b == 0.  The MULS
instruction is flag setting, so we don't need the CMP instruction but can
instead branch on the result of the MULS. The generated instructions sequence
for this example was: MULS, MOVS, MOVS, CMP. The MOVS instruction load the
boolean values resulting from the select instruction, but these MOVS
instructions are flag setting and were thus preventing this optimisation. Now
we first reorder and move the MULS to before the CMP and generate sequence
MOVS, MOVS, MULS, CMP so that the optimisation could trigger. Reordering of the
MULS and MOVS is safe to do because the subsequent MOVS instructions just set
the CPSR register and don't use it, i.e. the CPSR is dead.

Differential Revision: https://reviews.llvm.org/D27990

llvm-svn: 292608
2017-01-20 13:10:12 +00:00
Serge Rogatch
1f08944e95 [XRay][Arm] Repair XRay table emission on Arm32 and add tests to identify such problem earlier
Summary:
Emission of XRay table was occasionally disabled for Arm32, but this bug was not then detected because earlier (also by mistake) testing of XRay was occasionally disabled on 32-bit Arm targets. This patch should fix that problem and detect such problems in the future.
This patch is one of a series, see also
- https://reviews.llvm.org/D28623

Reviewers: rengolin, dberris

Reviewed By: dberris

Subscribers: llvm-commits, aemerson, rengolin, dberris, iid_iunknown

Differential Revision: https://reviews.llvm.org/D28624

llvm-svn: 292516
2017-01-19 20:24:23 +00:00
Renato Golin
6772a440c4 Revert "[XRay][Arm] Repair XRay table emission on Arm32 and add tests to identify such problem earlier"
This reverts commit r292210, as it broke the Thumb buldbot with:

clang-5.0: error: the clang compiler does not support '-fxray-instrument
on thumbv7-unknown-linux-gnueabihf'.

llvm-svn: 292357
2017-01-18 09:08:43 +00:00
Serge Rogatch
8364a160c3 [XRay][Arm] Repair XRay table emission on Arm32 and add tests to identify such problem earlier
Summary:
Emission of XRay table was occasionally disabled for Arm32, but this bug was not then detected because earlier (also by mistake) testing of XRay was occasionally disabled on 32-bit Arm targets. This patch should fix that problem and detect such problems in the future.
This patch is one of a series, see also
- https://reviews.llvm.org/D28623

Reviewers: rengolin, dberris

Reviewed By: dberris

Subscribers: llvm-commits, aemerson, rengolin, dberris, iid_iunknown

Differential Revision: https://reviews.llvm.org/D28624

llvm-svn: 292210
2017-01-17 11:52:10 +00:00
Simon Pilgrim
a8edcf036b [SelectionDAG] Add support for BITREVERSE constant folding
We were relying on constant folding of the legalized instructions to do what constant folding we had previously

llvm-svn: 292114
2017-01-16 13:39:00 +00:00
Kyle Butt
6035a9f12a Revert "CodeGen: Allow small copyable blocks to "break" the CFG."
This reverts commit ada6595a526d71df04988eb0a4b4fe84df398ded.

This needs a simple probability check because there are some cases where it is
not profitable.

llvm-svn: 291695
2017-01-11 19:55:19 +00:00
Eli Friedman
854b4056c4 [ARM] More aggressive matching for vpadd and vpaddl.
The new matchers work after legalization to make them simpler, and to avoid
blocking other optimizations.

Differential Revision: https://reviews.llvm.org/D27779

llvm-svn: 291693
2017-01-11 19:33:38 +00:00
Evgeny Astigeevich
7ec3f26819 [ARM] Fix test CodeGen/ARM/fpcmp_ueq.ll broken by rL290616
Commit rL290616 (https://reviews.llvm.org/rL290616) changed a checking command
for the triple arm-apple-darwin in LLVM::CodeGen/ARM/fpcmp_ueq.ll. As a result
of the changes the test could fail for the valid generated code.

These changes fixes the test to check only instructions we would expect.

Differential Revision: https://reviews.llvm.org/D28159

llvm-svn: 291678
2017-01-11 16:23:28 +00:00
Kyle Butt
af32417840 CodeGen: Allow small copyable blocks to "break" the CFG.
When choosing the best successor for a block, ordinarily we would have preferred
a block that preserves the CFG unless there is a strong probability the other
direction. For small blocks that can be duplicated we now skip that requirement
as well.

Differential revision: https://reviews.llvm.org/D27742

llvm-svn: 291609
2017-01-10 23:04:30 +00:00
Matt Arsenault
f633d550ca DAG: Avoid OOB when legalizing vector indexing
If a vector index is out of bounds, the result is supposed to be
undefined but is not undefined behavior. Change the legalization
for indexing the vector on the stack so that an out of bounds
index does not create an out of bounds memory access.

llvm-svn: 291604
2017-01-10 22:02:30 +00:00
Joerg Sonnenberger
5f19da4f7e Emit .cfi_sections before the first .cfi_startproc
GNU as rejects input where .cfi_sections is used after .cfi_startproc,
if the new section differs from the old. Adjust our output to always
emit .cfi_sections before the first .cfi_startproc to minimize necessary
code.

Differential Revision: https://reviews.llvm.org/D28011

llvm-svn: 290817
2017-01-02 18:05:27 +00:00
Saleem Abdulrasool
33b04261ab test: modernise ARM CodeGen tests
Replace the use of grep with FileCheck.  Tidy up some of the tests.  A
few of the tests have been left as weak as previously, though some have
been made more stringent.

llvm-svn: 290616
2016-12-27 18:35:19 +00:00
Zijiao Ma
eef813bb47 Make the canonicalisation on shifts benifit to more case.
1.Fix pessimized case in FIXME.
2.Add tests for it.
3.The canonicalisation on shifts results in different sequence for
  tests of machine-licm.Correct some check lines.

Differential Revision: https://reviews.llvm.org/D27916

llvm-svn: 290410
2016-12-23 02:56:07 +00:00
Adrian Prantl
1f3fc31b75 Renumber testcase metadata nodes after r290153.
This patch renumbers the metadata nodes in debug info testcases after
https://reviews.llvm.org/D26769. This is a separate patch because it
causes so much churn. This was implemented with a python script that
pipes the testcases through llvm-as - | llvm-dis - and then goes
through the original and new output side-by side to insert all
comments at a close-enough location.

Differential Revision: https://reviews.llvm.org/D27765

llvm-svn: 290292
2016-12-22 00:45:21 +00:00
Adrian Prantl
073dce6100 Legalize metadata in legacy testcases
llvm-svn: 290285
2016-12-21 23:28:49 +00:00
Eli Friedman
6cf25f5feb [ARM] Implement isExtractSubvectorCheap.
See https://reviews.llvm.org/D6678 for the history of
isExtractSubvectorCheap. Essentially the same considerations apply
to ARM.

This temporarily breaks the formation of vpadd/vpaddl in certain cases;
AddCombineToVPADDL essentially assumes that we won't form VUZP shuffles.
See https://reviews.llvm.org/D27779 for followup fix.

Differential Revision: https://reviews.llvm.org/D27774

llvm-svn: 290198
2016-12-20 20:05:07 +00:00
Eli Friedman
5ff1330194 [ARM] Generate checks for shuffle tests using update_llc_test_checks.py.
llvm-svn: 290196
2016-12-20 19:33:24 +00:00
Adrian Prantl
cf7e846d11 [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

This reapplies r289902 with additional testcase upgrades and a change
to the Bitcode record for DIGlobalVariable, that makes upgrading the
old format unambiguous also for variables without DIExpressions.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 290153
2016-12-20 02:09:43 +00:00
Eli Friedman
11b33a0cd7 Add ARM support to update_llc_test_checks.py
Just the minimal support to get it working at the moment.

Includes checks for test/CodeGen/ARM/vzip.ll as an example.

Differential Revision: https://reviews.llvm.org/D27829

llvm-svn: 290144
2016-12-19 23:09:51 +00:00
Diana Picus
c595261d24 [ARM] GlobalISel: Add more checks to test
llvm-svn: 290108
2016-12-19 14:08:11 +00:00
Diana Picus
a66e97f1fc [ARM] GlobalISel: Minor style fixup in test
llvm-svn: 290107
2016-12-19 14:08:06 +00:00
Diana Picus
008160ace8 [ARM] GlobalISel: Lower i8 and i16 register args
This allows lowering i8 and i16 arguments if they can fit in the registers. Note
that the lowering is incomplete - ABI extensions are handled in a subsequent
patch.

(Last part of)
Differential Revision: https://reviews.llvm.org/D27704

llvm-svn: 290106
2016-12-19 14:08:02 +00:00
Diana Picus
4d7a9b8358 [ARM] GlobalISel: Allow i8 and i16 adds
Teach the instruction selector and legalizer that it's ok to have adds with 8 or
16-bit integers.

This is the second part of https://reviews.llvm.org/D27704

llvm-svn: 290105
2016-12-19 14:07:56 +00:00
Diana Picus
add856a854 [ARM] GlobalISel: Select i8 and i16 copies
Teach the instruction selector that it's ok to copy small values from physical
registers.

First part of https://reviews.llvm.org/D27704

llvm-svn: 290104
2016-12-19 14:07:50 +00:00
Diana Picus
1019c2202a [ARM] GlobalISel: Lower more than 4 arguments
This adds support for lowering more than 4 arguments (although still i32 only).
It uses the handleAssignments / ValueHandler infrastructure extracted from
the AArch64 backend in r288658.

Differential Revision: https://reviews.llvm.org/D27195

llvm-svn: 290098
2016-12-19 11:55:41 +00:00
Diana Picus
9037c7b7ae [ARM] GlobalISel: Support loading from the stack
Add support for selecting simple G_LOAD and G_FRAME_INDEX instructions (32-bit
scalars only). This will be useful for functions that need to pass arguments on
the stack.

First part of https://reviews.llvm.org/D27195.

llvm-svn: 290096
2016-12-19 11:26:31 +00:00
Matthias Braun
cf2df3b1a5 Move test to correct directory
See also test/CodeGen/MIR/README

llvm-svn: 290032
2016-12-17 02:16:26 +00:00
Adrian Prantl
0ab6669d6d Revert "[IR] Remove the DIExpression field from DIGlobalVariable."
This reverts commit 289920 (again).
I forgot to implement a Bitcode upgrade for the case where a DIGlobalVariable
has not DIExpression. Unfortunately it is not possible to safely upgrade
these variables without adding a flag to the bitcode record indicating which
version they are.
My plan of record is to roll the planned follow-up patch that adds a
unit: field to DIGlobalVariable into this patch before recomitting.
This way we only need one Bitcode upgrade for both changes (with a
version flag in the bitcode record to safely distinguish the record
formats).

Sorry for the churn!

llvm-svn: 289982
2016-12-16 19:39:01 +00:00
Eli Friedman
09b663436a [ARM] Add ARMISD::VLD1DUP to match vld1_dup more consistently.
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.

Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).

Recommiting with fix to avoid forming vld1dup if the type of the load
doesn't match the type of the vdup (see
https://llvm.org/bugs/show_bug.cgi?id=31404).

Differential Revision: https://reviews.llvm.org/D27694

llvm-svn: 289972
2016-12-16 18:44:08 +00:00
Diana Picus
a8212f0f94 [ARM] GlobalISel: Select add i32, i32
Add the minimal support necessary to select a function that returns the sum of
two i32 values.

This includes some support for argument/return lowering of i32 values through
registers, as well as the handling of copy and add instructions throughout the
GlobalISel pipeline.

Differential Revision: https://reviews.llvm.org/D26677

llvm-svn: 289940
2016-12-16 12:54:46 +00:00
Nico Weber
b553ce3ab5 Revert 279703, it caused PR31404.
llvm-svn: 289923
2016-12-16 04:51:25 +00:00
Adrian Prantl
2345112c5b [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

This reapplies r289902 with additional testcase upgrades.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 289920
2016-12-16 04:25:54 +00:00
Chandler Carruth
dc4dc4b073 Revert patch series introducing the DAG combine to match a load-by-bytes
idiom.

r289538: Match load by bytes idiom and fold it into a single load
r289540: Fix a buildbot failure introduced by r289538
r289545: Use more detailed assertion messages in the code ...
r289646: Add a couple of assertions to the load combine code ...

This DAG combine has a bad crash in it that is quite hard to trigger
sadly -- it relies on sneaking code with UB through the SDAG build and
into this particular combine. I've responded to the original commit with
a test case that reproduces it.

However, the code also has other problems that will require substantial
changes to address and so I'm going ahead and reverting it for now. This
should unblock us and perhaps others that are hitting the crash in the
wild and will let a fresh patch with updated approach come in cleanly
afterward.

Sorry for any trouble or disruption!

llvm-svn: 289916
2016-12-16 04:05:22 +00:00
Adrian Prantl
daf4fef1f9 Revert "[IR] Remove the DIExpression field from DIGlobalVariable."
This reverts commit 289902 while investigating bot berakage.

llvm-svn: 289906
2016-12-16 01:00:30 +00:00
Adrian Prantl
0eee52640f [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 289902
2016-12-16 00:36:43 +00:00
Eli Friedman
a6de56264a Don't combine splats with other shuffles.
We sometimes end up creating shuffles which are worse than the obvious
translation of the IR.

Fixes https://llvm.org/bugs/show_bug.cgi?id=31301 .

Differential Revision: https://reviews.llvm.org/D27793

llvm-svn: 289882
2016-12-15 22:41:40 +00:00
Eli Friedman
221fbc02dc Don't combine a shuffle of two BUILD_VECTORs with duplicate elements.
Targets can't handle this case well in general; we often transform
a shuffle of two cheap BUILD_VECTORs to element-by-element insertion,
which is very inefficient.

Fixes https://llvm.org/bugs/show_bug.cgi?id=31364 . Partially
fixes https://llvm.org/bugs/show_bug.cgi?id=31301.

Differential Revision: https://reviews.llvm.org/D27787

llvm-svn: 289874
2016-12-15 21:36:59 +00:00
Sjoerd Meijer
9792b09ed3 [Thumb] Teach ISel how to lower compares of AND bitmasks efficiently
This is essentially a recommit of r285893, but with a correctness fix. The
problem of the original commit was that this:

bic r5, r7, #31
cbz r5, .LBB2_10

got rewritten into:

lsrs  r5, r7, #5
beq .LBB2_10

The result in destination register r5 is not the same and this is incorrect
when r5 is not dead. So this fix includes checking the uses of the AND
destination register. And also, compared to the original commit, some regression
tests didn't need changing anymore because of this extra check.

For completeness, this was the original commit message:

For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more
efficient instruction selection if the bitmask is one consecutive sequence of
set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).

1) If the bitmask touches the LSB, then we can remove all the upper bits and
set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and
set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit
into the sign bit with one LSLS and change the condition query from NE/EQ to
MI/PL (we could also implement this by shifting into the carry bit and
branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower
zero bits of the mask.

1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two
16-bit instructions but can elide the CMP and doesn't require materializing a
complex immediate, so is also a win.

Differential Revision: https://reviews.llvm.org/D27761

llvm-svn: 289794
2016-12-15 09:38:59 +00:00
Prakhar Bahuguna
dc9a43dec1 [ARM] Implement execute-only support in CodeGen
This implements execute-only support for ARM code generation, which
prevents the compiler from generating data accesses to code sections.
The following changes are involved:

* Add the CodeGen option "-arm-execute-only" to the ARM code generator.
* Add the clang flag "-mexecute-only" as well as the GCC-compatible
  alias "-mpure-code" to enable this option.
* When enabled, literal pools are replaced with MOVW/MOVT instructions,
  with VMOV used in addition for floating-point literals. As the MOVT
  instruction is required, execute-only support is only available in
  Thumb mode for targets supporting ARMv8-M baseline or Thumb2.
* Jump tables are placed in data sections when in execute-only mode.
* The execute-only text section is assigned section ID 0, and is
  marked as unreadable with the SHF_ARM_PURECODE flag with symbol 'y'.
  This also overrides selection of ELF sections for globals.

llvm-svn: 289784
2016-12-15 07:59:08 +00:00
Eli Friedman
913ac3bf1a Add testcases for some shuffle bugs.
See https://llvm.org/bugs/show_bug.cgi?id=31301 and
https://llvm.org/bugs/show_bug.cgi?id=31364 .

llvm-svn: 289751
2016-12-15 01:47:15 +00:00
Eli Friedman
c9513096a1 [ARM] Split 128-bit vectors in BUILD_VECTOR lowering
Given that INSERT_VECTOR_ELT operates on D registers anyway, combining
64-bit vectors into a 128-bit vector is basically free. Therefore, try
to split BUILD_VECTOR nodes before giving up and lowering them to a series
of INSERT_VECTOR_ELT instructions. Sometimes this allows dramatically
better lowerings; see testcases for examples. Inspired by similar code
in the x86 backend for AVX.

Differential Revision: https://reviews.llvm.org/D27624

llvm-svn: 289706
2016-12-14 20:44:38 +00:00
Eli Friedman
fbd3811bc0 [ARM] Add ARMISD::VLD1DUP to match vld1_dup more consistently.
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.

Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).

Differential Revision: https://reviews.llvm.org/D27694

llvm-svn: 289703
2016-12-14 20:25:26 +00:00
Nirav Dave
9fd3ae9cf9 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
Reverting due to ARM MCJIT and MIPS LLD error.

This reverts commit r289659.

llvm-svn: 289667
2016-12-14 16:43:44 +00:00
Nirav Dave
afe2eccae3 In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Retrying after fixing after removing load-store factoring through
token factors in favor of improved token factor operand pruning

Simplify Consecutive Merge Store Candidate Search

Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.

Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).

Additional Minor Changes:

   1. Finishes removing unused AliasLoad code
   2. Unifies the the chain aggregation in the merged stores across
      code paths
   3. Re-add the Store node to the worklist after calling
      SimplifyDemandedBits.
   4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
      arbitrary, but seemed sufficient to not cause regressions in
      tests.

This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.

Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations

Noteworthy tests:

    CodeGen/AArch64/argument-blocks.ll -
      It's not entirely clear what the test_varargs_stackalign test is
      supposed to be asserting, but the new code looks right.

    CodeGen/AArch64/arm64-memset-inline.lli -
    CodeGen/AArch64/arm64-stur.ll -
    CodeGen/ARM/memset-inline.ll -

      The backend now generates *worse* code due to store merging
      succeeding, as we do do a 16-byte constant-zero store efficiently.

    CodeGen/AArch64/merge-store.ll -
      Improved, but there still seems to be an extraneous vector insert
      from an element to itself?

    CodeGen/PowerPC/ppc64-align-long-double.ll -
      Worse code emitted in this case, due to the improved store->load
      forwarding.

    CodeGen/X86/dag-merge-fast-accesses.ll -
    CodeGen/X86/MergeConsecutiveStores.ll -
    CodeGen/X86/stores-merging.ll -
    CodeGen/Mips/load-store-left-right.ll -
      Restored correct merging of non-aligned stores

    CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
      Improved. Correctly merges buffer_store_dword calls

    CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
      Improved. Sidesteps loading a stored value and
      merges two stores

    CodeGen/X86/pr18023.ll -
      This test has been removed, as it was asserting incorrect
      behavior. Non-volatile stores *CAN* be moved past volatile loads,
      and now are.

    CodeGen/X86/vector-idiv.ll -
    CodeGen/X86/vector-lzcnt-128.ll -
      It's basically impossible to tell what these tests are actually
      testing. But, looks like the code got better due to the memory
      operations being recognized as non-aliasing.

    CodeGen/X86/win32-eh.ll -
      Both loads of the securitycookie are now merged.

Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel

Differential Revision: https://reviews.llvm.org/D14834

llvm-svn: 289659
2016-12-14 15:44:26 +00:00
Evandro Menezes
f2e582c435 [ARM] Fix typo in checking prefix
llvm-svn: 289617
2016-12-14 00:02:03 +00:00
Evandro Menezes
d6a1cb395c Add support for Samsung Exynos M3 (NFC)
llvm-svn: 289613
2016-12-13 23:31:41 +00:00
Alina Sbirlea
81ca226117 Generalize strided store pattern in interleave access pass
Summary:
This patch aims to generalize matching of the strided store accesses to more general masks.
The more general rule is to have consecutive accesses based on the stride:
[x, y, ... z, x+1, y+1, ...z+1, x+2, y+2, ...z+2, ...]
All elements in the masks need not form a contiguous space, there may be gaps.
As before, undefs are allowed and filled in with adjacent element loads.

Reviewers: HaoLiu, mssimpso

Subscribers: mkuper, delena, llvm-commits

Differential Revision: https://reviews.llvm.org/D23646

llvm-svn: 289573
2016-12-13 19:32:36 +00:00
Artur Pilipenko
5f36ddf06d [DAGCombiner] Match load by bytes idiom and fold it into a single load
Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.

Assuming little endian target:
  i8 *a = ...
  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
  i32 val = *((i32)a)

  i8 *a = ...
  i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
  i32 val = BSWAP(*((i32)a))

This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.

Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
  i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

The general scheme is to match OR expressions by recursively calculating the origin of individual bits which constitute the resulting OR value. If all the OR bits come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.

Reviewed By: hfinkel, RKSimon, filcab

Differential Revision: https://reviews.llvm.org/D26149

llvm-svn: 289538
2016-12-13 14:21:14 +00:00
Sanjoy Das
a0e8011216 [Verifier] Add verification for TBAA metadata
Summary:
This change adds some verification in the IR verifier around struct path
TBAA metadata.

Other than some basic sanity checks (e.g. we get constant integers where
we expect constant integers), this checks:

 - That by the time an struct access tuple `(base-type, offset)` is
   "reduced" to a scalar base type, the offset is `0`.  For instance, in
   C++ you can't start from, say `("struct-a", 16)`, and end up with
   `("int", 4)` -- by the time the base type is `"int"`, the offset
   better be zero.  In particular, a variant of this invariant is needed
   for `llvm::getMostGenericTBAA` to be correct.

 - That there are no cycles in a struct path.

 - That struct type nodes have their offsets listed in an ascending
   order.

 - That when generating the struct access path, you eventually reach the
   access type listed in the tbaa tag node.

Reviewers: dexonsmith, chandlerc, reames, mehdi_amini, manmanren

Subscribers: mcrosier, llvm-commits

Differential Revision: https://reviews.llvm.org/D26438

llvm-svn: 289402
2016-12-11 20:07:15 +00:00
Matthias Braun
12c61bab00 Move .mir tests to appropriate directories
test/CodeGen/MIR should contain tests that intent to test the MIR
printing or parsing. Tests that test something else should be in
test/CodeGen/TargetName even when they are written in .mir.

As a rule of thumb, only tests using "llc -run-pass none" should be in
test/CodeGen/MIR.

llvm-svn: 289254
2016-12-09 19:08:15 +00:00
Nirav Dave
0b475dec77 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r289221 which appears to be triggering an assertion

llvm-svn: 289226
2016-12-09 17:18:24 +00:00
Nirav Dave
075ae0197d In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Retrying after fixing overly aggressive load-store forwarding optimization.

Simplify Consecutive Merge Store Candidate Search

Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.

Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).

Additional Minor Changes:

   1. Finishes removing unused AliasLoad code
   2. Unifies the the chain aggregation in the merged stores across
      code paths
   3. Re-add the Store node to the worklist after calling
      SimplifyDemandedBits.
   4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
      arbitrary, but seemed sufficient to not cause regressions in
      tests.

This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.

Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations

Noteworthy tests:

    CodeGen/AArch64/argument-blocks.ll -
      It's not entirely clear what the test_varargs_stackalign test is
      supposed to be asserting, but the new code looks right.

    CodeGen/AArch64/arm64-memset-inline.lli -
    CodeGen/AArch64/arm64-stur.ll -
    CodeGen/ARM/memset-inline.ll -

      The backend now generates *worse* code due to store merging
      succeeding, as we do do a 16-byte constant-zero store efficiently.

    CodeGen/AArch64/merge-store.ll -
      Improved, but there still seems to be an extraneous vector insert
      from an element to itself?

    CodeGen/PowerPC/ppc64-align-long-double.ll -
      Worse code emitted in this case, due to the improved store->load
      forwarding.

    CodeGen/X86/dag-merge-fast-accesses.ll -
    CodeGen/X86/MergeConsecutiveStores.ll -
    CodeGen/X86/stores-merging.ll -
    CodeGen/Mips/load-store-left-right.ll -
      Restored correct merging of non-aligned stores

    CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
      Improved. Correctly merges buffer_store_dword calls

    CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
      Improved. Sidesteps loading a stored value and
      merges two stores

    CodeGen/X86/pr18023.ll -
      This test has been removed, as it was asserting incorrect
      behavior. Non-volatile stores *CAN* be moved past volatile loads,
      and now are.

    CodeGen/X86/vector-idiv.ll -
    CodeGen/X86/vector-lzcnt-128.ll -
      It's basically impossible to tell what these tests are actually
      testing. But, looks like the code got better due to the memory
      operations being recognized as non-aliasing.

    CodeGen/X86/win32-eh.ll -
      Both loads of the securitycookie are now merged.

Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel

Differential Revision: https://reviews.llvm.org/D14834

llvm-svn: 289221
2016-12-09 16:15:12 +00:00
Weiming Zhao
404d8d84c6 Summary: Currently there is no way to disable deprecated warning from asm like this
clang  -target arm deprecated-asm.s -c
  deprecated-asm.s:30:9: warning: use of SP or PC in the list is deprecated
       stmia   r4!, {r12-r14}

We have to have an option what can disable it.

Patched by Yin Ma!

Reviewers: joey, echristo, weimingz

Subscribers: llvm-commits, aemerson

Differential Revision: https://reviews.llvm.org/D27219

llvm-svn: 288734
2016-12-05 23:55:13 +00:00
Wolfgang Pieb
78cccaa525 When instructions are hoisted out of loops by MachineLICM, remove their debug loc.
This prevents erratic stepping behavior as well as incorrect source attribution
for sample profiling.

Reviewers: dblakie

Subscribers: llvm-commit

Differential Revision: https://reviews.llvm.org/D27290

llvm-svn: 288442
2016-12-02 00:37:57 +00:00
Oleg Ranevskyy
ac195a6f2b [ARM] Fix for 64-bit CAS expansion on ARM32 with -O0
Summary:
This patch fixes comparison of 64-bit atomic with its expected value in CMP_SWAP_64 expansion.

Currently, the low words are compared with CMP, while the high words are compared with SBC. SBC expects the carry flag to be set if CMP detects a difference. CMP might leave the carry unset for unequal arguments though if the first one is >= than the second. This might cause the comparison logic to detect false equality.

Example of the broken C++ code:
```
std::atomic<long long> at(2);

long long ll = 1;
std::atomic_compare_exchange_strong(&at, &ll, 3);
```
Even though the atomic `at` and the expected value `ll` are not equal and `atomic_compare_exchange_strong` returns `false`, `at` is changed to 3.

The patch replaces SBC with CMPEQ.

Reviewers: t.p.northover

Subscribers: aemerson, rengolin, llvm-commits, asl

Differential Revision: https://reviews.llvm.org/D27315

llvm-svn: 288433
2016-12-01 22:58:35 +00:00
Kuba Mracek
c7c751102c [xray] Add XRay support for Mach-O in CodeGen
Currently, XRay only supports emitting the XRay table (xray_instr_map) on ELF binaries. Let's add Mach-O support.

Differential Revision: https://reviews.llvm.org/D26983

llvm-svn: 287734
2016-11-23 02:07:04 +00:00
Pablo Barrio
b119bf4783 [ARM] Relax restriction on variadic functions for tailcall optimization
Summary:
Variadic functions can be treated in the same way as normal functions
with respect to the number and types of parameters.

Reviewers: grosbach, olista01, t.p.northover, rengolin

Subscribers: javed.absar, aemerson, llvm-commits

Differential Revision: https://reviews.llvm.org/D26748

llvm-svn: 287219
2016-11-17 10:56:58 +00:00
Tim Northover
921ff0e003 ARM: fix CodeGen for 64-bit shifts.
One half of the shifts obviously needed conditional selection based on whether
the shift amount is more than 32-bits, but leaving the other half as the
natural shift isn't acceptable either: it's undefined behaviour to shift a
32-bit value by more than 31.

llvm-svn: 287149
2016-11-16 20:54:28 +00:00
Javed Absar
8e25c39cd6 [ARM] Add machine scheduler for Cortex-R52
This patch adds the Sched Machine Model for Cortex-R52.

Details of the pipeline and descriptions are in comments
in file ARMScheduleR52.td included in this patch.

Reviewers: rengolin, jmolloy

Differential Revision: https://reviews.llvm.org/D26500

llvm-svn: 286949
2016-11-15 11:34:54 +00:00
Tim Northover
8066f28ac6 Recommit: ARM: sort register lists by encoding in push/pop instructions.
For example we were producing

    push {r8, r10, r11, r4, r5, r7, lr}

This is misleading (r4, r5 and r7 are actually pushed before the rest), and
other components (stack folding recently) often forget to deal with the extra
complexity coming from the different order, leading to miscompiles. Finally, we
warn about our own code in -no-integrated-as mode without this, which is really
not a good idea.

Fixed usage of std::sort so that we (hopefully) use instantiations that
actually exist in GCC 4.8.

llvm-svn: 286881
2016-11-14 20:28:24 +00:00
Tim Northover
e57ef1ceb3 Revert "ARM: sort register lists by encoding in push/pop instructions."
This reverts commit 286866. It broke a bot, something to do with exactly which
templates std::sort accepts.

llvm-svn: 286867
2016-11-14 19:05:28 +00:00
Tim Northover
dc5cbf4521 ARM: sort register lists by encoding in push/pop instructions.
For example we were producing

    push {r8, r10, r11, r4, r5, r7, lr}

This is misleading (r4, r5 and r7 are actually pushed before the rest), and
other components (stack folding recently) often forget to deal with the extra
complexity coming from the different order, leading to miscompiles. Finally, we
warn about our own code in -no-integrated-as mode without this, which is really
not a good idea.

llvm-svn: 286866
2016-11-14 19:02:17 +00:00
Adrian Prantl
6dcbcbb6e8 Revert "Use private linkage for MergedGlobals variables" on Darwin.
This is a partial revert of r244615 (http://reviews.llvm.org/D11942),
which caused a major regression in debug info quality.

Turning the artificial __MergedGlobal symbols into private symbols
(l__MergedGlobal) means that the linker will not include them in the
symbol table of the final executable. Without a symbol table entry
dsymutil is not be able to process the debug info for any of the
merged globals and thus drops the debug info for all of them.

This patch is enabling the old behavior for all MachO targets while
leaving all other targets unaffected.

rdar://problem/29160481
https://reviews.llvm.org/D26531

llvm-svn: 286607
2016-11-11 17:50:09 +00:00
Diana Picus
ece0160511 [ARM] Add plumbing for GlobalISel
Add GlobalISel skeleton, up to the point where we can select a ret void.

llvm-svn: 286573
2016-11-11 08:27:37 +00:00
Matthias Braun
4741c1e622 ScheduleDAGInstrs: Add condjump deps to addSchedBarrierDeps()
addSchedBarrierDeps() is supposed to add use operands to the ExitSU
node. The current implementation adds uses for calls/barrier instruction
and the MBB live-outs in all other cases. The use
operands of conditional jump instructions were missed.

Also added code to macrofusion to set the latencies between nodes to
zero to avoid problems with the fusing nodes lingering around in the
pending list now.

Differential Revision: https://reviews.llvm.org/D25140

llvm-svn: 286544
2016-11-11 01:34:21 +00:00
James Molloy
39cc4bbf20 [Thumb1] Move padding earlier when synthesizing TBBs off of the PC
When the base register (register pointing to the jump table) is the PC, we expect the jump table to directly follow the jump sequence with no intervening padding.

If there is intervening padding, the calculated offsets will not be correct. One solution would be to account for any padding in the emitted LDRB instruction, but at the moment we don't support emitting MCExprs for the load offset.

In the meantime, it's correct and only a slight amount worse to just move the padding up, from just before the jump table to just before the jump instruction sequence. We can do that by emitting code alignment before the jump sequence, as we know the number of instructions in the sequence is always 4.

llvm-svn: 286107
2016-11-07 13:38:21 +00:00
Saleem Abdulrasool
dd02e99d65 ARM: lower fpowi appropriately for Windows ARM
This handles the last case of the builtin function calls that we would
generate code which differed from Microsoft's ABI.  Rather than
generating a call to `__pow{d,s}i2` we now promote the parameter to a
float or double and invoke `powf` or `pow` instead.

Addresses PR30825!

llvm-svn: 286082
2016-11-06 19:46:54 +00:00
Weiming Zhao
300116bb60 [Cortex-M0] Atomic lowering
Summary: ARMv6m supports dmb etc fench instructions but not ldrex/strex etc. So for some atomic load/store, LLVM should inline instructions instead of lowering to __sync_ calls.

Reviewers: rengolin, efriedma, t.p.northover, jmolloy

Subscribers: efriedma, aemerson, llvm-commits

Differential Revision: https://reviews.llvm.org/D26120

llvm-svn: 285969
2016-11-03 21:49:08 +00:00
James Molloy
48e2ca4d44 Revert "[Thumb] Teach ISel how to lower compares of AND bitmasks efficiently"
This reverts commit r285893. It caused (probably) http://lab.llvm.org:8011/builders/clang-cmake-thumbv7-a15-full-sh/builds/83 .

llvm-svn: 285912
2016-11-03 14:08:01 +00:00
James Molloy
d32f0cec9f [Thumb] Teach ISel how to lower compares of AND bitmasks efficiently
This recommits r281323, which was backed out for two reasons. One, a selfhost failure, and two, it apparently caused Chromium failures. Actually, the latter was a red herring. The log has expired from the former, but I suspect that was a red herring too (actually caused by another problematic patch of mine). Therefore reapplying, and will watch the bots like a hawk.

For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more efficient instruction selection if the bitmask is one consecutive sequence of set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).

1) If the bitmask touches the LSB, then we can remove all the upper bits and set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit into the sign bit with one LSLS and change the condition query from NE/EQ to MI/PL (we could also implement this by shifting into the carry bit and branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower zero bits of the mask.

1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two 16-bit instructions but can elide the CMP and doesn't require materializing a complex immediate, so is also a win.

llvm-svn: 285893
2016-11-03 10:18:20 +00:00
Sjoerd Meijer
7c7c25f50f This is a 1 character fix for an ARM build attribute test (r284571): the
purpose of the test was to have 2 different function attribute sets, but due
to a typo there was only one both with number #0.

llvm-svn: 285701
2016-11-01 15:59:37 +00:00
James Molloy
5538ea11d7 [Thumb-1] Synthesize TBB/TBH instructions to make use of compressed jump tables
[Reapplying r284580 and r285917 with fix and testing to ensure emitted jump tables for Thumb-1 have 4-byte alignment]

The TBB and TBH instructions in Thumb-2 allow jump tables to be compressed into sequences of bytes or shorts respectively. These instructions do not exist in Thumb-1, however it is possible to synthesize them out of a sequence of other instructions.

It turns out this sequence is so short that it's almost never a lose for performance and is ALWAYS a significant win for code size.

TBB example:
Before: lsls r0, r0, #2    After: add  r0, pc
        adr  r1, .LJTI0_0         ldrb r0, [r0, #6]
        ldr  r0, [r0, r1]         lsls r0, r0, #1
        mov  pc, r0               add  pc, r0
  => No change in prologue code size or dynamic instruction count. Jump table shrunk by a factor of 4.

The only case that can increase dynamic instruction count is the TBH case:

Before: lsls r0, r4, #2    After: lsls r4, r4, #1
        adr  r1, .LJTI0_0         add  r4, pc
        ldr  r0, [r0, r1]         ldrh r4, [r4, #6]
        mov  pc, r0               lsls r4, r4, #1
                                  add  pc, r4
  => 1 more instruction in prologue. Jump table shrunk by a factor of 2.

So there is an argument that this should be disabled when optimizing for performance (and a TBH needs to be generated). I'm not so sure about that in practice, because on small cores with Thumb-1 performance is often tied to code size. But I'm willing to turn it off when optimizing for performance if people want (also note that TBHs are fairly rare in practice!)

llvm-svn: 285690
2016-11-01 13:37:41 +00:00
Saleem Abdulrasool
2e23098170 CodeGen: further loosen -O0 CG for WoA division
Generate the slowest possible codepath for noopt CodeGen.  Even trying to be
clever with the negated jump can cause out-of-range jumps.  Use a wide branch
instead. Although the code is modelled simplistically, the later optimizations
would recombine the branching into `cbz` if possible.  This re-enables the
previous optimization as well as hopefully gives us working code in all cases.

Addresses PR30356!

llvm-svn: 285649
2016-10-31 22:12:37 +00:00
Arnold Schwaighofer
252873258a Make swift calling convention test specific to armv7
llvm-svn: 285431
2016-10-28 19:18:09 +00:00
Arnold Schwaighofer
7322acc850 More swift calling convention tests
llvm-svn: 285417
2016-10-28 17:21:05 +00:00
Saleem Abdulrasool
5d2e860995 ARM: ensure that the Windows DBZ check is in range
The Windows ARM target expects the compiler to emit a division-by-zero check.
The check would use the form of:

    cmp r?, #0
    cbz .Ltrap
    b .Lbody
  .Lbody:
    ...
  .Ltrap:
    udf #249 @ __brkdiv0

This works great most of the time.  However, if the body of the function is
greater than 127 bytes, the branch target limitation of cbz becomes an issue.
This occurs in the unoptimized code generation cases sometimes (like in
compiler-rt).

Since this is a matter of correctness, possibly pay a small penalty instead.  We
now form this slightly differently:

    cbnz .Lbody
    udf #249 @ __brkdiv0
  .Lbody:
    ...

The positive case is through the branch instead of being the next instruction.
However, because of the basic block layout, the negated branch is going to be
a short distance always (2 bytes away, after the inserted __brkdiv0).

The new t__brkdiv0 instruction is required to explicitly mark the instruction as
a terminator as the generic UDF instruction is not a terminator.

Addresses PR30532!

llvm-svn: 285312
2016-10-27 16:59:22 +00:00
Sam Parker
cc86b4dc71 [ARM] Add newline char to test.
Missed a newline in the previous commit.

Differential Revision: https://reviews.llvm.org/D26027

llvm-svn: 285280
2016-10-27 10:43:02 +00:00
Sam Parker
5118fae0d2 [ARM] Predicate UMAAL selection on hasDSP.
UMAAL is a DSP instruction and it is not available on thumbv7m
(Cortex-M3) and thumbv6m (Cortex-M0+1) targets. Also fix wrong
CHECK prefix in longMAC.ll test.

Patch by Vadzim Dambrouski.

Differential Revision: https://reviews.llvm.org/D25890

llvm-svn: 285278
2016-10-27 09:47:10 +00:00
Tim Northover
b805bcfa72 ARM: don't rely on push/pop reglists being in order when folding SP adjust.
It would be a very nice invariant to rely on, but unfortunately it doesn't
necessarily hold (and the causes of mis-sorted reglists appear to be quite
varied) so to be robust the frame lowering code can't assume that the first
register in the list is also the first one that actually gets pushed.

Should fix an issue where we were turning something like:

    push {r8, r4, r7, lr}
    sub sp, #24

into nonsense like:

    push {r2, r3, r4, r5, r6, r7, r8, r4, r7, lr}

llvm-svn: 285232
2016-10-26 20:01:00 +00:00
Mandeep Singh Grang
54f45ece9e [llvm] Remove redundant --check-prefix=CHECK from tests
Reviewers: MatzeB, mcrosier, rengolin

Differential Revision: https://reviews.llvm.org/D25894

llvm-svn: 285003
2016-10-24 18:57:55 +00:00
Eli Friedman
b17f521ba9 Revert r284580+r284917. ("Synthesize TBB/TBH instructions")
The optimization has correctness issues, so reverting for now to fix tests
on thumb1 targets.

llvm-svn: 284993
2016-10-24 17:20:50 +00:00
James Molloy
3bb4aed2c3 [ARM] Fix crash in ConstantIslands
tPCRelJT may not be the first instruction in a block. Check that instead of dereferencing a broken iterator.

llvm-svn: 284917
2016-10-22 09:58:37 +00:00
Dehao Chen
65848e30b4 Using branch probability to guide critical edge splitting.
Summary:
The original heuristic to break critical edge during machine sink is relatively conservertive: when there is only one instruction sinkable to the critical edge, it is likely that the machine sink pass will not break the critical edge. This leads to many speculative instructions executed at runtime. However, with profile info, we could model the splitting benefits: if the critical edge has 50% taken rate, it would always be beneficial to split the critical edge to avoid the speculated runtime instructions. This patch uses profile to guide critical edge splitting in machine sink pass.

The performance impact on speccpu2006 on Intel sandybridge machines:

spec/2006/fp/C++/444.namd                  25.3  +0.26%
spec/2006/fp/C++/447.dealII               45.96  -0.10%
spec/2006/fp/C++/450.soplex               41.97  +1.49%
spec/2006/fp/C++/453.povray               36.83  -0.96%
spec/2006/fp/C/433.milc                   23.81  +0.32%
spec/2006/fp/C/470.lbm                    41.17  +0.34%
spec/2006/fp/C/482.sphinx3                48.13  +0.69%
spec/2006/int/C++/471.omnetpp             22.45  +3.25%
spec/2006/int/C++/473.astar               21.35  -2.06%
spec/2006/int/C++/483.xalancbmk           36.02  -2.39%
spec/2006/int/C/400.perlbench              33.7  -0.17%
spec/2006/int/C/401.bzip2                  22.9  +0.52%
spec/2006/int/C/403.gcc                   32.42  -0.54%
spec/2006/int/C/429.mcf                   39.59  +0.19%
spec/2006/int/C/445.gobmk                 26.98  -0.00%
spec/2006/int/C/456.hmmer                 24.52  -0.18%
spec/2006/int/C/458.sjeng                 28.26  +0.02%
spec/2006/int/C/462.libquantum            55.44  +3.74%
spec/2006/int/C/464.h264ref               46.67  -0.39%

geometric mean                                   +0.20%

Manually checked 473 and 471 to verify the diff is in the noise range.

Reviewers: rengolin, davidxl

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D24818

llvm-svn: 284757
2016-10-20 18:06:52 +00:00
Reid Kleckner
dcf108e176 [GlobalMerge] Handle non-landingpad EH pads
This code crashed on funclet-style EH instructions such as catchpad,
catchswitch, and cleanuppad. Just treat all EH pad instructions
equivalently and avoid merging the globals they reference through any
use.

llvm-svn: 284633
2016-10-19 19:56:22 +00:00
Sanjay Patel
b7d4502f35 [DAG] optimize negation of bool
Use mask and negate for legalization of i1 source type with SIGN_EXTEND_INREG.
With the mask, this should be no worse than 2 shifts. The mask can be eliminated
in some cases, so that should be better than 2 shifts.

This change exposed some missing folds related to negation:
https://reviews.llvm.org/rL284239
https://reviews.llvm.org/rL284395

There may be others, so please let me know if you see any regressions.

Differential Revision: https://reviews.llvm.org/D25485

llvm-svn: 284611
2016-10-19 16:58:59 +00:00
Sjoerd Meijer
3d94a0d81d Reapply r284571 (with the new tests fixed).
llvm-svn: 284588
2016-10-19 13:43:02 +00:00
James Molloy
5a58c95fb7 [Thumb-1] Synthesize TBB/TBH instructions to make use of compressed jump tables
The TBB and TBH instructions in Thumb-2 allow jump tables to be compressed into sequences of bytes or shorts respectively. These instructions do not exist in Thumb-1, however it is possible to synthesize them out of a sequence of other instructions.

It turns out this sequence is so short that it's almost never a lose for performance and is ALWAYS a significant win for code size.

TBB example:
Before: lsls r0, r0, #2    After: add  r0, pc
        adr  r1, .LJTI0_0         ldrb r0, [r0, #6]
        ldr  r0, [r0, r1]         lsls r0, r0, #1
        mov  pc, r0               add  pc, r0
  => No change in prologue code size or dynamic instruction count. Jump table shrunk by a factor of 4.

The only case that can increase dynamic instruction count is the TBH case:

Before: lsls r0, r4, #2    After: lsls r4, r4, #1
        adr  r1, .LJTI0_0         add  r4, pc
        ldr  r0, [r0, r1]         ldrh r4, [r4, #6]
        mov  pc, r0               lsls r4, r4, #1
                                  add  pc, r4
  => 1 more instruction in prologue. Jump table shrunk by a factor of 2.

So there is an argument that this should be disabled when optimizing for performance (and a TBH needs to be generated). I'm not so sure about that in practice, because on small cores with Thumb-1 performance is often tied to code size. But I'm willing to turn it off when optimizing for performance if people want (also note that TBHs are fairly rare in practice!)

llvm-svn: 284580
2016-10-19 12:06:49 +00:00
Sjoerd Meijer
6540326539 Revert of r284571 because of failing tests.
llvm-svn: 284572
2016-10-19 07:45:48 +00:00
Sjoerd Meijer
522c2a6a6a Checking FP function attribute values and adding more build attribute tests.
This renames the function for checking FP function attribute values and also
adds more build attribute tests (which are in separate files because build
attributes are set per file).

Differential Revision: https://reviews.llvm.org/D25625

llvm-svn: 284571
2016-10-19 07:25:06 +00:00
Dehao Chen
9da0fdce12 Revert r284545 again as the regression in ppc still exists. There is bug in MBPI exposed by th patch.
Also update the section.ll to fix non-x86 failure.

llvm-svn: 284563
2016-10-19 01:18:25 +00:00
Dehao Chen
c30554ed83 Using branch probability to guide critical edge splitting.
Summary:
The original heuristic to break critical edge during machine sink is relatively conservertive: when there is only one instruction sinkable to the critical edge, it is likely that the machine sink pass will not break the critical edge. This leads to many speculative instructions executed at runtime. However, with profile info, we could model the splitting benefits: if the critical edge has 50% taken rate, it would always be beneficial to split the critical edge to avoid the speculated runtime instructions. This patch uses profile to guide critical edge splitting in machine sink pass.

The performance impact on speccpu2006 on Intel sandybridge machines:

spec/2006/fp/C++/444.namd                  25.3  +0.26%
spec/2006/fp/C++/447.dealII               45.96  -0.10%
spec/2006/fp/C++/450.soplex               41.97  +1.49%
spec/2006/fp/C++/453.povray               36.83  -0.96%
spec/2006/fp/C/433.milc                   23.81  +0.32%
spec/2006/fp/C/470.lbm                    41.17  +0.34%
spec/2006/fp/C/482.sphinx3                48.13  +0.69%
spec/2006/int/C++/471.omnetpp             22.45  +3.25%
spec/2006/int/C++/473.astar               21.35  -2.06%
spec/2006/int/C++/483.xalancbmk           36.02  -2.39%
spec/2006/int/C/400.perlbench              33.7  -0.17%
spec/2006/int/C/401.bzip2                  22.9  +0.52%
spec/2006/int/C/403.gcc                   32.42  -0.54%
spec/2006/int/C/429.mcf                   39.59  +0.19%
spec/2006/int/C/445.gobmk                 26.98  -0.00%
spec/2006/int/C/456.hmmer                 24.52  -0.18%
spec/2006/int/C/458.sjeng                 28.26  +0.02%
spec/2006/int/C/462.libquantum            55.44  +3.74%
spec/2006/int/C/464.h264ref               46.67  -0.39%

geometric mean                                   +0.20%

Manually checked 473 and 471 to verify the diff is in the noise range.

Reviewers: rengolin, davidxl

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D24818

llvm-svn: 284545
2016-10-18 23:24:02 +00:00
Dehao Chen
ee1c9ba17a revert r284541.
llvm-svn: 284544
2016-10-18 23:11:20 +00:00
Dehao Chen
a1d3ed3e41 Using branch probability to guide critical edge splitting.
Summary:
The original heuristic to break critical edge during machine sink is relatively conservertive: when there is only one instruction sinkable to the critical edge, it is likely that the machine sink pass will not break the critical edge. This leads to many speculative instructions executed at runtime. However, with profile info, we could model the splitting benefits: if the critical edge has 50% taken rate, it would always be beneficial to split the critical edge to avoid the speculated runtime instructions. This patch uses profile to guide critical edge splitting in machine sink pass.

The performance impact on speccpu2006 on Intel sandybridge machines:

spec/2006/fp/C++/444.namd                  25.3  +0.26%
spec/2006/fp/C++/447.dealII               45.96  -0.10%
spec/2006/fp/C++/450.soplex               41.97  +1.49%
spec/2006/fp/C++/453.povray               36.83  -0.96%
spec/2006/fp/C/433.milc                   23.81  +0.32%
spec/2006/fp/C/470.lbm                    41.17  +0.34%
spec/2006/fp/C/482.sphinx3                48.13  +0.69%
spec/2006/int/C++/471.omnetpp             22.45  +3.25%
spec/2006/int/C++/473.astar               21.35  -2.06%
spec/2006/int/C++/483.xalancbmk           36.02  -2.39%
spec/2006/int/C/400.perlbench              33.7  -0.17%
spec/2006/int/C/401.bzip2                  22.9  +0.52%
spec/2006/int/C/403.gcc                   32.42  -0.54%
spec/2006/int/C/429.mcf                   39.59  +0.19%
spec/2006/int/C/445.gobmk                 26.98  -0.00%
spec/2006/int/C/456.hmmer                 24.52  -0.18%
spec/2006/int/C/458.sjeng                 28.26  +0.02%
spec/2006/int/C/462.libquantum            55.44  +3.74%
spec/2006/int/C/464.h264ref               46.67  -0.39%

geometric mean                                   +0.20%

Manually checked 473 and 471 to verify the diff is in the noise range.

Reviewers: rengolin, davidxl

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D24818

llvm-svn: 284541
2016-10-18 21:36:11 +00:00
Eli Friedman
a5414b4036 Improve ARM lowering for "icmp <2 x i64> eq".
The custom lowering is pretty straightforward: basically, just AND
together the two halves of a <4 x i32> compare.

Differential Revision: https://reviews.llvm.org/D25713

llvm-svn: 284536
2016-10-18 21:03:40 +00:00
Javed Absar
90db73a094 [ARM] Assign cost of scaling for Cortex-R52
This patch assigns cost of the scaling used in addressing for Cortex-R52.

On Cortex-R52 a negated register offset takes longer than a non-negated
register offset, in a register-offset addressing mode.

Differential Revision: http://reviews.llvm.org/D25670

Reviewer: jmolloy
llvm-svn: 284460
2016-10-18 09:08:54 +00:00
Dean Michael Berris
b135433036 [XRay] Support for for tail calls for ARM no-Thumb
This patch adds simplified support for tail calls on ARM with XRay instrumentation.

Known issue: compiled with generic flags: `-O3 -g -fxray-instrument -Wall
-std=c++14  -ffunction-sections -fdata-sections` (this list doesn't include my
specific flags like --target=armv7-linux-gnueabihf etc.), the following program

    #include <cstdio>
    #include <cassert>
    #include <xray/xray_interface.h>

    [[clang::xray_always_instrument]] void __attribute__ ((noinline)) fC() {
      std::printf("In fC()\n");
    }

    [[clang::xray_always_instrument]] void __attribute__ ((noinline)) fB() {
      std::printf("In fB()\n");
      fC();
    }

    [[clang::xray_always_instrument]] void __attribute__ ((noinline)) fA() {
      std::printf("In fA()\n");
      fB();
    }

    // Avoid infinite recursion in case the logging function is instrumented (so calls logging
    //   function again).
    [[clang::xray_never_instrument]] void simplyPrint(int32_t functionId, XRayEntryType xret)
    {
      printf("XRay: functionId=%d type=%d.\n", int(functionId), int(xret));
    }

    int main(int argc, char* argv[]) {
      __xray_set_handler(simplyPrint);

      printf("Patching...\n");
      __xray_patch();
      fA();

      printf("Unpatching...\n");
      __xray_unpatch();
      fA();

      return 0;
    }

gives the following output:

    Patching...
    XRay: functionId=3 type=0.
    In fA()
    XRay: functionId=3 type=1.
    XRay: functionId=2 type=0.
    In fB()
    XRay: functionId=2 type=1.
    XRay: functionId=1 type=0.
    XRay: functionId=1 type=1.
    In fC()
    Unpatching...
    In fA()
    In fB()
    In fC()

So for function fC() the exit sled seems to be called too much before function
exit: before printing In fC().

Debugging shows that the above happens because printf from fC is also called as
a tail call. So first the exit sled of fC is executed, and only then printf is
jumped into. So it seems we can't do anything about this with the current
approach (i.e. within the simplification described in
https://reviews.llvm.org/D23988 ).

Differential Revision: https://reviews.llvm.org/D25030

llvm-svn: 284456
2016-10-18 05:54:15 +00:00
James Molloy
c6be02fbf0 [SDAG] Use ABI type alignment for constant pools when optimizing for size
SelectionDAG::getConstantPool will automatically determine an appropriate alignment if one is not specified. It does this by querying the type's preferred alignment. This can end up creating quite a lot of padding when the preferred alignment for vectors is 128.

In optimize-for-size mode, it makes sense to instead query the ABI type alignment which is often smaller and causes less padding.

llvm-svn: 284381
2016-10-17 12:54:07 +00:00
Sanjay Patel
55f9ef6ad9 [ARM] add tests for PR30660
llvm-svn: 284280
2016-10-14 20:52:43 +00:00
Nirav Dave
7c2dd71bf8 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r284151 which appears to be triggering a LTO
failures on Hexagon

llvm-svn: 284157
2016-10-13 20:23:25 +00:00
Nirav Dave
01829c947a In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Retrying after upstream changes.

   Simplify Consecutive Merge Store Candidate Search

   Now that address aliasing is much less conservative, push through
   simplified store merging search which only checks for parallel stores
   through the chain subgraph. This is cleaner as the separation of
   non-interfering loads/stores from the store-merging logic.

   Whem merging stores, search up the chain through a single load, and
   finds all possible stores by looking down from through a load and a
   TokenFactor to all stores visited. This improves the quality of the
   output SelectionDAG and generally the output CodeGen (with some
   exceptions).

   Additional Minor Changes:

       1. Finishes removing unused AliasLoad code
       2. Unifies the the chain aggregation in the merged stores across
       code paths
       3. Re-add the Store node to the worklist after calling
       SimplifyDemandedBits.
       4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
       arbitrary, but seemed sufficient to not cause regressions in
       tests.

   This finishes the change Matt Arsenault started in r246307 and
   jyknight's original patch.

   Many tests required some changes as memory operations are now
   reorderable. Some tests relying on the order were changed to use
   volatile memory operations

   Noteworthy tests:

    CodeGen/AArch64/argument-blocks.ll -
      It's not entirely clear what the test_varargs_stackalign test is
      supposed to be asserting, but the new code looks right.

    CodeGen/AArch64/arm64-memset-inline.lli -
    CodeGen/AArch64/arm64-stur.ll -
    CodeGen/ARM/memset-inline.ll -

      The backend now generates *worse* code due to store merging
      succeeding, as we do do a 16-byte constant-zero store efficiently.

    CodeGen/AArch64/merge-store.ll -
      Improved, but there still seems to be an extraneous vector insert
      from an element to itself?

    CodeGen/PowerPC/ppc64-align-long-double.ll -
      Worse code emitted in this case, due to the improved store->load
      forwarding.

    CodeGen/X86/dag-merge-fast-accesses.ll -
    CodeGen/X86/MergeConsecutiveStores.ll -
    CodeGen/X86/stores-merging.ll -
    CodeGen/Mips/load-store-left-right.ll -
      Restored correct merging of non-aligned stores

    CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
      Improved. Correctly merges buffer_store_dword calls

    CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
      Improved. Sidesteps loading a stored value and
      merges two stores

    CodeGen/X86/pr18023.ll -
      This test has been removed, as it was asserting incorrect
      behavior. Non-volatile stores *CAN* be moved past volatile loads,
      and now are.

    CodeGen/X86/vector-idiv.ll -
    CodeGen/X86/vector-lzcnt-128.ll -
      It's basically impossible to tell what these tests are actually
      testing. But, looks like the code got better due to the memory
      operations being recognized as non-aliasing.

    CodeGen/X86/win32-eh.ll -
      Both loads of the securitycookie are now merged.

    CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll -
      This test appears to work but no longer exhibits the spill behavior.

Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel

Differential Revision: https://reviews.llvm.org/D14834

llvm-svn: 284151
2016-10-13 19:20:16 +00:00
Javed Absar
35adbf7085 [ARM]: Assign cost of scaling used in addressing mode for ARM cores
This patch assigns cost of the scaling used in addressing.
On many ARM cores, a negated register offset takes longer than a
non-negated register offset, in a register-offset addressing mode.

For instance:

LDR R0, [R1, R2 LSL #2]
LDR R0, [R1, -R2 LSL #2]

Above, (1) takes less cycles than (2).

By assigning appropriate scaling factor cost, we enable the LLVM
to make the right trade-offs in the optimization and code-selection phase.

Differential Revision: http://reviews.llvm.org/D24857

Reviewers: jmolloy, rengolin
llvm-svn: 284127
2016-10-13 14:57:43 +00:00
Reid Kleckner
b5d0510883 Correct PrivateLinkage for COFF
- Use storage class C_STAT for 'PrivateLinkage' The storage class for
  PrivateLinkage should equal to the Internal Linkage.

- Set 'PrivateGlobalPrefix' from "L" to ".L" for MM_WinCOFF (includes
  x86_64) MM_WinCOFF has empty GlobalPrefix '\0' so PrivateGlobalPrefix
  "L" may conflict to the normal symbol name starting with 'L'.

Based on a patch by Han Sangjin! Manually updated test cases.

llvm-svn: 284096
2016-10-13 00:55:24 +00:00
Konstantin Zhuravlyov
da8ee8e7dc [DAGCombiner] Do not remove the load of stored values when optimizations are disabled
This combiner breaks debug experience and should not be run when optimizations are disabled.

For example:
  int main() {
    int j = 0;
    j += 2;
    if (j == 2)
      return 0;
    return 5;
  }
When debugging this code compiled in /O0, it should be valid to break at line "j+=2;" and edit the value of j. It should change the return value of the function.

Differential Revision: https://reviews.llvm.org/D19268

llvm-svn: 284014
2016-10-12 13:44:24 +00:00
Kyle Butt
abb4b9d9ee Codegen: Tail-duplicate during placement.
The tail duplication pass uses an assumed layout when making duplication
decisions. This is fine, but passes up duplication opportunities that
may arise when blocks are outlined. Because we want the updated CFG to
affect subsequent placement decisions, this change must occur during
placement.

In order to achieve this goal, TailDuplicationPass is split into a
utility class, TailDuplicator, and the pass itself. The pass delegates
nearly everything to the TailDuplicator object, except for looping over
the blocks in a function. This allows the same code to be used for tail
duplication in both places.

This change, in concert with outlining optional branches, allows
triangle shaped code to perform much better, esepecially when the
taken/untaken branches are correlated, as it creates a second spine when
the tests are small enough.

Issue from previous rollback fixed, and a new test was added for that
case as well. Issue was worklist/scheduling/taildup issue in layout.

Issue from 2nd rollback fixed, with 2 additional tests. Issue was
tail merging/loop info/tail-duplication causing issue with loops that share
a header block.

Issue with early tail-duplication of blocks that branch to a fallthrough
predecessor fixed with test case: tail-dup-branch-to-fallthrough.ll

Differential revision: https://reviews.llvm.org/D18226

llvm-svn: 283934
2016-10-11 20:36:43 +00:00
Oliver Stannard
674ab91499 [ARM] Fix registers clobbered by SjLj EH on soft-float targets
Currently, the Int_eh_sjlj_dispatchsetup intrinsic is marked as
clobbering all registers, including floating-point registers that may
not be present on the target. This is technically true, as we could get
linked against code that does use the FP registers, but that will not
actually work, as the soft-float code cannot save and restore the FP
registers. SjLj exception handling can only work correctly if either all
or none of the code is built for a target with FP registers. Therefore,
we can assume that, when Int_eh_sjlj_dispatchsetup is compiled for a
soft-float target, it is only going to be linked against other
soft-float code, and so only clobbers the general-purpose registers.
This allows us to check that no non-savable registers are clobbered when
generating the prologue/epilogue.

Differential Revision: https://reviews.llvm.org/D25180

llvm-svn: 283866
2016-10-11 10:06:59 +00:00
Daniel Jasper
80e95dce89 Revert "Codegen: Tail-duplicate during placement."
This reverts commit r283842.

test/CodeGen/X86/tail-dup-repeat.ll causes and llc crash with our
internal testing. I'll share a link with you.

llvm-svn: 283857
2016-10-11 07:36:11 +00:00
Kyle Butt
f6ea6695ce Codegen: Tail-duplicate during placement.
The tail duplication pass uses an assumed layout when making duplication
decisions. This is fine, but passes up duplication opportunities that
may arise when blocks are outlined. Because we want the updated CFG to
affect subsequent placement decisions, this change must occur during
placement.

In order to achieve this goal, TailDuplicationPass is split into a
utility class, TailDuplicator, and the pass itself. The pass delegates
nearly everything to the TailDuplicator object, except for looping over
the blocks in a function. This allows the same code to be used for tail
duplication in both places.

This change, in concert with outlining optional branches, allows
triangle shaped code to perform much better, esepecially when the
taken/untaken branches are correlated, as it creates a second spine when
the tests are small enough.

Issue from previous rollback fixed, and a new test was added for that
case as well. Issue was worklist/scheduling/taildup issue in layout.

Issue from 2nd rollback fixed, with 2 additional tests. Issue was
tail merging/loop info/tail-duplication causing issue with loops that share
a header block.

Issue with early tail-duplication of blocks that branch to a fallthrough
predecessor fixed with test case: tail-dup-branch-to-fallthrough.ll

Differential revision: https://reviews.llvm.org/D18226

llvm-svn: 283842
2016-10-11 01:20:33 +00:00
Alexandros Lamprineas
7a09ca4165 [ARM] Fix invalid VLDM/VSTM access when targeting Big Endian with NEON
The instructions VLDM/VSTM can only access word-aligned memory
locations and produce alignment fault if the condition is not met.

The compiler currently generates VLDM/VSTM for v2f64 load/store
regardless the alignment of the memory access. Instead, if a v2f64
load/store is not word-aligned, the compiler should generate
VLD1/VST1. For each non double-word-aligned VLD1/VST1, a VREV
instruction should be generated when targeting Big Endian.

Differential Revision: https://reviews.llvm.org/D25281

llvm-svn: 283763
2016-10-10 16:01:54 +00:00
Kyle Butt
db90994d08 Revert "Codegen: Tail-duplicate during placement."
This reverts commit 71c312652c10f1855b28d06697c08d47e7a243e4.

llvm-svn: 283647
2016-10-08 01:47:05 +00:00
Kyle Butt
71644ee1e5 Codegen: Tail-duplicate during placement.
The tail duplication pass uses an assumed layout when making duplication
decisions. This is fine, but passes up duplication opportunities that
may arise when blocks are outlined. Because we want the updated CFG to
affect subsequent placement decisions, this change must occur during
placement.

In order to achieve this goal, TailDuplicationPass is split into a
utility class, TailDuplicator, and the pass itself. The pass delegates
nearly everything to the TailDuplicator object, except for looping over
the blocks in a function. This allows the same code to be used for tail
duplication in both places.

This change, in concert with outlining optional branches, allows
triangle shaped code to perform much better, esepecially when the
taken/untaken branches are correlated, as it creates a second spine when
the tests are small enough.

Issue from previous rollback fixed, and a new test was added for that
case as well. Issue was worklist/scheduling/taildup issue in layout.

Issue from 2nd rollback fixed, with 2 additional tests. Issue was
tail merging/loop info/tail-duplication causing issue with loops that share
a header block.

Differential revision: https://reviews.llvm.org/D18226

llvm-svn: 283619
2016-10-07 22:33:20 +00:00
Arnold Schwaighofer
6fad756c74 swifterror: Don't compute swifterror vregs during instruction selection
The code used llvm basic block predecessors to decided where to insert phi
nodes. Instruction selection can and will liberally insert new machine basic
block predecessors. There is not a guaranteed one-to-one mapping from pred.
llvm basic blocks and machine basic blocks.

Therefore the current approach does not work as it assumes we can mark
predecessor machine basic block as needing a copy, and needs to know the set of
all predecessor machine basic blocks to decide when to insert phis.

Instead of computing the swifterror vregs as we select instructions, propagate
them at the end of instruction selection when the MBB CFG is complete.

When an instruction needs a swifterror vreg and we don't know the value yet,
generate a new vreg and remember this "upward exposed" use, and reconcile this
at the end of instruction selection.

This will only happen if the target supports promoting swifterror parameters to
registers and the swifterror attribute is used.

rdar://28300923

llvm-svn: 283617
2016-10-07 22:06:55 +00:00
Martin Storsjo
59ad31e4f1 [ARM] Reapply: Use __rt_div functions for divrem on Windows
Reapplying r283383 after revert in r283442. The additional fix
is a getting rid of a stray space in a function name, in the
refactoring part of the commit.

This avoids falling back to calling out to the GCC rem functions
(__moddi3, __umoddi3) when targeting Windows.

The __rt_div functions have flipped the two arguments compared
to the __aeabi_divmod functions. To match MSVC, we emit a
check for division by zero before actually calling the library
function (even if the library function itself also might do
the same check).

Not all calls to __rt_div functions for division are currently
merged with calls to the same function with the same parameters
for the remainder. This is more wasteful than a div + mls as before,
but avoids calls to __moddi3.

Differential Revision: https://reviews.llvm.org/D25332

llvm-svn: 283550
2016-10-07 13:28:53 +00:00
Javed Absar
636ccd32c2 [ARM]: Add Cortex-R52 target to LLVM
This patch adds Cortex-R52, the new ARM real-time processor, to LLVM. 
Cortex-R52 implements the ARMv8-R architecture.

llvm-svn: 283542
2016-10-07 12:06:40 +00:00
Diana Picus
5225ad60e1 Revert "[ARM] Use __rt_div functions for divrem on Windows"
This reverts commit r283383 because it broke some of the bots:
undefined reference to ` __aeabi_uldivmod'

It affected (at least) clang-cmake-armv7-a15-selfhost,
clang-cmake-armv7-a15-selfhost and clang-native-arm-lnt.

llvm-svn: 283442
2016-10-06 11:24:29 +00:00
James Molloy
0b94204e9b [ARM] Constant pool promotion - fix alignment calculation
Global variables are GlobalValues, so they have explicit alignment. Querying
DataLayout for the alignment was incorrect.

Testcase added.

llvm-svn: 283423
2016-10-06 07:56:00 +00:00
James Molloy
8870a22d98 [ARM] Improve testcase for r283323
We can work around a shortcoming of FileCheck by using {{\[}} to match a square
bracket before a [[ sequence.

Thanks to Eli Friedman for the heads up!

llvm-svn: 283422
2016-10-06 07:44:05 +00:00