mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-10-19 02:52:53 +02:00
Commit Graph

42520 Commits

Author SHA1 Message Date
Simon Pilgrim
038cfb6d37 [IR] PatternMatch - add m_FShl/m_FShr funnel shift intrinsic matchers. NFCI. 2020-10-01 14:42:34 +01:00
James Henderson
6428f0596b [Archive] Don't throw away errors for malformed archive members
When adding an archive member with a problem, e.g. a new bitcode with an
old archiver, containing an unsupported attribute, or an ELF file with a
malformed symbol table, the archiver would throw away the error and
simply add the member to the archive without any symbol entries. This
meant that the resultant archive could be silently unusable when not
using --whole-archive, and result in unexpected undefined symbols.

This change fixes this issue by addressing two FIXMEs and only throwing
away not-an-object errors. However, this meant that some LLD tests which
didn't need symbol tables and were using invalid members deliberately to
test the linker's malformed input handling no longer worked, so this
patch also stops the archiver from looking for symbols in an object if
it doesn't require a symbol table, and updates the tests accordingly.

Differential Revision: https://reviews.llvm.org/D88288

Reviewed by: grimar, rupprecht, MaskRay
2020-10-01 14:03:34 +01:00
Sjoerd Meijer
c208ece583 [LoopFlatten] Add a loop-flattening pass
This is a simple pass that flattens nested loops.  The intention is to optimise
loop nests like this, which together access an array linearly:

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
      f(A[i*M+j]);

into one loop:

  for (int i = 0; i < (N*M); ++i)
    f(A[i]);

It can also flatten loops where the induction variables are not used in the
loop. This can help with code size and runtime, especially on simple CPUs
without advanced branch prediction.

This is only worth flattening if the induction variables are only used in an
expression like i*M+j. If they had any other uses, we would have to insert a
div/mod to reconstruct the original values, so this wouldn't be profitable.

This partially fixes PR40581, as this pass triggers on one of the two cases. I
will follow up on this to teach LoopFlatten a few more (small) tricks. Please
note that LoopFlatten is not yet enabled by default.
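
As a hedged illustration (hypothetical code, not taken from the patch): if the inner induction variable has any use besides the i*M+j index expression, the flattened form would need a div/mod to reconstruct it, which is why such nests are not considered profitable:

  static void f(int x, int j) { (void)x; (void)j; }

  void g(int *A, int N, int M) {
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < M; ++j)
        f(A[i*M+j], j);        // extra use of j besides the index expression
  }

  // A flattened version would have to recover j with a div/mod per iteration:
  void g_flat(int *A, int N, int M) {
    for (int k = 0; k < N*M; ++k)
      f(A[k], k % M);
  }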

Patch by Oliver Stannard, with minor tweaks from Dave Green and myself.

Differential Revision: https://reviews.llvm.org/D42365
2020-10-01 13:54:45 +01:00
Andrew Paverd
ee355f6f7f [CFGuard] Add address-taken IAT tables and delay-load support
This patch adds support for creating Guard Address-Taken IAT Entry Tables (.giats$y sections) in object files, matching the behavior of MSVC. These contain lists of address-taken imported functions, which are used by the linker to create the final GIATS table.
Additionally, if any DLLs are delay-loaded, the linker must look through the .giats tables and add the respective load thunks of address-taken imports to the GFIDS table, as these are also valid call targets.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D87544
2020-10-01 12:45:07 +01:00
Max Kazantsev
407ffa5a5f [SCEV] Prove implications via AddRec start
If we know that some predicate is true for an AddRec and an invariant
(w.r.t. this AddRec's loop), this fact is, in particular, true on the first
iteration. We can try to prove the facts we need using the start value.

The motivating example is proving things like
```
  isImpliedCondOperands(>=, X, 0, {X,+,-1}, 0)
```
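
As a hedged C illustration (names invented, not from the patch): the loop variable below is the AddRec {x,+,-1}, and the guard makes x >= 0 an invariant fact, so on the first iteration the same fact holds for the AddRec via its start value:

```
int sum_down(const int *a, int x) {
  int sum = 0;
  if (x >= 0) {                   // invariant fact w.r.t. the loop below
    for (int i = x; i >= 0; --i)  // i is the AddRec {x,+,-1}; on entry i == x
      sum += a[i];
  }
  return sum;
}
```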

Differential Revision: https://reviews.llvm.org/D88208
Reviewed By: reames
2020-10-01 17:09:38 +07:00
Fangrui Song
3485738376 [MC] Inline MCExpr::printVariantKind & remove UseParensForSymbolVariantBit
Note, MAI may be nullptr in -show-encoding.
2020-10-01 00:10:06 -07:00
Max Kazantsev
6b8dcdbad5 [SCEV][NFC] Introduce isKnownPredicateAt method
We can query known predicates at different points, respecting
their dominating conditions.
2020-10-01 12:11:24 +07:00
Arthur Eubanks
e702df7d7f [WholeProgramDevirt][NewPM] Add NPM testing path to match legacy pass
The legacy pass's default constructor sets UseCommandLine = true and
goes down a separate testing route. Match that in the NPM pass.

This fixes all tests in llvm/test/Transforms/WholeProgramDevirt under NPM.

Reviewed By: ychen

Differential Revision: https://reviews.llvm.org/D88588
2020-09-30 17:27:37 -07:00
Reid Kleckner
03ab9ed5e9 Re-land "[PDB] Merge types in parallel when using ghashing"
Stored Error objects have to be checked, even if they are success
values.

This reverts commit 8d250ac3cd48d0f17f9314685a85e77895c05351.
Relands commit 49b3459930655d879b2dc190ff8fe11c38a8be5f.

Original commit message:
-----------------------------------------

This makes type merging much faster (-24% on chrome.dll) when multiple
threads are available, but it slightly increases the time to link (+10%)
when /threads:1 is passed. With only one more thread, the new type
merging is faster (-11%). The output PDB should be identical to what it
was before this change.

To give an idea, here is the /time output placed side by side:
                              BEFORE    | AFTER
  Input File Reading:           956 ms  |  968 ms
  Code Layout:                  258 ms  |  190 ms
  Commit Output File:             6 ms  |    7 ms
  PDB Emission (Cumulative):   6691 ms  | 4253 ms
    Add Objects:               4341 ms  | 2927 ms
      Type Merging:            2814 ms  | 1269 ms  -55%!
      Symbol Merging:          1509 ms  | 1645 ms
    Publics Stream Layout:      111 ms  |  112 ms
    TPI Stream Layout:          764 ms  |   26 ms  trivial
    Commit to Disk:            1322 ms  | 1036 ms  -300ms
----------------------------------------- --------
Total Link Time:               8416 ms    5882 ms  -30% overall

The main source of the additional overhead in the single-threaded case
is the need to iterate all .debug$T sections up front to check which
type records should go in the IPI stream. See fillIsItemIndexFromDebugT.
With changes to the .debug$H section, we could pre-calculate this info
and eliminate the need to do this walk up front. That should restore
single-threaded performance back to what it was before this change.

This change will cause LLD to be much more parallel than it used to, and
for users who do multiple links in parallel, it could regress
performance. However, when the user is only doing one link, it's a huge
improvement. In the future, we can use NT worker threads to avoid
oversaturating the machine with work, but for now, this is such an
improvement for the single-link use case that I think we should land
this as is.

Algorithm
----------

Before this change, we essentially used a
DenseMap<GloballyHashedType, TypeIndex> to check if a type has already
been seen, and if it hasn't been seen, insert it now and use the next
available type index for it in the destination type stream. DenseMap
does not support concurrent insertion, and even if it did, the linker
must be deterministic: it cannot produce different PDBs by using
different numbers of threads. The output type stream must be in the same
order regardless of the order of hash table insertions.

In order to create a hash table that supports concurrent insertion, the
table cells must be small enough that they can be updated atomically.
The algorithm I used for updating the table using linear probing is
described in this paper, "Concurrent Hash Tables: Fast and General(?)!":
https://dl.acm.org/doi/10.1145/3309206

The GHashCell in this change is essentially a pair of 32-bit integer
indices: <sourceIndex, typeIndex>. The sourceIndex is the index of the
TpiSource object, and it represents an input type stream. The typeIndex
is the index of the type in the stream. Together, we have something like
a ragged 2D array of ghashes, which can be looked up as:
  tpiSources[tpiSrcIndex]->ghashes[typeIndex]

By using these side tables, we can omit the key data from the hash
table, and keep the table cell small. There is a cost to this: resolving
hash table collisions requires many more loads than simply looking at
the key in the same cache line as the insertion position. However, most
supported platforms should have a 64-bit CAS operation to update the
cell atomically.

To make the result of concurrent insertion deterministic, the cell
payloads must have a priority function. Defining one is pretty
straightforward: compare the two 32-bit numbers as a combined 64-bit
number. This means that types coming from inputs earlier on the command
line have a higher priority and are more likely to appear earlier in the
final PDB type stream than types from an input appearing later on the
link line.

After table insertion, the non-empty cells in the table can be copied
out of the main table and sorted by priority to determine the ordering
of the final type index stream. At this point, item and type records
must be separated, either by sorting or by splitting into two arrays,
and I chose sorting. This is why the GHashCell must contain the isItem
bit.
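
For illustration only, here is a minimal C++ sketch of the cell-and-CAS scheme described above. It assumes a fixed-size table, a ragged per-source ghash array, and 0 reserved as the "empty" cell value for simplicity; the isItem bit and all real LLD details are omitted, and none of the names come from the actual code.

```
#include <atomic>
#include <cstdint>
#include <vector>

using GHash = uint64_t; // stand-in for the real global type hash

struct GHashTableSketch {
  std::vector<std::atomic<uint64_t>> Cells;        // 0 = empty (reserved)
  const std::vector<std::vector<GHash>> &GHashes;  // ghashes[source][type]

  GHashTableSketch(size_t Size, const std::vector<std::vector<GHash>> &GHs)
      : Cells(Size), GHashes(GHs) {}

  static uint64_t pack(uint32_t Src, uint32_t Type) {
    return (uint64_t(Src) << 32) | Type;            // <sourceIndex, typeIndex>
  }
  GHash ghashOf(uint64_t Cell) const {
    return GHashes[uint32_t(Cell >> 32)][uint32_t(Cell)];
  }

  // Returns the cell position; it is stable because there is no rehashing,
  // so a later pass can rewrite the cell to hold the final PDB type index.
  size_t insert(uint32_t Src, uint32_t Type) {
    uint64_t NewCell = pack(Src, Type);
    GHash H = GHashes[Src][Type];
    size_t Slot = H % Cells.size();
    for (;;) {
      uint64_t Cur = Cells[Slot].load();
      if (Cur == 0) {                               // empty: try to claim it
        if (Cells[Slot].compare_exchange_weak(Cur, NewCell))
          return Slot;
        continue;                                   // lost the race; re-check
      }
      if (ghashOf(Cur) == H) {                      // same type seen before:
        while (NewCell < Cur &&                     // keep the higher-priority
               !Cells[Slot].compare_exchange_weak(Cur, NewCell)) {
        }                                           // (earlier-input) occupant
        return Slot;
      }
      Slot = (Slot + 1) % Cells.size();             // collision: linear probe
    }
  }
};
```

The determinism argument carries over: which packed cell ends up in a slot depends only on the cell values, never on thread timing.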

Once the final PDB TPI stream ordering is known, we need to compute a
mapping from source type index to PDB type index. To avoid starting over
from scratch and looking up every type again by its ghash, we save the
insertion position of every hash table insertion during the first
insertion phase. Because the table does not support rehashing, the
insertion position is stable. Using the array of insertion positions
indexed by source type index, we can replace the source type indices in
the ghash table cells with the PDB type indices.

Once the table cells have been updated to contain PDB type indices, the
mapping for each type source can be computed in parallel. Simply iterate
the list of cell positions and replace them with the PDB type index,
since the insertion positions are no longer needed.

Once we have a source to destination type index mapping for every type
source, there are no more data dependencies. We know which type records
are "unique" (not duplicates), and what their final type indices will
be. We can do the remapping in parallel, and accumulate type sizes and
type hashes in parallel by type source.

Lastly, TPI stream layout must be done serially. Accumulate all the type
records, sizes, and hashes, and add them to the PDB.

Differential Revision: https://reviews.llvm.org/D87805
2020-09-30 15:44:38 -07:00
Reid Kleckner
fb7d976110 Revert "[PDB] Merge types in parallel when using ghashing"
This reverts commit 49b3459930655d879b2dc190ff8fe11c38a8be5f.
2020-09-30 14:55:32 -07:00
Reid Kleckner
9574ece4b9 [PDB] Merge types in parallel when using ghashing
This makes type merging much faster (-24% on chrome.dll) when multiple
threads are available, but it slightly increases the time to link (+10%)
when /threads:1 is passed. With only one more thread, the new type
merging is faster (-11%). The output PDB should be identical to what it
was before this change.

To give an idea, here is the /time output placed side by side:
                              BEFORE    | AFTER
  Input File Reading:           956 ms  |  968 ms
  Code Layout:                  258 ms  |  190 ms
  Commit Output File:             6 ms  |    7 ms
  PDB Emission (Cumulative):   6691 ms  | 4253 ms
    Add Objects:               4341 ms  | 2927 ms
      Type Merging:            2814 ms  | 1269 ms  -55%!
      Symbol Merging:          1509 ms  | 1645 ms
    Publics Stream Layout:      111 ms  |  112 ms
    TPI Stream Layout:          764 ms  |   26 ms  trivial
    Commit to Disk:            1322 ms  | 1036 ms  -300ms
----------------------------------------- --------
Total Link Time:               8416 ms    5882 ms  -30% overall

The main source of the additional overhead in the single-threaded case
is the need to iterate all .debug$T sections up front to check which
type records should go in the IPI stream. See fillIsItemIndexFromDebugT.
With changes to the .debug$H section, we could pre-calculate this info
and eliminate the need to do this walk up front. That should restore
single-threaded performance back to what it was before this change.

This change will cause LLD to be much more parallel than it used to, and
for users who do multiple links in parallel, it could regress
performance. However, when the user is only doing one link, it's a huge
improvement. In the future, we can use NT worker threads to avoid
oversaturating the machine with work, but for now, this is such an
improvement for the single-link use case that I think we should land
this as is.

Algorithm
----------

Before this change, we essentially used a
DenseMap<GloballyHashedType, TypeIndex> to check if a type has already
been seen, and if it hasn't been seen, insert it now and use the next
available type index for it in the destination type stream. DenseMap
does not support concurrent insertion, and even if it did, the linker
must be deterministic: it cannot produce different PDBs by using
different numbers of threads. The output type stream must be in the same
order regardless of the order of hash table insertions.

In order to create a hash table that supports concurrent insertion, the
table cells must be small enough that they can be updated atomically.
The algorithm I used for updating the table using linear probing is
described in this paper, "Concurrent Hash Tables: Fast and General(?)!":
https://dl.acm.org/doi/10.1145/3309206

The GHashCell in this change is essentially a pair of 32-bit integer
indices: <sourceIndex, typeIndex>. The sourceIndex is the index of the
TpiSource object, and it represents an input type stream. The typeIndex
is the index of the type in the stream. Together, we have something like
a ragged 2D array of ghashes, which can be looked up as:
  tpiSources[tpiSrcIndex]->ghashes[typeIndex]

By using these side tables, we can omit the key data from the hash
table, and keep the table cell small. There is a cost to this: resolving
hash table collisions requires many more loads than simply looking at
the key in the same cache line as the insertion position. However, most
supported platforms should have a 64-bit CAS operation to update the
cell atomically.

To make the result of concurrent insertion deterministic, the cell
payloads must have a priority function. Defining one is pretty
straightforward: compare the two 32-bit numbers as a combined 64-bit
number. This means that types coming from inputs earlier on the command
line have a higher priority and are more likely to appear earlier in the
final PDB type stream than types from an input appearing later on the
link line.

After table insertion, the non-empty cells in the table can be copied
out of the main table and sorted by priority to determine the ordering
of the final type index stream. At this point, item and type records
must be separated, either by sorting or by splitting into two arrays,
and I chose sorting. This is why the GHashCell must contain the isItem
bit.

Once the final PDB TPI stream ordering is known, we need to compute a
mapping from source type index to PDB type index. To avoid starting over
from scratch and looking up every type again by its ghash, we save the
insertion position of every hash table insertion during the first
insertion phase. Because the table does not support rehashing, the
insertion position is stable. Using the array of insertion positions
indexed by source type index, we can replace the source type indices in
the ghash table cells with the PDB type indices.

Once the table cells have been updated to contain PDB type indices, the
mapping for each type source can be computed in parallel. Simply iterate
the list of cell positions and replace them with the PDB type index,
since the insertion positions are no longer needed.

Once we have a source to destination type index mapping for every type
source, there are no more data dependencies. We know which type records
are "unique" (not duplicates), and what their final type indices will
be. We can do the remapping in parallel, and accumulate type sizes and
type hashes in parallel by type source.

Lastly, TPI stream layout must be done serially. Accumulate all the type
records, sizes, and hashes, and add them to the PDB.

Differential Revision: https://reviews.llvm.org/D87805
2020-09-30 14:22:48 -07:00
Arthur Eubanks
6bf4e16602 [NPM] Add target specific hook to add passes for New Pass Manager
The patch adds a new TargetMachine member "registerPassBuilderCallbacks" for targets to add passes to the pass pipeline using the New Pass Manager (similar to adjustPassManager for the Legacy Pass Manager).

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D88138
2020-09-30 13:29:43 -07:00
Joseph Huber
48d104847a Revert "[OpenMP] Replace OpenMP RTL Functions With OMPIRBuilder and OMPKinds.def"
Failing tests on Arm due to the tests automatically populating
incompatible pointer-width architectures. Reverting until the tests are
updated. Failing tests:

OpenMP/distribute_parallel_for_num_threads_codegen.cpp
OpenMP/distribute_parallel_for_if_codegen.cpp
OpenMP/distribute_parallel_for_simd_if_codegen.cpp
OpenMP/distribute_parallel_for_simd_num_threads_codegen.cpp
OpenMP/target_teams_distribute_parallel_for_if_codegen.cpp
OpenMP/target_teams_distribute_parallel_for_simd_if_codegen.cpp
OpenMP/teams_distribute_parallel_for_if_codegen.cpp
OpenMP/teams_distribute_parallel_for_simd_if_codegen.cpp

This reverts commit 90eaedda9b8ef46e2c0c1b8bce33e98a3adbb68c.
2020-09-30 15:12:21 -04:00
Rahman Lavaee
32c4fd8ef6 Exception support for basic block sections
This is part of the Propeller framework to do post link code layout optimizations. Please see the RFC here: https://groups.google.com/forum/#!msg/llvm-dev/ef3mKzAdJ7U/1shV64BYBAAJ and the detailed RFC doc here: https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

This patch provides exception support for basic block sections by splitting the call-site table into call-site ranges corresponding to different basic block sections. Still, all landing pads must reside in the same basic block section (which is guaranteed by the core basic block section patch D73674 (ExceptionSection)). Each call-site table will refer to the landing pad fragment by explicitly specifying @LPstart (which is omitted in the normal non-basic-block section case). All these call-site tables will share their action and type tables.

The C++ ABI somehow assumes that no landing pads point directly to LPStart (which works in the normal case since the function begin is never a landing pad), and uses LP.offset = 0 to specify no landing pad. In the case of basic block section where one section contains all the landing pads, the landing pad offset relative to LPStart could actually be zero. Thus, we avoid zero-offset landing pads by inserting a **nop** operation as the first non-CFI instruction in the exception section.

**Background on Exception Handling in C++ ABI**
https://github.com/itanium-cxx-abi/cxx-abi/blob/master/exceptions.pdf

The compiler emits an exception table for every function. When an exception is thrown, the stack unwinding library queries the unwind table (which includes the start and end of each function) to locate the exception table for that function.

The exception table includes a call site table for the function, which is used to guide the exception handling runtime to take the appropriate action upon an exception. Each call site record in this table is structured as follows:

| CallSite              |  -->  Position of the call site (relative to the function entry)
| CallSite length       |  -->  Length of the call site.
| Landing Pad           |  -->  Position of the landing pad (relative to the landing pad fragment's begin label)
| Action record offset  |  -->  Position of the first action record

The call site records partition a function into different pieces and describe what action must be taken for each callsite. The callsite fields are relative to the start of the function (as captured in the unwind table).

The landing pad entry is a reference into the function and corresponds roughly to the catch block of a try/catch statement. When execution resumes at a landing pad, it receives an exception structure and a selector value corresponding to the type of the exception thrown, and executes similarly to a switch-case statement. The landing pad field is relative to the beginning of the procedure fragment which includes all the landing pads (@LPStart). The C++ ABI requires all landing pads to be in the same fragment. Nonetheless, without basic block sections, @LPStart is the same as the function @Start (found in the unwind table) and can be omitted.

The action record offset is an index into the action table which includes information about which exception types are caught.

**C++ Exceptions with Basic Block Sections**
Basic block sections break the contiguity of a function fragment. Therefore, call sites must be specified relative to the beginning of the basic block section. Furthermore, the unwinding library should be able to find the corresponding callsites for each section. To do so, the .cfi_lsda directive for a section must point to the range of call-sites for that section.
This patch introduces a new **CallSiteRange** structure which specifies the range of call-sites which correspond to every section:

  struct CallSiteRange {
    // Symbol marking the beginning of the procedure fragment.
    MCSymbol *FragmentBeginLabel = nullptr;
    // Symbol marking the end of the procedure fragment.
    MCSymbol *FragmentEndLabel = nullptr;
    // LSDA symbol for this call-site range.
    MCSymbol *ExceptionLabel = nullptr;
    // Index of the first call-site entry in the call-site table which
    // belongs to this range.
    size_t CallSiteBeginIdx = 0;
    // Index just after the last call-site entry in the call-site table which
    // belongs to this range.
    size_t CallSiteEndIdx = 0;
    // Whether this is the call-site range containing all the landing pads.
    bool IsLPRange = false;
  };

With N basic-block-sections, the call-site table is partitioned into N call-site ranges.

Conceptually, we emit the call-site ranges for sections sequentially in the exception table, as if each section had its own exception table. In the example below, two sections result in two call site ranges (denoted by LSDA1 and LSDA2) placed next to each other. However, their call-sites will refer to records in the shared Action Table. We also emit the header fields (@LPStart and CallSite Table Length) for each call site range in order to place the call site ranges in separate LSDAs. Note that with -basic-block-sections, the CallSiteTableLength will not actually represent the length of the call site table, but rather the reference to the action table. Since the only purpose of this field is to locate the action table, correctness is guaranteed.

Finally, every call site range has one @LPStart pointer so the landing pads of each section must all reside in one section (not necessarily the same section). To make this easier, we decide to place all landing pads of the function in one section (hence the `IsLPRange` field in CallSiteRange).

| @LPStart               |  --->  Landing pad fragment  (LSDA1 points here)
| CallSite Table Length  |  --->  Used to find the action table.
| CallSites              |
| …                      |
| …                      |
| @LPStart               |  --->  Landing pad fragment  (LSDA2 points here)
| CallSite Table Length  |
| CallSites              |
| …                      |
| …                      |
…
…
| Action Table           |
| Types Table            |

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D73739
2020-09-30 11:05:55 -07:00
Joseph Huber
271aadf0d3 [OpenMP] Replace OpenMP RTL Functions With OMPIRBuilder and OMPKinds.def
Summary:
Replace the OpenMP Runtime Library functions used in CGOpenMPRuntimeGPU
for OpenMP device code generation with ones in OMPKinds.def and use
OMPIRBuilder for generating runtime calls. This allows us to consolidate
more OpenMP code generation into the OMPIRBuilder. This patch also
invalidates specifying target architectures with conflicting pointer
sizes.

Reviewers: jdoerfert

Subscribers: aaron.ballman cfe-commits guansong llvm-commits sstefan1 yaxunl

Tags: #OpenMP #Clang #LLVM

Differential Revision: https://reviews.llvm.org/D88430
2020-09-30 14:00:01 -04:00
Simon Moll
5ebd974f16 [DA][SDA] SyncDependenceAnalysis re-write
This patch achieves two things:
1. It breaks up the `join_blocks` interface between the SDA and the DA to
   return two separate sets for divergent loop exits and divergent,
   disjoint path joins.
2. It updates the SDA algorithm to run in O(n) time and improves the
   precision on divergent loop exits.

This fixes `https://bugs.llvm.org/show_bug.cgi?id=46372` (by virtue of
the improved `join_blocks` interface) and revealed an imprecise expected
result in the `Analysis/DivergenceAnalysis/AMDGPU/hidden_loopdiverge.ll`
test.

Reviewed By: sameerds

Differential Revision: https://reviews.llvm.org/D84413
2020-09-30 17:36:26 +02:00
Mircea Trofin
e13a3a826a [NFC][regalloc] Make VirtRegAuxInfo part of allocator state
All the state of VRAI is allocator-wide, so we can avoid creating it
every time we need it. In addition, the normalization function is
allocator-specific. In a later change, we can simplify that design in
favor of just having it as a virtual member.

Differential Revision: https://reviews.llvm.org/D88499
2020-09-30 08:13:05 -07:00
Xiang1 Zhang
92910706f0 [X86] Support Intel Key Locker
Key Locker provides a mechanism to encrypt and decrypt data with an AES key without having access
to the raw key value by converting AES keys into “handles”. These handles can be used to perform the
same encryption and decryption operations as the original AES keys, but they only work on the current
system and only until they are revoked. If software revokes Key Locker handles (e.g., on a reboot),
then any previous handles can no longer be used.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D88398
2020-09-30 18:08:45 +08:00
Hubert Tong
6480f49006 [AIX] asm output: use character literals in byte lists for strings
This patch improves the assembly output produced for string literals by
using character literals in byte lists. This provides the benefits of
having printable characters appear as such in the assembly output and of
having strings kept as logical units on the same line.

Reviewed By: daltenty

Differential Revision: https://reviews.llvm.org/D80953
2020-09-29 21:14:41 -04:00
Eric Astor
5230c79252 [ms] [llvm-ml] Add support for "alias" directive
Support the "alias" directive.

Required support for emitWeakReference in MCWinCOFFStreamer.

Reviewed By: thakis

Differential Revision: https://reviews.llvm.org/D87403
2020-09-29 16:59:50 -04:00
Eric Astor
91cac48098 [ms] [llvm-ml] Add REAL10 support (x87 extended precision)
Add MASM support for 80-bit reals in the x87 extended precision format.

Reviewed By: thakis

Differential Revision: https://reviews.llvm.org/D87402
2020-09-29 16:58:46 -04:00
Eric Astor
95792b92ea [ms] [llvm-ml] Add MASM hex float support
Implement MASM's syntax for specifying floats in raw hexadecimal bytes.

Reviewed By: thakis

Differential Revision: https://reviews.llvm.org/D87401
2020-09-29 16:57:32 -04:00
Eric Astor
ccf36de799 [ms] [llvm-ml] Add support for .radix directive, and accept all radix specifiers
Add support for .radix directive, and radix specifiers [yY] (binary), [oOqQ] (octal), and [tT] (decimal).

Also, when lexing MASM integers, require a radix specifier; MASM requires that all literals without a radix specifier be treated as in the default radix (e.g., 0100 = 100).

Relanding D87400, now with fewer ms-inline-asm tests broken!

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D88337
2020-09-29 16:55:51 -04:00
Zequan Wu
1b632f5114 [CodeGen] emit CG profile for COFF object file
Differential Revision: https://reviews.llvm.org/D87811
2020-09-29 12:03:30 -07:00
Simon Pilgrim
11e1b3f795 [InstCombine] visitTrunc - trunc (lshr (sext A), C) --> (ashr A, C) non-uniform support
This came from @lebedev.ri's suggestion to use m_SpecificInt_ICMP for D88429 - since I was going to change the m_APInt to m_Constant for that patch I thought I would do it for the only other user of the APInt first.

I've added a ConstantExpr::getUMin helper - it's trivial to add UMAX/SMIN/SMAX but I thought I'd wait until we have use cases.

Differential Revision: https://reviews.llvm.org/D88475
2020-09-29 15:01:16 +01:00
Jay Foad
938bf31fa3 [SDag] Refactor and simplify divergence calculation and checking. NFC. 2020-09-29 14:05:07 +01:00
sstefan1
e47fd785ee [OpenMPOpt][Fix] Only initialize ICV initial values once.
Reviewers: jdoerfert, ggeorgakoudis

Differential Revision: https://reviews.llvm.org/D88441
2020-09-29 12:22:58 +02:00
Max Kazantsev
41907f58c6 [SCEV][NFC] Introduce isBasicBlockEntryGuardedByCond
Currently, we have the `isLoopEntryGuardedByCond` method in SCEV, which
checks that some fact is true if we enter the loop. This is just a
particular case of the more general concept `isBasicBlockEntryGuardedByCond`
applied to the given loop's header. The logic of this code is largely
independent of the given loop and only cares about the code above it.

This patch makes this generalization. Now we can query it for any block,
and `isLoopEntryGuardedByCond` becomes just a particular case.
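
As a hedged illustration (hypothetical code, not from the patch), the generalized query can exploit a fact that guards entry to an arbitrary block, not just a loop header:

```
int guarded(int n) {
  int r = 0;
  if (n > 0) {
    // Entry to this block is guarded by n > 0, even though it is not a
    // loop header, so a block-level query can rely on that fact here.
    r = 100 / n;
  }
  return r;
}
```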

Differential Revision: https://reviews.llvm.org/D87828
Reviewed By: fhahn
2020-09-29 15:53:45 +07:00
Tres Popp
59b6daf823 Revert "OpaquePtr: Add type to sret attribute"
This reverts commit 55c4ff91bd820d72014f63dcf7f3d5a0d3397986.

Issues were introduced as discussed in https://reviews.llvm.org/D88241
where this change made previous bugs in the linker and BitCodeWriter
visible.
2020-09-29 10:31:04 +02:00
Amara Emerson
09394476cd [AArch64][GlobalISel] Scalarize <2 x s64> G_MUL since we don't have native support for it.
Differential Revision: https://reviews.llvm.org/D88437
2020-09-28 19:29:45 -07:00
Yonghong Song
b27683b417 BPF: move AbstractMemberAccess and PreserveDIType passes to EP_EarlyAsPossible
Move the AbstractMemberAccess and PreserveDIType passes as early as
possible, right after clang code generation.

Currently, the compiler may transform code like
  p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
  p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
  a = llvm.bpf.builtin.preserve_field_info(p2, EXIST);
  if (a) {
    p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
    p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
    bpf_probe_read(buf, buf_size, p2);
  }
to
  p1 = llvm.bpf.builtin.preserve.struct.access(base, 0, 0);
  p2 = llvm.bpf.builtin.preserve.struct.access(p1, 1, 2);
  a = llvm.bpf.builtin.preserve_field_info(p2, EXIST);
  if (a) {
    bpf_probe_read(buf, buf_size, p2);
  }
and eventually assembly code looks like
  reloc_exist = 1;
  reloc_member_offset = 10; //calculate member offset from base
  p2 = base + reloc_member_offset;
  if (reloc_exist) {
    bpf_probe_read(bpf, buf_size, p2);
  }
If, during libbpf relocation resolution, reloc_exist is actually
resolved to 0 (does not exist), the reloc_member_offset relocation cannot
be resolved and will be patched with an illegal instruction.
This will cause a verifier failure.

This patch attempts to address this issue by doing chaining
analysis and replacing chains with special globals right
after clang code gen. This removes the CSE possibility
described above. The IR typically looks like
  %6 = load @llvm.sk_buff:0:50$0:0:0:2:0
  %7 = bitcast %struct.sk_buff* %2 to i8*
  %8 = getelementptr i8, i8* %7, %6
for a particular address computation relocation.

But this transformation has another consequence: code sinking
may happen, like below:
  PHI = <possibly different @preserve_*_access_globals>
  %7 = bitcast %struct.sk_buff* %2 to i8*
  %8 = getelementptr i8, i8* %7, %6

For such cases, we will not be able to generate relocations since
multiple relocations are merged into one.

This patch introduces a passthrough builtin
to prevent such optimization. Inline assembly seems to have more
impact on optimization (e.g., inlining); using the passthrough builtin
has less impact on optimizations.

A new IR pass is introduced at the beginning of target-dependent
IR optimization, which does:
  - report fatal error if any reloc global in PHI nodes
  - remove all bpf passthrough builtin functions

Changes for existing CORE tests:
  - for clang tests, add "-Xclang -disable-llvm-passes" flags to
    avoid builtin->reloc_global transformation so the test is still
    able to check correctness for clang generated IR.
  - for llvm CodeGen/BPF tests, add "opt -O2 <ir_file> | llvm-dis" command
    before "llc" command since "opt" is needed to call newly-placed
    builtin->reloc_global transformation. Add target triple in the IR
    file since "opt" requires it.
  - Since target triple is added in IR file, if a test may produce
    different results for different endianness, two tests will be
    created, one for bpfeb and another for bpfel, e.g., some tests
    for relocation of lshift/rshift of bitfields.
  - field-reloc-bitfield-1.ll has different relocations compared to
    the old code. This is because, for the structure in the test, the
    new code returns struct layout alignment 4 while the old code
    returned 8. Align 8 is more precise and permits a double load.
    With align 4, the new mechanism uses a 4-byte load, so it generates
    different relocations.
  - The test intrinsic-transforms.ll is removed. It was used to test
    CSE on intrinsics so we do not lose metadata. Now that metadata is attached
    to the global and not the instruction, it won't get lost with CSE.

Differential Revision: https://reviews.llvm.org/D87153
2020-09-28 16:56:22 -07:00
Amara Emerson
635632451e [GlobalISel] Add support for lowering of vector G_SELECT and use for AArch64.
The lowering is a port of the SDAG expansion.

Differential Revision: https://reviews.llvm.org/D88364
2020-09-28 14:00:46 -07:00
Benjamin Kramer
46460daf85 [wasm] Move WasmTraits.h to BinaryFormat
There's no dependency on Object in there and this avoids a cyclic
dependency between libMC and libObject.
2020-09-28 22:07:28 +02:00
Sanjay Patel
068a0e3768 [CostModel] remove hack for intrinsic cost based on cost type
This hack seems to only have been necessary because of the
constructor bug noted in 33125cffd.

Once again, it's hard to prove NFC, but that's the hope...
2020-09-28 15:58:42 -04:00
Sanjay Patel
1d7afc6eeb [CostModel] move early exit for free intrinsics
This should be NFC unless some target was expecting that
some form of cttz/ctlz/memcpy is free in terms of size/latency
but not free in throughput cost.
2020-09-28 13:30:55 -04:00
Sanjay Patel
2915eafc46 [CostModel] split handling of intrinsics from other calls
This should be close to NFC (no-functional-change), but I
can't completely rule out that some call on some target
travels down a different path. There's an especially large
amount of code spaghetti in this part of the cost model.

The goal is to clean up the intrinsic cost handling so
we can canonicalize to the new min/max intrinsics without
causing regressions.
2020-09-28 13:30:55 -04:00
Jessica Paquette
7a97485533 [GlobalISel] Combine (xor (and x, y), y) -> (and (not x), y)
When we see this:

```
%and = G_AND %x, %y
%xor = G_XOR %and, %y
```

Produce this:

```
%not = G_XOR %x, -1
%new_and = G_AND %not, %y
```

as long as we are guaranteed to eliminate the original G_AND.

Also matches all commuted forms. E.g.

```
%and = G_AND %y, %x
%xor = G_XOR %y, %and
```

will be matched as well.

Differential Revision: https://reviews.llvm.org/D88104
2020-09-28 10:08:14 -07:00
Paul C. Anagnostopoulos
90d4bf8784 [TableGen] Improved messages in PseudoLoweringEmitter. 2020-09-28 10:18:22 -04:00
Georgii Rymar
b76a0e7b80 [yaml2obj][obj2yaml] - Add a support for SHT_ARM_EXIDX section.
This adds support for SHT_ARM_EXIDX sections to the obj2yaml/yaml2obj tools.

SHT_ARM_EXIDX is an ARM-specific index table filled with entries.
Each entry consists of two 4-byte values (words).
(https://developer.arm.com/documentation/ihi0038/c/?lang=en#index-table-entries)

Differential revision: https://reviews.llvm.org/D88228
2020-09-28 11:45:49 +03:00
Benjamin Kramer
d83ca05fea [Coroutines] Remove unused includes. NFC. 2020-09-28 10:27:23 +02:00
Chuanqi Xu
5802e5931e [Coroutines] Reuse storage for local variables with non-overlapping lifetimes
Bug 45566 shows that the process of building the coroutine frame doesn't
consider that the lifetimes of different local variables may not overlap,
which means the compiler could generate a smaller frame.

This patch calculates the lifetime range of each alloca with the
StackLifetime class, then builds non-overlapping sets of allocas whose
lifetime ranges do not overlap. We use the largest type in a
non-overlapping set as the field type in the frame. In the insertSpills
process, if the field type is not the same as the alloca's type,
we cast the pointer to the field type to a pointer to the alloca type.
Since the lifetime ranges of the allocas in one non-overlapping set do not
overlap with each other, it should be OK to reuse the storage space
in the frame.
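
A hedged, self-contained C++20 sketch of the idea (the Task type is boilerplate invented for the example and is not from the patch): the two block-scoped arrays never have overlapping lifetimes, so with this change their frame storage can be shared:

```
#include <array>
#include <coroutine>
#include <cstdio>

struct Task {
  struct promise_type {
    Task get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

Task demo() {
  {
    std::array<char, 4096> a{};    // lifetime ends with this block
    std::puts("using a");
    co_await std::suspend_never{}; // stand-in for a real suspension point
  }
  {
    std::array<char, 4096> b{};    // does not overlap with a: may reuse a's slot
    std::puts("using b");
    co_await std::suspend_never{};
  }
}

int main() { demo(); }
```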

Test plan: check-llvm, check-clang, cppcoro, folly

Reviewers: junparser, lxfind, modocache

Differential Revision: https://reviews.llvm.org/D87596
2020-09-28 15:48:00 +08:00
David Sherwood
0927cfa9f6 [SVE] Replace / operator in TypeSize/ElementCount with divideCoefficientBy
After some recent upstream discussion we decided that it was best
to avoid having the / operator for both ElementCount and TypeSize,
since this could give the impression that these classes can be used
in the same way as basic integer types. However, division
for scalable types is a bit odd because we are only dividing the
minimum quantity by a value, as opposed to something like:

  (MinSize * Vscale) / SomeValue

This is why, when performing division, it's important that the caller
first establishes whether the operation makes sense, perhaps by
calling isKnownMultipleOf() prior to division. The caller must now
explicitly call divideCoefficientBy() on the class to perform the
operation.
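
A hedged sketch of the underlying idea (this is not the LLVM TypeSize/ElementCount code; names and members are invented): a scalable quantity is a coefficient scaled by vscale, and only the coefficient is divided, which is only meaningful when it is a known multiple of the divisor:

```
#include <cassert>
#include <cstdint>

struct ScalableSizeSketch {
  uint64_t MinVal; // coefficient of vscale
  bool Scalable;   // true => the real quantity is MinVal * vscale

  bool isKnownMultipleOf(uint64_t D) const { return D != 0 && MinVal % D == 0; }

  // Divides only the minimum quantity; callers are expected to check
  // isKnownMultipleOf() first, mirroring the explicit API described above.
  ScalableSizeSketch divideCoefficientBy(uint64_t D) const {
    assert(isKnownMultipleOf(D) && "division would not be exact");
    return {MinVal / D, Scalable};
  }
};
```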

Differential Revision: https://reviews.llvm.org/D87700
2020-09-28 08:03:00 +01:00
Arthur Eubanks
546a7d793e Revert "Reland [CodeGen] emit CG profile for COFF object file"
This reverts commit 506b6170cb513f1cb6e93a3b690c758f9ded18ac.

This still causes link errors, see https://crbug.com/1130780.
2020-09-27 22:43:14 -07:00
Nikita Popov
9408648c10 [LVI][CVP] Use block value when simplifying icmps
Add a flag to getPredicateAt() that allows making use of the block
value. This allows us to take into account range information from
the current block, rather than only information that is threaded
over edges, making the icmp simplification in CVP a lot more
powerful.
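
As a hedged illustration (hypothetical source, not one of the tests), this is the flavor of simplification that block values enable: the inner compare follows only from the range of x established within the current block, not from edge information threaded for x + 5 itself:

```
int in_range(int x) {
  if (x >= 10 && x < 20) {
    if (x + 5 > 0)   // always true given the block value of x: x is in [10, 20)
      return 1;
    return -1;       // dead once the compare above is folded
  }
  return 0;
}
```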

I'm not changing getPredicateAt() to use the block value
unconditionally to avoid any impact on the JumpThreading pass,
which is somewhat picky about LVI query order.

Most test changes here are just icmps that now get dropped (while
previously only a result used in a return was replaced). The three
tests in icmp.ll show some representative improvements. Some of
the folds this enables have been covered by IPSCCP in the meantime,
but LVI can reason about some cases which are hard to support in
IPSCCP, such as in test_br_cmp_with_offset.

The compile-time cost of doing this is fairly minimal, with
a ~0.05% CTMark regression for ReleaseThinLTO:
https://llvm-compile-time-tracker.com/compare.php?from=709d03f8af4da4204849a70f01798e7cebba2e32&to=6236fd503761f43c99f4537121e057a01056f185&stat=instructions

This is because the block values will typically already be queried
and cached by other CVP optimizations anyway.

Differential Revision: https://reviews.llvm.org/D69686
2020-09-27 20:25:16 +02:00
Fangrui Song
cd74c48503 [NewPM] Port ConstraintElimination to the new pass manager
If -enable-constraint-elimination is specified, add it to the -O2/-O3 pipeline.
(-O1 uses a separate function now.)

Reviewed By: fhahn, aeubanks

Differential Revision: https://reviews.llvm.org/D88365
2020-09-27 11:12:26 -07:00
Nikita Popov
a436b4c09d [LVI] Require context instruction in external API (NFCI)
Require CxtI in getConstant() and getConstantRange() APIs.
Accordingly drop the BB parameter, as it is implied by
CxtI->getParent().

This makes sure we don't forget to pass the context instruction,
and makes the API contract clearer (also clean up the comments to
that effect -- the value holds at the context instruction, not
the end of the block).
2020-09-27 18:07:24 +02:00
Robert Widmann
8be02c632a [LLVM-C] Turn a ShuffleVector Constant Into a Getter.
It is not a good idea to expose raw constants in the LLVM C API. Replace this with an explicit getter.

Differential Revision: https://reviews.llvm.org/D88367
2020-09-26 17:32:57 -06:00
Paul C. Anagnostopoulos
2483a36836 [TableGen] Add/edit Doxygen comments to match "TableGen Backend Developer's Guide." 2020-09-26 09:09:22 -04:00
Qiu Chaofan
06278e0f7f [SelectionDAG] Add guard to automatically insert flags
This is like FastMathFlagGuard in IR. Since we use the SDAG instance to get
values, this lives with SelectionDAG. By creating a FlagInserter in the current
scope, all values created by getNode will get the flags if no Flags
argument is provided.

In this patch, I applied it to the floating-point operation folding part of
the DAG combiner, and removed Flags passing to getNode to show its effect.
Other places in the DAG combiner and other helper methods similar to getNode
also need this. They can be done in follow-up patches.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D87361
2020-09-26 13:57:52 +08:00
John Demme
789af735fa Common code preparation for tblgen-types patch
Clean up and add methods which https://reviews.llvm.org/D86904 requires. Breaking this up to reduce review load.

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D88267
2020-09-26 02:47:48 +00:00