Honor always_inline attribute when processing -amdgpu-inline-max-bb.
It was lost during the ports of the heuristic. There is no reason
to honor inline hint, but not always inline.
Differential Revision: https://reviews.llvm.org/D97790
Previously we errored out when disassembling illegal instructions and there would be no profile generated. In fact illegal instructions are not uncommon and we'd better skip them and print "unknown" instead of erroring out. This matches the behavior of llvm-objdump (see disassembleObject in llvm-objdump.cpp).
Reviewed By: wlei, wenlei
Differential Revision: https://reviews.llvm.org/D97776
Currently, getSourceFile accesses file system to check if two paths are
the same file with a thread lock, which is a huge performance bottleneck
in some cases. Currently, it's accessing file system size(files) * size(files) times.
Thus, cache file status information, which reduces file system access to size(files) times.
When I tested it with two binaries and 16 cpu cores,
it saved over 70% of time.
Binary 1: 56 secs -> 3 secs
Binary 2: 17 hours -> 4 hours
Differential Revision: https://reviews.llvm.org/D97061
This is almost purely NFC, it just fits more obviously in the flow of the code now that we've standardized on the index different approach. The non-NFC bit is that because of canceling the VariableOffsets in the subtract, we can now handle the case where both sides involve a common variable offset. This isn't an "interesting" improvement; it just happens to fall out of the natural code structure.
One subtle point - the placement of this above the BaseAlias check is important in the original code as this can return NoAlias even when we can't find a relation between the bases otherwise.
Also added some enhancement TODOs noticed while understanding the existing code.
Note: This is slightly different than the LGTMed version. I fixed the "inbounds" issue Nikita noticed with the original code in e6e5ef4 and rebased this to include the same fix.
Differential Revision: https://reviews.llvm.org/D97520
VirtRegRewriter may sometimes fail to correctly apply the kill flag where necessary,
which causes unecessary code gen on PowerPC. This patch fixes the way masks for
defined lanes are computed and the way mask for used lanes is computed.
Contact albion.fung@ibm.com instead of author for problems related to this commit.
Differential Revision: https://reviews.llvm.org/D92405
This was pointed out in review of D97520 by Nikita, but existed in the original code as well.
The basic issue is that a decomposed GEP expression describes (potentially) more than one getelementptr. The "inbounds" derived UB which justifies this aliasing rule requires that the entire offset be composed of "inbounds" geps. Otherwise, as can be seen in the recently added and changes in this patch test, we can end up with a large commulative offset with only a small sub-offset actually being "inbounds". If that small sub-offset lies within the object, the result was unsound.
We could potentially be fancier here, but for the moment, simply be conservative when any of the GEPs parsed aren't inbounds.
Before we used the same argument as the entry point. The resume partial
function might want to use a different ABI for its context argument
Differential Revision: https://reviews.llvm.org/D97333
This caused miscompiles of Chromium tests for iOS due clobbering of live
registers. See discussion on the code review for details.
> Background:
>
> This fixes a longstanding problem where llvm breaks ARC's autorelease
> optimization (see the link below) by separating calls from the marker
> instructions or retainRV/claimRV calls. The backend changes are in
> https://reviews.llvm.org/D92569.
>
> https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue
>
> What this patch does to fix the problem:
>
> - The front-end adds operand bundle "clang.arc.attachedcall" to calls,
> which indicates the call is implicitly followed by a marker
> instruction and an implicit retainRV/claimRV call that consumes the
> call result. In addition, it emits a call to
> @llvm.objc.clang.arc.noop.use, which consumes the call result, to
> prevent the middle-end passes from changing the return type of the
> called function. This is currently done only when the target is arm64
> and the optimization level is higher than -O0.
>
> - ARC optimizer temporarily emits retainRV/claimRV calls after the calls
> with the operand bundle in the IR and removes the inserted calls after
> processing the function.
>
> - ARC contract pass emits retainRV/claimRV calls after the call with the
> operand bundle. It doesn't remove the operand bundle on the call since
> the backend needs it to emit the marker instruction. The retainRV and
> claimRV calls are emitted late in the pipeline to prevent optimization
> passes from transforming the IR in a way that makes it harder for the
> ARC middle-end passes to figure out the def-use relationship between
> the call and the retainRV/claimRV calls (which is the cause of
> PR31925).
>
> - The function inliner removes an autoreleaseRV call in the callee if
> nothing in the callee prevents it from being paired up with the
> retainRV/claimRV call in the caller. It then inserts a release call if
> claimRV is attached to the call since autoreleaseRV+claimRV is
> equivalent to a release. If it cannot find an autoreleaseRV call, it
> tries to transfer the operand bundle to a function call in the callee.
> This is important since the ARC optimizer can remove the autoreleaseRV
> returning the callee result, which makes it impossible to pair it up
> with the retainRV/claimRV call in the caller. If that fails, it simply
> emits a retain call in the IR if retainRV is attached to the call and
> does nothing if claimRV is attached to it.
>
> - SCCP refrains from replacing the return value of a call with a
> constant value if the call has the operand bundle. This ensures the
> call always has at least one user (the call to
> @llvm.objc.clang.arc.noop.use).
>
> - This patch also fixes a bug in replaceUsesOfNonProtoConstant where
> multiple operand bundles of the same kind were being added to a call.
>
> Future work:
>
> - Use the operand bundle on x86-64.
>
> - Fix the auto upgrader to convert call+retainRV/claimRV pairs into
> calls with the operand bundles.
>
> rdar://71443534
>
> Differential Revision: https://reviews.llvm.org/D92808
This reverts commit ed4718eccb12bd42214ca4fb17d196d49561c0c7.
Some instructions (especially mov+pop instructions) were setting the
wrong operands. For example, the pop instruction had the register set as
a source operand while it is a destination operand (the value is loaded
into the register).
I have found these issues using the machine verifier and using manual
code inspection.
Differential Revision: https://reviews.llvm.org/D97159
The previous expansion used SBCI, which is incorrect because the NEGW
pseudo instruction accepts a DREGS operand (2xGPR8) and SBCI only allows
LD8 registers. One solution could be to correct the NEGW pseudo
instruction, but another solution is to use a different instruction
(sbc) that does accept a GPR8 register and therefore allows more freedom
to the register allocator.
The output now matches avr-gcc for the following code:
int foo(int n) {
return -n;
}
I've found this issue using the machine instruction verifier: it was
complaining about the wrong register class in NEGWRd.mir.
Differential Revision: https://reviews.llvm.org/D97131
These aliases are sometimes used in assembly code and make the code more
readable. They are supported by avr-gcc too.
Differential Revision: https://reviews.llvm.org/D96492
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.
This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.
* Add amdgcn_strict_wqm intrinsic.
* Add a corresponding STRICT_WQM machine instruction.
* The semantic is similar to amdgcn_strict_wwm with a notable difference that not all threads will be forcibly enabled during the computations of the intrinsic's argument, but only all threads in quads that have at least one thread active.
* The difference between amdgc_wqm and amdgcn_strict_wqm, is that in the strict mode an inactive lane will always be enabled irrespective of control flow decisions.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D96258
* Introduce the new intrinsic amdgcn_strict_wwm
* Deprecate the old intrinsic amdgcn_wwm
The change is done for consistency as the "strict"
prefix will become an important, distinguishing factor
between amdgcn_wqm and amdgcn_strictwqm in the future.
The "strict" prefix indicates that inactive lanes do not
take part in control flow, specifically an inactive lane
enabled by a strict mode will always be enabled irrespective
of control flow decisions.
The amdgcn_wwm will be removed, but doing so in two steps
gives users time to switch to the new name at their own pace.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D96257
While the underlying instruction is called image_msaa_load,
the resource must be x component only.
Rename the intrinsic for clarity.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D97829
When commit da108b4ed4e6e7267701e76d5fd3b87609c9ab77 introduced
the CHECK-NEXT directive, it added logic to skip to the next line when
printing a diagnostic if the current matching position is at the end of
a line. This was fine while FileCheck did not support regular expression
but since it does now it can be confusing when the pattern to match
starts with the expectation of a newline (e.g. CHECK-NEXT: {{\n}}foo).
It is also inconsistent with the column information in the diagnostic
which does point to the end of line.
This commit removes this logic altogether, such that failure to match
diagnostic for such cases would show the end of line and be consistent
with the column information. The commit also adapts all existing
testcases accordingly.
Note to reviewers: An alternative approach would be to restrict the code
to only skip to the next line if the first character of the pattern is
known not to match a whitespace-like character. This would respect the
original intent but keep the inconsistency in terms of column info and
requires more code. I've only chosen this current approach by laziness
and would be happy to restrict the logic instead.
Reviewed By: jdenny, jhenderson
Differential Revision: https://reviews.llvm.org/D93341
In some rare circumstances we can be using an undef register for a
compare. When folded into a CBZ/CBNZ the undef flags are lost, leading
to machine verifier problems. This propagates the existing flags to the
new instruction.
The WebAssembly text and binary formats have different operand orders
for the "type" and "table" fields of call_indirect (and
return_call_indirect). In LLVM we use the binary order for the MCInstr,
but when we produce or consume the text format we should use the text
order. For compilation units targetting WebAssembly 1.0 (without the
reference types feature), we omit the table operand entirely.
Differential Revision: https://reviews.llvm.org/D97761
This function isn't exercised in lit tests today today according to
the code coverage report. But will be after the tests in D97543 and
D97559.
Posting this patch to help a crash that Fraser hit.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97582
This is a part of https://reviews.llvm.org/D95835.
One issue is about origin load optimization: see the
comments of useCallbackLoadLabelAndOrigin
@gbalats This change may have some conflicts with your 8bit change. PTAL the change at visitLoad.
Reviewed By: morehouse, gbalats
Differential Revision: https://reviews.llvm.org/D97570
This addresses ~50 clang-tidy warnings on dfsan instrumentation pass.
It also contains some refactoring (all non-functional changes) to eliminate some variables and simplify code.
Reviewed By: stephan.yichao.zhao
Differential Revision: https://reviews.llvm.org/D97714
This patch allows generating TLS variables in assembly files on AIX.
Initialized and external uninitialized variables are generated with the
.csect pseudo-op and local uninitialized variables are generated with
the .comm/.lcomm pseudo-ops. The patch also adds a check to
explicitly say that TLS is not yet supported on AIX.
Reviewed by: daltenty, jasonliu, lei, nemanjai, sfertile
Originally patched by: bsaleil
Commandeered by: NeHuang
Differential Revision: https://reviews.llvm.org/D96184
fix attempt http://reviews.llvm.org/rGbbdb4c8c9bcef0e didn't work
The problem is that the test tries to look up
llvm_orc_registerJITLoaderGDBWrapper from the llvm-jitlink.exe
executable, but the symbol wasn't exported. Just manually export it
for now. There's a FIXME with a suggestion for a real fix.
This merges more AMDGPU ABI lowering code into the generic call
lowering. Start cleaning up by factoring away more of the pack/unpack
logic into the buildCopy{To|From}Parts functions. These could use more
improvement, and the SelectionDAG versions are significantly more
complex, and we'll eventually have to emulate all of those cases too.
This is mostly NFC, but does result in some minor instruction
reordering. It also removes some of the limitations with mismatched
sizes the old code had. However, similarly to the merge on the input,
this is forcing gfx6/gfx7 to use the gfx8+ ABI (which is what we
actually want, but SelectionDAG is stuck using the weird emergent
ABI).
This also changes the load/store size for stack passed EVTs for
AArch64, which makes it consistent with the DAG behavior.
Windows is in the unique position of having two drivers, clang-cl and normal GNU clang, depending on whether a GNU or MSVC target is used. The current implementation with the USE_TOOLCHAIN argument assumes that when CMAKE_SYSTEM_NAME is set to Windows that clang-cl should be used, which is the incorrect choice when targeting a GNU environment.
This patch solves this problem by adding an optional TARGET_TRIPLE argument to llvm_ExternalProject_Add, which sets the various CMAKE_<LANG>_COMPILER_TARGET variables. Additionally, if the triple is detected as an MSVC environment, clang-cl and similar MSVC specific tools will be used instead of the GNU tools.
This fixes two bugs in `WebAssemblyExceptionInfo` grouping, created by
D97247. These two bugs are not easy to split into two different CLs,
because tests that fail for one also tend to fail for the other.
- In D97247, when fixing `ExceptionInfo` grouping by taking out
the unwind destination' exception from the unwind src's exception, we
just iterated the BBs in the function order, but this was incorrect;
this changes it to dominator tree preorder. Please refer to the
comments in the code for the reason and an example.
- After this subexception-taking-out fix, there still can be remaining
BBs we have to take out. When Exception B is taken out of Exception A
(because EHPad B is the unwind destination of EHPad A), there can
still be BBs within Exception A that are reachable from Exception B,
which also should be taken out. Please refer to the comments in the
code for more detailed explanation on why this can happen. To make
this possible, this splits `WebAssemblyException::addBlock` into two
parts: adding to a set and adding to a vector. We need to iterate on
BBs within a `WebAssemblyException` to fix this, so we add BBs to sets
first. But we add BBs to vectors later after we fix all incorrectness
because deleting BBs from vectors is expensive. I considered removing
the vector from `WebAssemblyException`, but it was not easy because
this class has to maintain a similar interface with `MachineLoop` to
be wrapped into a single interface `SortRegion`, which is used in
CFGSort.
Other misc. drive-by fixes:
- Make `WebAssemblyExceptionInfo` do not even run when wasm EH is not
used or the function doesn't have any EH pads, not to waste time
- Add `LLVM_DEBUG` lines for easy debugging
- Fix `preds` comments in cfg-stackify-eh.ll
- Fix `__cxa_throw`'s signature in cfg-stackify-eh.ll
Fixes https://github.com/emscripten-core/emscripten/issues/13554.
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97677
Even when MemorySSA-based LICM is used, an AST is still populated
for scalar promotion. As the AST has quadratic complexity, a lot
of time is spent in this step despite the existing access count
limit. This patch optimizes the identification of promotable stores.
The idea here is pretty simple: We're only interested in must-alias
mod sets of loop invariant pointers. As such, only populate the AST
with loop-invariant loads and stores (anything else is definitely
not promotable) and then discard any sets which alias with any of
the remaining, definitely non-promotable accesses.
If we promoted something, check whether this has made some other
accesses loop invariant and thus possible promotion candidates.
This is much faster in practice, because we need to perform AA
queries for O(NumPromotable^2 + NumPromotable*NumNonPromotable)
instead of O(NumTotal^2), and NumPromotable tends to be small.
Additionally, promotable accesses have loop invariant pointers,
for which AA is cheaper.
This has a signicant positive compile-time impact. We save ~1.8%
geomean on CTMark at O3, with 6% on lencod in particular and 25%
on individual files.
Conceptually, this change is NFC, but may not be so in practice,
because the AST is only an approximation, and can produce
different results depending on the order in which accesses are
added. However, there is at least no impact on the number of promotions
(licm.NumPromoted) in test-suite O3 configuration with this change.
Differential Revision: https://reviews.llvm.org/D89264
Andrei Matei reported a llvm11 core dump for his bpf program
https://bugs.llvm.org/show_bug.cgi?id=48578
The core dump happens in LiveVariables analysis phase.
#4 0x00007fce54356bb0 __restore_rt
#5 0x00007fce4d51785e llvm::LiveVariables::HandleVirtRegUse(unsigned int,
llvm::MachineBasicBlock*, llvm::MachineInstr&)
#6 0x00007fce4d519abe llvm::LiveVariables::runOnInstr(llvm::MachineInstr&,
llvm::SmallVectorImpl<unsigned int>&)
#7 0x00007fce4d519ec6 llvm::LiveVariables::runOnBlock(llvm::MachineBasicBlock*, unsigned int)
#8 0x00007fce4d51a4bf llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunction&)
The bug can be reproduced with llvm12 and latest trunk as well.
Futher analysis shows that there is a bug in BPF peephole
TRUNC elimination optimization, which tries to remove
unnecessary TRUNC operations (a <<= 32; a >>= 32).
Specifically, the compiler did wrong transformation for the
following patterns:
%1 = LDW ...
%2 = SLL_ri %1, 32
%3 = SRL_ri %2, 32
... %3 ...
%4 = SRA_ri %2, 32
... %4 ...
The current transformation did not check how many uses of %2
and did transformation like
%1 = LDW ...
... %1 ...
%4 = SRL_ri %2, 32
... %4 ...
and pseudo register %2 is used by not defined and
caused LiveVariables analysis core dump.
To fix the issue, when traversing back from SRL_ri to SLL_ri,
check to ensure SLL_ri has only one use. Otherwise, don't
do transformation.
Differential Revision: https://reviews.llvm.org/D97792
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.
Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.
Differential Revision: https://reviews.llvm.org/D97732
Instead of converting the 0 into a ZR reg during lowering, do that with
tablegen by matching the zero immediate. This when combined with other
optimizations is more likely to use ZR and helps keep the DAG more
easily optimizable. It should not otherwise effect code generation.
When a large "irregular" (e.g. i96) integer call argument is converted to
indirect, 64-bit parts are stored to the stack. The full stack space
(e.g. i128) was not allocated prior to this patch, but rather just the exact
space of the original type. This caused neighboring values on the stack to be
overwritten.
Thanks to Josh Stone for reporting this.
Review: Ulrich Weigand
Fixes https://bugs.llvm.org/show_bug.cgi?id=49322
Differential Revision: https://reviews.llvm.org/D97514
Make OMod explicit instead of implied by HasModifiers in the
operand list. Requires explicitly setting HasOMod=1 for
irregular OMod usage in instruction V_CVT_{U,I}*
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D97587
Change-Id: I230e1476f529e816eec60e242531f23a99e3839f
Currently, it was delibrately impleneted to not handle this case, but as it has turnt out, we need this feature.
The concrete use case is
`System/Library/Frameworks/Cocoa.framework/Versions/A/Cocoa` reexports
/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit , which then rexports
/System/Library/PrivateFrameworks/UIFoundation.framework/Versions/A/UIFoundation
The current implemention uses a global currentTopLevelTapi, which is not reset until it finishes loading the whole tree.
This is a problem because if the top-level is set to Cocoa, then when we get to UIFoundation, it will try to find UIFoundation in the current top level, which is Cocoa and will not find it.
The right thing should be:
- When loading a library from a TBD file, re-exports need to be looked up in the auxiliary documents within the same TBD.
- When loading from an actual dylib, no additional TBD documents need to be examined.
- In no case does a re-export mentioned in one TBD file need to be looked up in a document in an auxiliary document from a different TBD file
Differential Revision: https://reviews.llvm.org/D97438