This removes some includes/forward-declarations that don't seem to be necessary in the MCA core headers
Based off a cppclean report
Differential Revision: https://reviews.llvm.org/D77073
Having the sync script work for this file seems better
than matching the order of headers in the cmake file.
Also, not having to manually sort the list is nice, even
if gn's automated sorting doesn't quite match the artisanal
order in the cmake file.
This adds MVE vmull patterns, which are conceptually the same as
mul(vmovl, vmovl), and so the tablegen patterns follow the same
structure.
For i8 and i16 this is simple enough, but in the i32 version the
multiply (in 64bits) is illegal, meaning we need to catch the pattern
earlier in a dag fold. Because bitcasts are involved in the zext
versions and the patterns are a little different in little and big
endian. I have only added little endian support in this patch.
Differential Revision: https://reviews.llvm.org/D76740
The unpredictable/hasSideEffects flag is usually inferred by tablegen
from whether the instruction has a tablegen pattern (and that pattern
only has a single output instruction). Now that the MVE intrinsics are
all committed and producing code, the remaining instructions still
marked as unpredictable need to be specially handled. This adds the flag
directly to instructions that need it, notably the V*MLAL instructions
and some of the MOV's.
Differential Revision: https://reviews.llvm.org/D76910
Summary:
In the patch: https://reviews.llvm.org/D42654
De-duplicate utils/update_{llc_,}test_checks.py, Some common part has
been move to common.py. The SCRUB_LOOP_COMMENT_RE has been moved to
common.py, but forgetting to remove from asm.py.
This patch is to remove the redundant SCRUB_LOOP_COMMENT_RE in asm.py
and use common.SCRUB_LOOP_COMMENT_RE.
Summary:
This allows doing `memcmp(p, q, 7)` with 2 loads instead of a call to
memcmp.
This fixes part of PR45147.
Reviewers: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76133
As pointed out by @thakis, currently CallSiteSplitting bails out after
checking the first PHI node. We should check all PHI nodes, until we
find one where call site splitting is beneficial.
This patch also slightly simplifies the code using BasicBlock::phis().
Reviewers: davidxl, junbuml, thakis
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D77089
When compiling AMDGPUDisassembler.cpp in a stage 1 trunk build with
CMAKE_BUILD_TYPE=RelWithDebInfo LLVM_USE_SANITIZER=Address LiveDebugVariables
accounts for 21.5% wall clock time. This fix reduces that to 1.2% by switching
out a linked list lookup with a map lookup.
Note that the linked list is still used to group UserValues by vreg. The vreg
lookups don't cause any problems in this pathological case.
This is the same idea as D68816, which was reverted, except that it is a less
intrusive fix.
Reviewed By: vsk
Differential Revision: https://reviews.llvm.org/D77226
Different file formats have different naming style for the debug
sections. The method is implemented for ELF, COFF and Mach-O formats.
Differential Revision: https://reviews.llvm.org/D76276
This is a cleanup and normalization patch that also enables reuse with
Flang later on. A follow up will clean up and move the directive ->
clauses mapping.
Differential Revision: https://reviews.llvm.org/D77112
Summary:
As noted in documentation, different repetition modes have different trade-offs:
> .. option:: -repetition-mode=[duplicate|loop]
>
> Specify the repetition mode. `duplicate` will create a large, straight line
> basic block with `num-repetitions` copies of the snippet. `loop` will wrap
> the snippet in a loop which will be run `num-repetitions` times. The `loop`
> mode tends to better hide the effects of the CPU frontend on architectures
> that cache decoded instructions, but consumes a register for counting
> iterations.
Indeed. Example:
>>! In D74156#1873657, @lebedev.ri wrote:
> At least for `CMOV`, i'm seeing wildly different results
> | | Latency | RThroughput |
> | duplicate | 1 | 0.8 |
> | loop | 2 | 0.6 |
> where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).
This isn't great for analysis, at least for schedule model development.
As discussed in excruciating detail in
>>! In D74156#1924514, @gchatelet wrote:
>>>! In D74156#1920632, @lebedev.ri wrote:
>> ... did that explanation of the question i'm having made any sense?
>
> Thx for digging in the conversation !
> Ok it makes more sense now.
>
> I discussed it a bit with @courbet:
> - We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode.
> - We'd like to still be able to select either repetition mode to dig into special cases
>
> So we could add a third `min` repetition mode that would run both and take the minimum. It could be the default option.
> Would you have some time to look what it would take to add this third mode?
there appears to be an agreement that it is indeed sub-par,
and that we should provide an optional, measurement (not analysis!) -time
way to rectify the situation.
However, the solutions isn't entirely straight-forward.
We can just add an actual 'multiplexer' `MinSnippetRepetitor`, because
if we just concatenate snippets produced by `DuplicateSnippetRepetitor`
and `LoopSnippetRepetitor` and run+measure that, the measurement will
naturally be different from what we'd get by running+measuring
them separately and taking the min.
([[ https://www.wolframalpha.com/input/?i=%28x%2By%29%2F2+%21%3D+min%28x%2C+y%29 | `time(D+L)/2 != min(time(D), time(L))` ]])
Also, it seems best to me to have a single snippet instead of generating
a snippet per repetition mode, since the only difference here is that the
loop repetition mode reserves one register for loop counter.
As far as i can tell, we can either teach `BenchmarkRunner::runConfiguration()`
to produce a single report given multiple repetitors (as in the patch),
or do that one layer higher - don't modify `BenchmarkRunner::runConfiguration()`,
produce multiple reports, don't actually print each one, but aggregate them somehow
and only print the final one.
Initially i've gone ahead with the latter approach, but it didn't look like a natural fit;
the former (as in the diff) does seem like a better fit to me.
There's also a question of the test coverage. It sure currently does work here:
```
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8fb949.o
---
mode: inverse_throughput
key:
instructions:
- 'CMOV64rr RAX RAX R11 i_0x0'
- 'CMOV64rr RBP RBP R15 i_0x0'
- 'CMOV64rr RBX RBX RBX i_0x0'
- 'CMOV64rr RCX RCX RBX i_0x0'
- 'CMOV64rr RDI RDI R10 i_0x0'
- 'CMOV64rr RDX RDX RAX i_0x0'
- 'CMOV64rr RSI RSI RAX i_0x0'
- 'CMOV64rr R8 R8 R8 i_0x0'
- 'CMOV64rr R9 R9 RDX i_0x0'
- 'CMOV64rr R10 R10 RBX i_0x0'
- 'CMOV64rr R11 R11 R14 i_0x0'
- 'CMOV64rr R12 R12 R9 i_0x0'
- 'CMOV64rr R13 R13 R12 i_0x0'
- 'CMOV64rr R14 R14 R15 i_0x0'
- 'CMOV64rr R15 R15 R13 i_0x0'
config: ''
register_initial_values:
- 'RAX=0x0'
- 'R11=0x0'
- 'EFLAGS=0x0'
- 'RBP=0x0'
- 'R15=0x0'
- 'RBX=0x0'
- 'RCX=0x0'
- 'RDI=0x0'
- 'R10=0x0'
- 'RDX=0x0'
- 'RSI=0x0'
- 'R8=0x0'
- 'R9=0x0'
- 'R14=0x0'
- 'R12=0x0'
- 'R13=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: inverse_throughput, value: 0.819, per_snippet_value: 12.285 }
error: ''
info: instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BF000000000000000048BB000000000000000048B9000000000000000048BF000000000000000049BA000000000000000048BA000000000000000048BE000000000000000049B8000000000000000049B9000000000000000049BE000000000000000049BC000000000000000049BD0000000000000000490F40C3490F40EF480F40DB480F40CB490F40FA480F40D0480F40F04D0F40C04C0F40CA4C0F40D34D0F40DE4D0F40E14D0F40EC4D0F40F74D0F40FD490F40C35B415C415D415E415F5DC3
...
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-051eb3.o
---
mode: inverse_throughput
key:
instructions:
- 'CMOV64rr RAX RAX R11 i_0x0'
- 'CMOV64rr RBP RBP RSI i_0x0'
- 'CMOV64rr RBX RBX R9 i_0x0'
- 'CMOV64rr RCX RCX RSI i_0x0'
- 'CMOV64rr RDI RDI RBP i_0x0'
- 'CMOV64rr RDX RDX R9 i_0x0'
- 'CMOV64rr RSI RSI RDI i_0x0'
- 'CMOV64rr R9 R9 R12 i_0x0'
- 'CMOV64rr R10 R10 R11 i_0x0'
- 'CMOV64rr R11 R11 R9 i_0x0'
- 'CMOV64rr R12 R12 RBP i_0x0'
- 'CMOV64rr R13 R13 RSI i_0x0'
- 'CMOV64rr R14 R14 R14 i_0x0'
- 'CMOV64rr R15 R15 R10 i_0x0'
config: ''
register_initial_values:
- 'RAX=0x0'
- 'R11=0x0'
- 'EFLAGS=0x0'
- 'RBP=0x0'
- 'RSI=0x0'
- 'RBX=0x0'
- 'R9=0x0'
- 'RCX=0x0'
- 'RDI=0x0'
- 'RDX=0x0'
- 'R12=0x0'
- 'R10=0x0'
- 'R13=0x0'
- 'R14=0x0'
- 'R15=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: inverse_throughput, value: 0.6083, per_snippet_value: 8.5162 }
error: ''
info: instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000048BE000000000000000048BB000000000000000049B9000000000000000048B9000000000000000048BF000000000000000048BA000000000000000049BC000000000000000049BA000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3480F40EE490F40D9480F40CE480F40FD490F40D1480F40F74D0F40CC4D0F40D34D0F40D94C0F40E54C0F40EE4D0F40F64D0F40FA4983C0FF75C25B415C415D415E415F5DC3
...
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=min
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c7a47d.o
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2581f1.o
---
mode: inverse_throughput
key:
instructions:
- 'CMOV64rr RAX RAX R11 i_0x0'
- 'CMOV64rr RBP RBP R10 i_0x0'
- 'CMOV64rr RBX RBX R10 i_0x0'
- 'CMOV64rr RCX RCX RDX i_0x0'
- 'CMOV64rr RDI RDI RAX i_0x0'
- 'CMOV64rr RDX RDX R9 i_0x0'
- 'CMOV64rr RSI RSI RAX i_0x0'
- 'CMOV64rr R9 R9 RBX i_0x0'
- 'CMOV64rr R10 R10 R12 i_0x0'
- 'CMOV64rr R11 R11 RDI i_0x0'
- 'CMOV64rr R12 R12 RDI i_0x0'
- 'CMOV64rr R13 R13 RDI i_0x0'
- 'CMOV64rr R14 R14 R9 i_0x0'
- 'CMOV64rr R15 R15 RBP i_0x0'
config: ''
register_initial_values:
- 'RAX=0x0'
- 'R11=0x0'
- 'EFLAGS=0x0'
- 'RBP=0x0'
- 'R10=0x0'
- 'RBX=0x0'
- 'RCX=0x0'
- 'RDX=0x0'
- 'RDI=0x0'
- 'R9=0x0'
- 'RSI=0x0'
- 'R12=0x0'
- 'R13=0x0'
- 'R14=0x0'
- 'R15=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: inverse_throughput, value: 0.6073, per_snippet_value: 8.5022 }
error: ''
info: instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF0000000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD490F40C3490F40EA5B415C415D415E415F5DC35541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD4983C0FF75C25B415C415D415E415F5DC3
...
```
but i open to suggestions as to how test that.
I also have gone with the suggestion to default to this new mode.
This was irking me for some time, so i'm happy to finally see progress here.
Looking forward to feedback.
Reviewers: courbet, gchatelet
Reviewed By: courbet, gchatelet
Subscribers: mstojanovic, RKSimon, llvm-commits, courbet, gchatelet
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76921
It was added by D76591 for migration purposes (not all
printBranchOperand users have migrated to the overload with `uint64_t Address`).
Now that all have been migrated, the parameter can go away.
The requirement for deopt parameter to be in gc parameter if it can
be modified by GC is very strong and difficult to follow.
The key example of why this can't work:
%p1 = bitcast i8* %p to i8*
statepoint [gc = (%p1)], [deopt = (%p1)]
The optimizer is allowed to replace either use (or both) of %p1 with %p.
If it updates only one of the two (entirely legal), the two sets do not overlap.
So this change removes the strong wording.
Reviewers: reames, dantrushin
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D77122
Summary:
cmake fails with an error when attempting to evaluate $<TARGET_FILE:tgt>
where `tgt` is defined via an `add_custom_target` and thus the `TYPE`
is `UTILITY`. Requesting a TARGET_FILE only works on an `EXECUTABLE`
or one of a few differetnt types of `X_LIBRARY` (e.g. added via
`add_library` or `add_executable`). The logic as implemented in cmake
is below:
enum TargetType
{
EXECUTABLE,
STATIC_LIBRARY,
SHARED_LIBRARY,
MODULE_LIBRARY,
OBJECT_LIBRARY,
UTILITY,
GLOBAL_TARGET,
INTERFACE_LIBRARY,
UNKNOWN_LIBRARY
};
if (target->GetType() >= cmStateEnums::OBJECT_LIBRARY &&
target->GetType() != cmStateEnums::UNKNOWN_LIBRARY) {
::reportError(context, content->GetOriginalExpression(),
"Target \"" + name +
"\" is not an executable or library.");
return nullptr;
}
This has always been the case back to at least 3.12 (furthest I
checked) but this is causing a new failure in cmake 3.17 while
evaluating ExternalProjectAdd.
Subscribers: mgorny, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77284
There was a TODO in genericValueTraversal to provide the context
instruction and due to the lack of it users that wanted one just used
something available. Unfortunately, using a fixed instruction is wrong
in the presence of PHIs so we need to update the context instruction
properly.
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D76870
This cannot be triggered right now, as far as I know, but it doesn't
make sense to deduce a constant range on arguments of declarations.
Exposed during testing of AAValueSimplify extensions.
While D68850 allowed functions to be deleted I accidentally saved some
version of the function to be used once a suitable prefix was found.
This turned out to be problematic when the occasionally deleted function
is also occasionally modified. The test case is adjusted to resemble the
case in which the problem was found.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76586
Use DL & ABI information for better alignment deduction, e.g., if a type
is accessed and the ABI specifies an alignment requirement for such an
access we can use it. This is based on a patch by @lebedev.ri and
inspired by getBaseAlign in Loads.cpp.
Depends on D76673.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76674
If we have a must-tail call the callee and caller need to have matching
ABIs. Part of that is alignment which we might modify when we deduce
alignment of arguments of either. Since we would need to keep them in
sync, which is not as simple, we simply avoid deducing alignment for
arguments of the must-tail caller or callee.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D76673
Failure to export __cxa_atexit can lead to an attempt to import a definition
from the process itself (if __cxa_atexit is referenced from another JITDylib),
but the process definition will clash with the existing non-exported definition
to produce an unexpected DuplicateDefinitionError.
This patch fixes the immediate issue by exporting __cxa_atexit. It also fixes a
bug where atexit functions in other JITDylibs were not being run by adding a
copy of run_atexits_helper to every JITDylib.
A follow up patch will deal with the bug where definition generators are called
despite a non-exported definition being present.
We create a lot of AbstractAttributes and they live as long as
the Attributor does. It seems reasonable to allocate them via a
BumpPtrAllocator owned by the Attributor.
Reviewed By: baziotis
Differential Revision: https://reviews.llvm.org/D76589
We already mention that `noalias` is modeled after the C99 `restrict`
qualifier but we did omit one important requirement in the description.
For the restrict guarantees the object affected has to be modified
during the execution of the function, in any way (see 6.7.3.1.4 in [0]).
There are two reasons we want this restriction as well:
1) To match the `restrict` semantics when we lower it to `noalias`.
2) To allow the reasoning that the object pointed to by a `noalias`
pointer is not modified through means not derived from this
pointer. Hence, following the uses of that pointer is sufficient
to determine potential modifications.
The discussion on this came up as part of D73428. In that patch the
Attributor is taught to derive `noalias` for call site arguments based
on alias queries against objects that are accessed in the callee. This
is possible even if the pointer passed at the call site was "not-`noalias`".
To simplify the logic there *and* to allow the use of `noalias` as
described in 2) above, it is beneficial to follow the C `restrict`
semantics in cases where there might be "read-read-aliases". Note that
AliasAnalysis* queries for read only objects already result in
`NoAlias` even if the pointers might "alias".
* From this point of view our Alias Analysis is basically a Dependence
Analysis.
[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D74935
This means the linker will be expect them be undefined at link time an
will generate imports from the `env` module rather than reporting
undefined externals.
Differential Revision: https://reviews.llvm.org/D77192
The legalizer has a tendency to lose DebugLoc's when expanding or
combining instructions. The verifier that detected these isn't ready
for upstreaming yet but this patch fixes the cases that came up when
applying it to our out-of-tree backend's CodeGen tests.
This pattern comes up a few more times in this file and probably in
the backends too but I'd prefer to fix the others separately (and
preferably when the lost-location verifier detects them).
The MemoryBuffer::getMemBuffer method's RequiresNullTerminator parameter
defaults to true, but object files are not null terminated so we need to
explicitly pass false here.
Negation is equivalent to bitwise-not + 1, so try to convert more
subtracts into adds using this relationship:
0 - (A ^ C) => ((A ^ C) ^ -1) + 1 => A ^ ~C + 1
I doubt this will recover the regression noted in rGf2fbdf76d8d0,
but seems like we're going to need to improve here and/or revive D68408?
Alive2 proofs:
http://volta.cs.utah.edu:8080/z/Re5tMUhttp://volta.cs.utah.edu:8080/z/An-uns
Differential Revision: https://reviews.llvm.org/D77230
find() was altering the UserChain, even in cases where it subsequently
discovered that the resulting constant was a 0. This confuses
rebuildWithoutConstOffset() when it attempts to walk the chain later, since it
is expected that the chain itself be a path down the use-def edges of an
expression.
Make the attributor pass aware of aligned_alloc for converting heap
allocations to stack ones.
Depends on D76971.
Differential Revision: https://reviews.llvm.org/D76974