The AMDGPU non-strict fdiv lowering needs to introduce an FP mode
switch in some cases, and has custom nodes to provide chain/glue for
the intermediate FP operations. We need to propagate nofpexcept here,
but getNode was dropping the flags.
Adding nofpexcept in the AMDGPU custom lowering is left to a future
patch.
Also fix a second case where flags were dropped; in that case the code
simply did not handle this number of operands.
A test will be included in a future AMDGPU patch.
Summary:
The working set size heuristics (ProfileSummaryInfo::hasHugeWorkingSetSize)
under the partial sample PGO may not be accurate because the profile is partial
and the number of hot profile counters in the ProfileSummary may not reflect the
actual working set size of the program being compiled.
To improve this, the (approximate) ratio of the number of profile counters
of the program being compiled to the number of profile counters in the partial
sample profile is computed (this is called the partial profile ratio), and the
working set size of the profile is scaled by this ratio to reflect the working
set size of the program being compiled when it is used for the working set size
heuristics.
The partial profile ratio is approximated from the number of basic blocks in
the program and the NumCounts field in the ProfileSummary, and is computed
during the thin LTO indexing. This implies the limitation that the scaled
working set size is available only to the thin LTO post-link passes.
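As a rough, self-contained illustration of the scaling (not the actual LLVM
implementation; all type, field, and function names below are hypothetical,
and the numbers are made up):

#include <cstdint>
#include <iostream>

struct PartialProfileSummary {
  uint64_t NumCounts;             // profile counters in the partial profile
  uint64_t ProfileWorkingSetSize; // working set size derived from the profile alone
};

// Approximate the partial profile ratio from the basic block count of the
// program being compiled and the counter count of the partial profile.
double computePartialProfileRatio(uint64_t NumBasicBlocksInProgram,
                                  const PartialProfileSummary &S) {
  if (S.NumCounts == 0)
    return 0.0;
  return static_cast<double>(NumBasicBlocksInProgram) / S.NumCounts;
}

// Scale the profile's working set size so it reflects the program being compiled.
uint64_t scaledWorkingSetSize(const PartialProfileSummary &S, double Ratio) {
  return static_cast<uint64_t>(S.ProfileWorkingSetSize * Ratio);
}

int main() {
  PartialProfileSummary S{1000000, 50000};              // made-up summary values
  double Ratio = computePartialProfileRatio(120000, S); // made-up block count
  std::cout << "scaled working set size: " << scaledWorkingSetSize(S, Ratio) << "\n";
}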
Reviewers: davidxl
Subscribers: mgorny, eraman, hiraditya, steven_wu, dexonsmith, arphaman, dang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79831
Summary:
While clustering mem ops, the AMDGPU target needs to consider the number of
clustered bytes to decide the maximum number of mem ops that can be clustered.
This patch adds support for passing the number of clustered bytes to the
target's mem op clustering logic.
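As a rough sketch of the idea (not the actual TargetInstrInfo hook or its
signature; the names and thresholds here are hypothetical), a clustering
decision that also respects a byte budget could look like:

#include <cstdint>

// Return true if one more mem op may join the current cluster, given both an
// op-count budget and a byte-count budget.
bool mayClusterMoreMemOps(unsigned NumClusteredOps, uint64_t NumClusteredBytes,
                          unsigned MaxClusterOps, uint64_t MaxClusterBytes) {
  return NumClusteredOps < MaxClusterOps && NumClusteredBytes <= MaxClusterBytes;
}

int main() {
  // E.g. allow at most 4 ops and 32 bytes per cluster (made-up limits).
  return mayClusterMoreMemOps(/*NumClusteredOps=*/2, /*NumClusteredBytes=*/16,
                              /*MaxClusterOps=*/4, /*MaxClusterBytes=*/32)
             ? 0
             : 1;
}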
Reviewers: foad, rampitec, arsenm, vpykhtin, javedabsar
Reviewed By: foad
Subscribers: MatzeB, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D80545
This flag (and the whole field DT_FLAGS_1) originated from Solaris. I intend to use it in an LLD patch D80872.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D80871
As discussed in https://bugs.llvm.org/show_bug.cgi?id=45951 and
D80584, the name 'tmp' is almost always a bad choice, but we have
a legacy of regression tests with that name because it was baked
into utils/update_test_checks.py.
This change makes -instnamer more consistent (it already uses "arg"
and "bb", the common LLVM shorthand), and it avoids the contradiction
of telling users of the FileCheck script to run -instnamer to create
a better regression test and then having that trigger a warning or
failure in update_test_checks.py.
This will ensure that nothing can ever start parsing data from a future
sequence and that partially read data is returned as 0 instead.
Reviewed by: aprantl, labath
Differential Revision: https://reviews.llvm.org/D80796
This is effectively reverting rGbfdc2552664d to avoid test churn
while we figure out a better way forward.
We at least salvage the warning on name conflict from that patch
though.
If we change the default string again, we may want to mass update
tests at the same time. Alternatively, we could live with the poor
naming if we change -instnamer.
This also adds a test to LLVM as suggested in the post-commit
review. There's a clang test that is also affected. That seems
like a layering violation, but I have not looked at fixing that yet.
Differential Revision: https://reviews.llvm.org/D80584
The debug_line_invalid.test test case was previously using the
interpreted line table dumping to identify which opcodes have been
parsed. This change moves to looking for the expected opcodes
explicitly. This is probably a little clearer and also allows for
testing some cases that wouldn't be easily identifiable from the
interpreted table.
Reviewed by: MaskRay
Differential Revision: https://reviews.llvm.org/D80795
For most tables, we already use commas in headers. This set of patches
unifies dumping the remaining ones.
Differential Revision: https://reviews.llvm.org/D80806
For most tables, we already use commas in headers. This set of patches
unifies dumping the remaining ones.
Differential Revision: https://reviews.llvm.org/D80806
For most tables, we already use commas in headers. This set of patches
unifies dumping the remaining ones.
Differential Revision: https://reviews.llvm.org/D80806
Partially reverts feee98645dde4be31a70cc6660d2fc4d4b9d32d8.
Add explicit braces to a different place to fix
"error: add explicit braces to avoid dangling else [-Werror,-Wdangling-else]"
This is a reimplementation of the `orderNodes` function, as the old
implementation didn't handle all cases.
The new implementation uses SCCs instead of Loops to account for
irreducible loops.
Fixes PR41509.
Differential Revision: https://reviews.llvm.org/D79037
This improves the following points for broken hash tables:
1) Use reportUniqueWarning to prevent duplication when
--hash-table and --elf-hash-histogram are used together.
2) Dump nbuckets and nchain fields. It is often possible
to dump them even when the table itself goes past the EOF etc.
Differential Revision: https://reviews.llvm.org/D80373
When a stack offset was too big to materialize in a single instruction, we were
trying to do it in stages:
adds xD, sp, #imm
adds xD, xD, #imm
Unfortunately, if xD is xzr then the second instruction doesn't exist and
wouldn't do what was needed if it did. Instead we can use a temporary register
for all but the last addition.
Relying on the find method implies a round trip to the iterator world, which is
not costless because iterator creation involves a few checks to ensure the
iterator is in a valid position (through the SmallPtrSetIteratorImpl::AdvanceIfNotValid
method). It turns out that the result of SmallPtrSetImpl::find_imp is either
valid or the EndPointer, so there's no need to go through that abstraction,
and the compiler cannot deduce this on its own.
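A stand-alone sketch of the idea (this is not SmallPtrSet itself; the type and
member names below are hypothetical):

#include <cassert>

struct TinyPtrSet {
  static constexpr unsigned Capacity = 8;
  const void *Slots[Capacity] = {};

  // One-past-the-end of the slot array, analogous to the EndPointer.
  const void *const *endPointer() const { return Slots + Capacity; }

  // Returns a pointer to the matching slot, or endPointer() if not found.
  const void *const *findSlot(const void *Ptr) const {
    for (unsigned I = 0; I < Capacity; ++I)
      if (Slots[I] == Ptr)
        return Slots + I;
    return endPointer();
  }

  // Membership test: compare the raw slot pointer against the end pointer
  // directly, with no iterator object (and none of its validity checks).
  bool contains(const void *Ptr) const { return findSlot(Ptr) != endPointer(); }
};

int main() {
  TinyPtrSet S;
  int X, Y;
  S.Slots[3] = &X;
  assert(S.contains(&X));
  assert(!S.contains(&Y));
  return 0;
}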
Differential Revision: https://reviews.llvm.org/D80708
Summary: Exploit vabsd* for the absolute difference of vectors on P9,
for example:
void foo (char *restrict p, char *restrict q, char *restrict t)
{
for (int i = 0; i < 16; i++)
t[i] = abs (p[i] - q[i]);
}
This case should be matched to the HW instruction vabsdub.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D80271
Previously we walked the users of any vector binop looking for
more binops with the same opcode or phis that eventually ended up
in a reduction. While this is simple it also means visiting the
same nodes many times since we'll do a forward walk for each
BinaryOperator in the chain. It was also far more general than what
we have tests for or expect to see.
This patch replaces the algorithm with a new method that starts at
extract elements looking for a horizontal reduction. Once we find
a reduction, we walk backwards through phis and adds to collect
leaves that we can consider for rewriting.
We only consider single-use adds and phis, except for a special
case where the Add is used by a phi that forms a loop back to the
Add; in that case we include other single-use Adds to support
unrolled loops.
Ultimately, I want to narrow the Adds, Phis, and final reduction
based on the partial reduction we're doing. I still haven't
figured out exactly what that looks like yet. But restricting
the types of graphs we expect to handle seemed like a good first
step. As does having all the leaves and the reduction at once.
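A stand-alone sketch of that backward walk (the node representation is
hypothetical, and the multi-use phi loop-back special case is omitted):

#include <vector>

enum class Kind { Add, Phi, Other };

struct Node {
  Kind K;
  std::vector<Node *> Ops; // operands
  unsigned NumUses = 1;
};

// Starting from the value feeding the horizontal reduction, walk backwards
// through single-use adds and phis; everything else becomes a leaf that is a
// candidate for rewriting.
void collectLeaves(Node *Root, std::vector<Node *> &Leaves) {
  std::vector<Node *> Worklist{Root};
  while (!Worklist.empty()) {
    Node *N = Worklist.back();
    Worklist.pop_back();
    bool IsAddOrPhi = N->K == Kind::Add || N->K == Kind::Phi;
    if (IsAddOrPhi && N->NumUses == 1) {
      for (Node *Op : N->Ops)
        Worklist.push_back(Op); // keep walking up the chain
    } else {
      Leaves.push_back(N);      // stop here and record the leaf
    }
  }
}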
Differential Revision: https://reviews.llvm.org/D79971
This matches what we do for the full sized vector ops at the start of combineX86ShufflesRecursively, and helps getFauxShuffleMask extract more INSERT_SUBVECTOR patterns.
I inverted the mask when I ported to the new form of G_PTRMASK in
8bc03d2168241f7b12265e9cd7e4eb7655709f34.
I don't think this really broke anything, since G_VASTART isn't
handled for types with an alignment higher than the stack alignment.
As discussed in PR45951:
https://bugs.llvm.org/show_bug.cgi?id=45951
There's a potential name collision between update_test_checks.py and -instnamer
and/or manually-generated IR test files because all of them try to use the
variable name that should never be used: "tmp".
This patch proposes to reduce the odds of collision and adds a warning if we
detect the problem. This will cause regression test churn when regenerating
CHECK lines on existing files.
Differential Revision: https://reviews.llvm.org/D80584
Goes with proposal in D80885.
This is adapted from the InstCombine tests that were added for D50992,
but these should be adjusted further to provide more interesting
scenarios for x86-specific codegen. E.g., vector types/sizes will
have different costs depending on ISA attributes.
We also need to add tests that include a load of the scalar
variable and add tests that include extra uses of the insert
to further exercise the cost model.
Try to prevent future node creation issues (as detailed in PR45974) by making the SelectionDAG reference const, so it can still be used for analysis, but not node creation.
As detailed on PR45974 and D79987, getFauxShuffleMask is creating nodes on the fly to create shuffles with inputs the same size as the result, causing problems for hasOneUse() checks in later simplification stages.
Currently only combineX86ShufflesRecursively benefits from these widened inputs so I've begun moving the functionality there, and out of getFauxShuffleMask. This allows us to remove the widening from VBROADCAST and *EXTEND* faux shuffle cases.
This just leaves the INSERT_SUBVECTOR case in getFauxShuffleMask still creating nodes, which will require more extensive refactoring.
In some cases ScheduleDAGRRList has to add new nodes to resolve problems
with interfering physical registers. When new nodes are added, it
completely re-computes the topological order, which can take a long
time, but is unnecessary. We only add nodes one by one, and initially
they do not have any predecessors. So we can just insert them at the end
of the vector. Later we add predecessors, but the helper function
properly updates the topological order much more efficiently. With this
change, the compile time for the program below drops from 300s to 30s on
my machine.
define i11129 @test1() {
%L1 = load i11129, i11129* undef
%B30 = ashr i11129 %L1, %L1
store i11129 %B30, i11129* undef
ret i11129 %L1
}
This should be generally beneficial, as we can skip a large amount of
work. Theoretically there are some scenarios where we might not save
much, e.g. when we add a dependency between the first and last node.
Then we would have to shift all nodes. But we still do not have to spend
the time re-computing the initial order.
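A stand-alone sketch of the append-instead-of-recompute idea (hypothetical
names, not ScheduleDAGRRList itself):

#include <vector>

struct TopoOrder {
  std::vector<unsigned> Index2Node; // position in the order -> node id
  std::vector<unsigned> Node2Index; // node id -> position in the order

  // O(1): a node that has no predecessors and no successors yet can be placed
  // last without invalidating the existing topological order; edges added
  // afterwards are handled by the incremental update path.
  void addNewNodeAtEnd(unsigned NodeId) {
    if (Node2Index.size() <= NodeId)
      Node2Index.resize(NodeId + 1);
    Node2Index[NodeId] = Index2Node.size();
    Index2Node.push_back(NodeId);
  }
};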
Reviewers: MatzeB, atrick, efriedma, niravd, paquette
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D59722