This CL is not big but contains changes that span multiple analyses and
passes. This description is very long because it tries to explain basics
on what each pass/analysis does and why we need this change on top of
that. Please feel free to skip parts that are not necessary for your
understanding.
---
`WasmEHFuncInfo` contains the mapping of <EH pad, the EH pad's next
unwind destination>. The value (unwind dest) here is where an exception
should end up when it is not caught by the key (EH pad). We record this
info in WasmEHPrepare to fix catch mismatches, because the CFG itself
does not have this info. A CFG only contains BBs and
predecessor-successor relationship between them, but in `WasmEHFuncInfo`
the unwind destination BB is not necessarily a successor or the key EH
pad BB. Their relationship can be intuitively explained by this C++ code
snippet:
```
try {
try {
foo();
} catch (int) { // EH pad
...
}
} catch (...) { // unwind destination
}
```
So when `foo()` throws, it goes to `catch (int)` first. But if it is not
caught by it, it ends up in the next unwind destination `catch (...)`.
This unwind destination is what you see in `catchswitch`'s
`unwind label %bb` part.
---
`WebAssemblyExceptionInfo` groups exceptions so that they can be sorted
continuously together in CFGSort, as we do for loops. What this analysis
does is very simple: it creates a single `WebAssemblyException` per EH
pad, and all BBs that are dominated by that EH pad are included in this
exception. We also identify subexception relationship in this way: if
EHPad A domiantes EHPad B, EHPad B's exception is a subexception of
EHPad A's exception.
This simple rule turns out to be incorrect in some cases. In
`WasmEHFuncInfo`, if EHPad A's unwind destination is EHPad B, it means
semantically EHPad B should not be included in EHPad A's exception,
because it does not make sense to rethrow/delegate to an inner scope.
This is what happened in CFGStackify as a result of this:
```
try
try
catch
... <- %dest_bb is among here!
end
delegate %dest_bb
```
So this patch adds a phase in `WebAssemblyExceptionInfo::recalculate` to
make sure excptions' unwind destinations are not subexceptions of
their unwind sources in `WasmEHFuncInfo`.
But this alone does not prevent `dest_bb` in the example above from
being sorted within the inner `catch`'s exception, even if its exception
is not a subexception of that `catch`'s exception anymore, because of
how CFGSort works, which will be explained below.
---
CFGSort places BBs within the same `SortRegion` (loop or exception)
continuously together so they can be demarcated with `loop`-`end_loop`
or `catch`-`end_try` in CFGStackify.
`SortRegion` is a wrapper for one of `MachineLoop` or
`WebAssemblyException`. `SortRegionInfo` already does some complicated
things because there discrepancies between those two data structures.
`WebAssemblyException` is what we control, and it is defined as an EH
pad as its header and BBs dominated by the header as its BBs (with a
newly added exception of unwind destinations explained in the previous
paragraph). But `MachineLoop` is an LLVM data structure and uses the
standard loop detection algorithm. So by the algorithm, BBs that are 1.
dominated by the loop header and 2. have a path back to its header.
Because of the second condition, many BBs that are dominated by the loop
header are not included in the loop. So BBs that contain `return` or
branches to outside of the loop are not technically included in
`MachineLoop`, but they can be sorted together with the loop with no
problem.
Maybe to relax the condition, in CFGSort, when we are in a `SortRegion`
we allow sorting of not only BBs that belong to the current innermost
region but also BBs that are by the current region header.
(This was written this way from the first version written by Dan, when
only loops existed.) But now, we have cases in exceptions when EHPad B
is the unwind destination for EHPad A, even if EHPad B is dominated by
EHPad A it should not be included in EHPad A's exception, and should not
be sorted within EHPad A.
One way to make things work, at least correctly, is change `dominates`
condition to `contains` condition for `SortRegion` when sorting BBs, but
this will change compilation results for existing non-EH code and I
can't be sure it will not degrade performance or code size. I think it
will degrade performance because it will force many BBs dominated by a
loop, which don't have the path back to the header, to be placed after
the loop and it will likely to create more branches and blocks.
So this does a little hacky check when adding BBs to `Preferred` list:
(`Preferred` list is a ready list. CFGSort maintains ready list in two
priority queues: `Preferred` and `Ready`. I'm not very sure why, but it
was written that way from the beginning. BBs are first added to
`Preferred` list and then some of them are pushed to `Ready` list, so
here we only need to guard condition for `Preferred` list.)
When adding a BB to `Preferred` list, we check if that BB is an unwind
destination of another BB. To do this, this adds the reverse mapping,
`UnwindDestToSrc`, and getter methods to `WasmEHFuncInfo`. And if the BB
is an unwind destination, it checks if the current stack of regions
(`Entries`) contains its source BB by traversing the stack backwards. If
we find its unwind source in there, we add the BB to its `Deferred`
list, to make sure that unwind destination BB is added to `Preferred`
list only after that region with the unwind source BB is sorted and
popped from the stack.
---
This does not contain a new test that crashes because of this bug, but
this fix changes the result for one of existing test case. This test
case didn't crash because it fortunately didn't contain `delegate` to
the incorrectly placed unwind destination BB.
Fixes https://github.com/emscripten-core/emscripten/issues/13514.
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97247
In every catchpad except `catch (...)`, we add a call to
`_Unwind_CallPersonality`, which is a wapper to call the personality
function. (In most of other Itanium-based architectures the call is done
from libunwind, but in wasm we don't have the control over the VM.)
Because the personatlity function is called to figure out whether the
current exception is a type we should catch, such as `int` or
`SomeClass&`, `catch (...)` does not need the personality function call.
For the same reason, all cleanuppads don't need it.
When we call `_Unwind_CallPersonality`, we store some necessary info in
a data structure called `__wasm_lpad_context` of type
`_Unwind_LandingPadContext`, which is defined in the wasm's port of
libunwind in Emscripten. Also the personality wrapper function returns
some info (selector and the caught pointer) in that data structure, so
it is used as a medium for communication.
One of the info we need to store is the address for LSDA info for the
current function. `wasm.lsda()` intrinsic returns that address. (This
intrinsic will be lowered to a symbol that points to the LSDA address.)
The simpliest thing is call `wasm.lsda()` every time we need to call
`_Unwind_CallPersonality` and store that info in `__wasm_lpad_context`
data structure. But we tried to be better than that (D77423 and some
more previous CLs), so if catchpad A dominates catchpad B and catchpad A
is not `catch (...)`, we didn't insert `wasm.lsda()` call in catchpad B,
thinking that the LSDA address is the same for a single function and we
already visited catchpad A and `__wasm_lpad_context.lsda` field would
already have that value.
But this can be incorrect if there is a call to another function, which
also can have the personality function and LSDA, between catchpad A and
catchpad B, because `__wasm_lpad_context` is a globally defined
structure and the callee function will overwrite its `lsda` field.
So in this CL we don't try to do any optimizaions on adding
`wasm.lsda()` call; we store the result of `wasm.lsda()` every time we
call `_Unwind_CallPersonality`. We can do some complicated analysis,
like checking if there is a function call between the dominating
catchpad and the current catchpad, but at this time it seems overkill.
This deletes three tests because they all tested `wasm.ldsa()` call
optimization.
Fixes https://github.com/emscripten-core/emscripten/issues/13548.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D97309
Under certain (currently unknown) conditions, llvm-profdata is outputting
profiles that have two consecutive entries in the MemOPSize section for the
value 0. This causes the PGOMemOPSizeOpt pass to output an invalid switch
instruction with two cases for 0. As mentioned, we’re not quite sure what’s
causing this to happen, but this patch prevents llvm-profdata from outputting a
profile that has this problem and gives an error with a request for a
reproducible.
Differential Revision: https://reviews.llvm.org/D92074
This is used to lower UDOT/SDOT instructions, as opposed to relying on
the intrinsic. Subsequent optimizations will be able to optimize them
more cleanly based on these nodes.
Pulled out from D90479 - this recognises invalid nsw shl patterns with signbit changes that result in poison.
Differential Revision: https://reviews.llvm.org/D97305
This is to limit compile time. I did experiments with some
inputs and found that compile time keeps reasonable for this
pass if we have less than 100000 virtual registers and then
starts to explode somewhere between 100000 and 150000.
Differential Revision: https://reviews.llvm.org/D97218
This code creates 3 setccs that need to be expanded. It was
creating a sign bit test as setge X, 0 which is non-canonical.
Canonical would be setgt X, -1. This misses the special case in
IntegerExpandSetCCOperands for sign bit tests that assumes
canonical form. If we don't hit this special case we end up
with a multipart setcc instead of just checking the sign of
the high part.
To fix this I've reversed the polarity of all of the setccs to
setlt X, 0 which is canonical. The rest of the logic should
still work. This seems to produce better code on RISCV which
lacks a setgt instruction.
This probably still isn't the best code sequence we could use here.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97181
The Linux kernel when built with CONFIG_THUMB2_KERNEL makes use of these
instructions with immediate operands and wide encodings.
These are the T4 variants of the follow sections from the Arm ARM.
F5.1.72 LDR (immediate)
F5.1.229 STR (immediate)
I wasn't able to represent these simple aliases using t2InstAlias due to
the Constraints on the non-suffixed existing instructions, which results
in some manual parsing logic needing to be added.
F1.2 Standard assembler syntax fields
describes the use of the .w (wide) vs .n (narrow) encoding suffix.
Link: https://bugs.llvm.org/show_bug.cgi?id=49118
Link: https://github.com/ClangBuiltLinux/linux/issues/1296
Reported-by: Stefan Agner <stefan@agner.ch>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D96632
This reverts commit 4efa097eb4c87d7ffe09a95a5b4ff372bdddda85, because
some the compilers used for some bots do not support automatic
conversions to PointerUnion.
Generalize the return value of tryToCreateWidenRecipe to return either a
newly create recipe or an existing VPValue. Use this to avoid creating
unnecessary VPBlendRecipes.
Fixes PR44800.
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.
Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.
Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.
GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.
Alternative solution considered and rejected was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because the definition of
those SD operations works is don't-care on the most significant 8 bits,
and this fact is used in some combines via SimplifyDemandedBits.
Based on a patch by Nicolai Hähnle.
Differential Revision: https://reviews.llvm.org/D97063
The new intrinsic replaces the size in one specified AsyncFunctionPointer with
the size in another. This ability is necessary for functions which merely
forward to async functions such as those defined for partial applications.
Reviewed By: aschwaighofer
Differential Revision: https://reviews.llvm.org/D97229
Change some test cases to use divergent addresses for vector loads,
which should be the common case in real world code. Using uniform
addresses causes poor instruction selection for the surrounding
code which has to be fixed up post-register-allocation, and this causes
a lot of testsuite churn for a forthcoming patch to stop selecting
24-bit vector multiply instructions for uniform multiplies.
This shows up some problems in the idot tests where we fail to select
v_dot instructions because the patterns only match MUL_[UI]24 ISD nodes,
but the DAG contains i16 mul nodes instead.
Differential Revision: https://reviews.llvm.org/D97062
Early versions of the ARMv7 reference manuals considered the sp register
as a deprecated register for ldm/stm familiy of instructions. However,
later versions such as ARM DDI 0406C.d added a note to the Appendix:
D9.3 Use of the SP as a general-purpose register
Most ARM instructions, unlike Thumb instructions, provide exactly the
same access to the SP as to R0-R12. This means that it is possible to
use the SP as a general-purpose register. Earlier issues of this manual
deprecated the use of SP in an ARM instruction, in any way that is
deprecated, not permitted, or not possible in the corresponding
Thumb instruction. However, user feedback indicates a number of cases
where these instructions are useful. Therefore, ARM no longer deprecates
these instruction uses.
Also Armv8 manuals no longer consider SP as deprecated register for ldm/
stm A32 instructions.
Furthermore, GNU as also does not print a deprecated warning when using
SP with those instructions.
Drop deprecation warning for pop/ldm/push/stm instructions.
Patch by: Stefan Agner.
Differential Revision: https://reviews.llvm.org/D82692
As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.
This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.
Differential Revision: https://reviews.llvm.org/D96287
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.
Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.
Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf36af5c88c working. ARM removed the fix in
dfac521da1b90db683, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
The **IsGuaranteedLoopInvariant** function is making sure to check if the
incoming pointer is guaranteed to be loop invariant, therefore I think
the case where the pointer is defined in the entry block of a function
automatically guarantees the pointer to be loop invariant, as the entry
block of a function cannot have predecessors or be part of a loop.
I implemented this small patch and tested it using
**ninja check-llvm-unit** and **ninja check-llvm**. I added a contained test
file that shows the problem and used **opt -O3 -debug** on it to make sure
the case is not currently handled (in fact the debug log is showing that
the DSE pass is bailing out when testing if the killer store is able to
clobber the dead store).
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D96979
If resulting size of the output stream is already known,
then the space for stream data could be preliminary
allocated in some cases. f.e. raw_string_ostream could
preallocate the space for the target string(it allows
to avoid reallocations during writing into the stream).
Differential Revision: https://reviews.llvm.org/D91693
This patch extends the support for RVV INSERT_SUBVECTOR to cover those
which don't align to a vector register boundary. Like the support for
EXTRACT_SUBVECTOR in D96959, it accomplishes this by extracting the
nearest register-sized subvector (a subregister operation), then sliding
the vector down with VSLIDEDOWN, inserting the subvector to the first
position, and sliding the vector back up again afterwards.
Unlike subvector extraction, for vectors that occupy less than a full
vector register we must preserve the untouched elements. We do this by
lowering to an LMUL=1 INSERT_SUBVECTOR using the above method and
lowering that to a VSLIDEUP with a zero offset. This uses a
tail-undisturbed policy and so has the effect of "sliding in" the
subvector elements while preserving the surrounding ones.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96972
This is a patch to explicitly mark the size parameter of allocator functions like malloc/realloc/... as noundef.
For C/C++: undef can be created from reading an uninitialized variable or padding.
Calling a function with uninitialized variable is already UB.
Calling malloc with padding value is.. something that's not expected. Padding bits may appear in a coerced aggregate, which doesn't apply to malloc's size.
Therefore, malloc's size can be marked as noundef.
For transformations that introduce malloc/realloc/..: I ran LLVM unit tests with an updated Alive2 semantics, and found no regression, so it seems okay.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97045
These verify calls are causing a lot of slowdown on some files, up to 8x.
The LazyCallGraph infra has been tested a lot over the years, so I'm fairly confident that we don't always need to run the verifys.
These verifies took >90% of total time in one of the compilations I looked at.
Reviewed By: thakis
Differential Revision: https://reviews.llvm.org/D97225