With vector mask registers only allocatable to V0 (VMV0Regs), it is
relatively easy to generate code that uses multiple masks and naively
requires spilling.
This patch aims to improve codegen in such cases by telling LLVM it can
use VRRegs to hold masks. This will prevent spilling in many cases by
having LLVM copy to an available VR register.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97055
recognizeBSwapOrBitReverseIdiom + collectBitParts have pattern matching to bail out early if a bswap/bitreverse pattern isn't possible - we should be able to rely on this instead without any notable change in compile time.
This is part of a cleanup towards letting matchBSwapOrBitReverse / recognizeBSwapOrBitReverseIdiom use 'root' instructions that aren't ORs (FSHL/FSHRs in particular, which can be prematurely created).
Differential Revision: https://reviews.llvm.org/D97056
This is a minor pattern-match update to BPFAdjustOpt.cpp to accept
not only 'or i1 a, b' but also 'select i1 a, i1 true, i1 b'.
This resolves a regression after SimplifyCFG started creating the select
form of and/or instead (https://reviews.llvm.org/D95026).
This is a small change, and currently such a select form either isn't
created or doesn't reach the late pipeline (because InstCombine eagerly
folds it into and/or i1), so I chose to commit without a review process.
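For illustration, a minimal sketch of the broadened match using
PatternMatch (matchOrLike is a hypothetical helper, not the actual
BPFAdjustOpt code):

  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Accept both 'or i1 a, b' and its select form
  // 'select i1 a, i1 true, i1 b'.
  static bool matchOrLike(Value *V, Value *&A, Value *&B) {
    return match(V, m_Or(m_Value(A), m_Value(B))) ||
           match(V, m_Select(m_Value(A), m_One(), m_Value(B)));
  }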
For ThinLTO, PreLink ICP is skipped to favor better profile annotation during LTO PostLink. This change applies the same tweak for MonoLTO. Note that PreLink ICP not only makes PostLink profile annotation harder, it is also uncoordinated with PostLink ICP so duplicated ICP could happen.
Differential Revision: https://reviews.llvm.org/D97028
We don't currently create memory operands for these intrinsics,
but there was a suggestion of using the indexed load/store
intrinsics to implement isel for scalable vector gather/scatter.
That may propagate the memory operand from the gather/scatter
ISD nodes.
I think we can use the same logic here as for nonnull.
strlen(X) - X must be noundef => a valid pointer.
For libcalls with a size argument, we add noundef only if the size is known and greater than 0 - so the pointers must be noundef (valid ones).
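Roughly the shape of the size-gated case, as a sketch (the helper name
and argument indices are hypothetical, not the patch's code):

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/InstrTypes.h"
  using namespace llvm;

  // Only mark the pointer noundef when the size operand is a known
  // constant greater than zero.
  static void addNoUndefIfSizeNonZero(CallBase &CB, unsigned PtrArgNo,
                                      unsigned SizeArgNo) {
    if (auto *Size = dyn_cast<ConstantInt>(CB.getArgOperand(SizeArgNo)))
      if (!Size->isZero())
        CB.addParamAttr(PtrArgNo, Attribute::NoUndef);
  }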
Reviewed By: jdoerfert, aqjune
Differential Revision: https://reviews.llvm.org/D95122
If the text section name has a prefix, it already ends in a dot; don't
add a second dot when concatenating the text section name and the symbol name.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D96327
Previously we would use the extended implementation, but that requires
the vector type to be extended so that we can access the LLVMContext.
In theory we could
detect this case and use the context from the element type instead,
but since I know of no cases hitting this in practice today
I've done the simplest thing.
Also add asserts to several extended EVT functions that assume
LLVMTy is non-null.
Follow from discussion in D97036
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D97070
This commit adds the initial changes to the SystemZ target
description for the XPLINK 64-bit calling convention on z/OS.
Additions include:
- a new predicate IsTargetXPLINK64
- different register allocation order
- generation of nopr after a call
Reviewed-by: uweigand
Differential Revision: https://reviews.llvm.org/D96887
The existing implementation was relying on order of evaluation to achieve a particular result. This got really confusing when wanting to change the handling for arguments in a later patch.
The current getFoldedSizeOf() implementation uses naive recursion, which
could be really slow when the input structure type is too complex.
This issue was first brought up in
http://llvm.org/bugs/show_bug.cgi?id=8281; this change fixes it by
adding memoization.
Differential Revision: https://reviews.llvm.org/D6594
This patch fixes a bug when elfabi was supplied with a TBE file
containing no non-local symbols: it wrote 0 to the sh_info field of
the .dynsym section, making the ELF stub file invalid.
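For context, the invariant involved, as a sketch (DynSym and
NumLocalSymbols are placeholder names):

  // Per the ELF spec, sh_info of .dynsym holds the index of the first
  // non-local symbol, i.e. one past the last local. The mandatory null
  // entry at index 0 is local, so sh_info is at least 1, never 0.
  DynSym.sh_info = NumLocalSymbols; // >= 1 even with no non-local symbols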
Differential Revision: https://reviews.llvm.org/D96930
This is a fix for https://llvm.org/PR49215 that applies either before
or after we make a verifier enhancement for vector reductions with D96904.
I'm not sure what the current thinking is for pointer math/logic
in IR. We allow icmp on pointer values. Therefore, we match min/max
patterns, so without this patch, the vectorizer could form a vector
reduction from that sequence.
But the LangRef definitions for min/max and vector reduction
intrinsics do not allow pointer types:
https://llvm.org/docs/LangRef.html#llvm-smax-intrinsic
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-umax-intrinsic
So we would crash/assert at some point - either in IR verification,
in the cost model, or in codegen. If we do want to allow this kind
of transform, we will need to update the LangRef and all of those
parts of the compiler.
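The guard amounts to something like this (a sketch; the variable name
and exact placement are assumed, not the actual patch):

  // Refuse to treat an icmp/select over pointers as a min/max
  // recurrence; the min/max and reduction intrinsics are integer-only
  // per the LangRef.
  if (Sel->getType()->isPtrOrPtrVectorTy())
    return false; // not a vectorizable min/max pattern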
Differential Revision: https://reviews.llvm.org/D97047
We had more combinations of data and index lmuls than we needed.
Also add some asserts to verify that the IndexVT and data VT have
the same element count when we isel these pseudo instructions.
There are many legal combinations of index and data VTs supported
for these intrinsics. This results in a lot of isel patterns in
RISCVGenDAGISel.inc.
By adding a separate table similar to what we use for segment
load/stores, we can more efficiently manually select these
intrinsics. We should also be able to reuse this table for scalable
vector gather/scatter.
This reduces the llc binary size by ~56K.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D97033
Just like we do for isel patterns, we need to call selectVLOp
to prevent 0 from being selected to X0 by the default isel.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97021
We previously used isel patterns for this, but that used quite
a bit of space in the isel table due to OR being associative
and commutative. It also wouldn't handle shifts/ands being in
reversed order.
This generalizes the shift/and matching from GREVI to
take the expected mask table as input so we can reuse it for
SHFLI.
There is no SHFLIW instruction, but we can promote a 32-bit
SHFLI to i64 on RV64. As long as bit 4 of the control value isn't
set, a 64-bit SHFLI will preserve 33 sign bits if the input had
at least 33 sign bits. ComputeNumSignBits has been updated to
account for that to avoid sext.w in the tests.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96661
In https://reviews.llvm.org/rG5fb65c02ca5e91e7e1a00e0efdb8edc899f3e4b9,
we used a 0 count value profile to memorize which targets have been promoted
and to prevent repeated ICP for the same target, so we deleted PromotedInsns.
However, I found the implementation in that patch has some shortcomings
to be fixed, otherwise there will still be repeated ICP. So I am adding
PromotedInsns back temporarily and will remove it after I get a thorough fix.
This is to ensure that we can eliminate G_ASSERT_SEXT.
In a follow-up patch, I'm going to make CallLowering emit G_ASSERT_SEXT for
signext parameters.
Differential Revision: https://reviews.llvm.org/D96913
This enables use of MemorySSA instead of MemDep in MemCpyOpt. To
allow this without significant compile-time impact, the MemCpyOpt
pass is moved directly before DSE (in the cases where this was not
already the case), which allows us to reuse the existing MemorySSA
analysis.
Unlike the MemDep-based implementation, the MemorySSA-based MemCpyOpt
can also perform simple optimizations across basic blocks.
Differential Revision: https://reviews.llvm.org/D94376
When computing a range for a SCEVUnknown, today we use computeKnownBits for unsigned ranges, and ComputeNumSignBits for signed ranges. This means we miss opportunities to improve range results.
One common missed pattern is that we have a signed range of a value which CKB can determine is positive, but CNSB doesn't convey that information. The current range includes the negative part, and is thus double the size.
Per the removed comment, the original concern which delayed using both (after some code merging years back) was a compile time concern. CTMark results (provided by Nikita, thanks!) showed a geomean impact of about 0.1%. This doesn't seem large enough to avoid higher quality results.
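A concrete instance of the missed pattern (illustrative, not from the
patch):

  #include <cstdint>

  // For X = Y >> 1 (logical shift of a 64-bit value), known bits prove
  // X's top bit is zero, i.e. X lies in [0, 2^63). Sign-bit analysis
  // alone sees just one sign bit, which only bounds X to the full
  // signed range; intersecting both halves the signed range.
  uint64_t knownNonNegative(uint64_t Y) { return Y >> 1; }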
Differential Revision: https://reviews.llvm.org/D96534
VirtRegAuxInfo is an extensibility point, so the register allocator's
decision on which implementation to use should be communicated to the
other users - namely, LiveRangeEdit.
Differential Revision: https://reviews.llvm.org/D96898
fixed-abi uses pre-defined and predictable
SGPR/VGPRs for passing arguments. This patch makes
this scheme the default when the HSA OS is specified in the triple.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D96340
Now that all state for generated instructions is managed directly in
VPTransformState, VPCallback is no longer needed. This patch updates the
last use of `getOrCreateScalarValue` to instead manage the value
directly in VPTransformState and removes VPCallback.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D95383
Track lanes when processing definitions for marking WQM/WWM.
If all lanes have been defined then marking can stop.
This prevents marking unnecessary instructions as WQM/WWM.
In particular this fixes a bug where values passing through
V_SET_INACTIVE would be marked as requiring WWM.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D95503
In both ADCE and BDCE (via DemandedBits) we should not remove
instructions that are not guaranteed to return. This issue was
pointed out by fhahn in the recent llvm-dev thread.
Differential Revision: https://reviews.llvm.org/D96993
This moves the willReturn() helper from CallBase to Instruction,
so that it can be used in a more generic manner. This will make
it easier to fix additional passes (ADCE and BDCE), and will give
us one place to change if additional instructions should become
non-willreturn (e.g. there has been talk about handling volatile
operations this way).
I have also included the IntrinsicInst workaround directly in
here, so that it gets applied consistently. (As such this change
is not entirely NFC -- FuncAttrs will now use this as well.)
Differential Revision: https://reviews.llvm.org/D96992
This is a simpler variant of D96647. It just adds a straightforward
depth limit with a high cutoff, without introducing complex logic
for BatchAA consistency. It accepts that we may cache a sub-optimal
result if the depth limit is hit.
Eventually this should be more fully addressed by D96647 or similar,
but in the meantime this avoids stack overflows in a cheap way.
Differential Revision: https://reviews.llvm.org/D96996
This patch fixes some crashes coming from
X86ISelLowering::getSetCCResultType, which would occasionally return
an EVT constructed from an invalid MVT, which has a null Type pointer.
This patch refers to D95434.
Differential Revision: https://reviews.llvm.org/D97036
This enables AES fusion and the post-RA scheduler for the Neoverse cores,
and while we are at it, also for the A55 that we had missed earlier.
Differential Revision: https://reviews.llvm.org/D96866
As discussed on the RFC [0], I am sharing the set of patches that
enables checking of original Debug Info metadata preservation in
optimizations. The proof-of-concept/proposal can be found at [1].
The implementation from [1] was full of duplicated code,
so this set of patches tries to merge this approach into the existing
debugify utility.
For example, the utility pass in the original-debuginfo-check
mode could be invoked as follows:
$ opt -verify-debuginfo-preserve -pass-to-test sample.ll
Since this is a very early stage of the implementation,
there is room for improvements such as:
- Add support for the new pass manager
- Add support for metadata other than DILocations and DISubprograms
[0] https://groups.google.com/forum/#!msg/llvm-dev/QOyF-38YPlE/G213uiuwCAAJ
[1] https://github.com/djolertrk/llvm-di-checker
Differential Revision: https://reviews.llvm.org/D82545
The test that was failing is now forced to use the old PM.
We were creating more combinations of value and index lmul than
we needed.
I've copied the loop structure used here from VPseudoAMOEI, with
all data SEW values instead of just 32/64.
The same can be done for segment loads/stores.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D97008
As discussed in D94834, we don't really need to do complicated analysis. It's safe to just drop the tail call attribute.
Differential Revision: https://reviews.llvm.org/D96926
Intrinsic ID is a 32-bit value, which made each row of the table
4-byte aligned. The remaining fields used 5 bytes. This meant 3 bytes
of padding per row.
This patch breaks the table into 4 separate tables and indexes them
by properties we know about the intrinsic. NF, masked,
strided, ordered, etc. The indexed load/store tables have no
padding in their rows now.
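Illustrative layouts (field names approximate, not the exact TableGen
output):

  #include <cstdint>

  // Before: the 32-bit key forces 4-byte row alignment, so
  // 4 (ID) + 5 (payload) = 9 bytes pads out to 12.
  struct RowOld {
    unsigned IntrinsicID;                      // 4 bytes, the old key
    uint8_t Masked, Strided, Ordered, Log2SEW; // 4 bytes
    uint8_t LMUL;                              // +1 = 5 payload bytes
  };                                           // sizeof == 12

  // After: Masked/Strided/Ordered/etc. pick one of the four tables,
  // so a row keeps only small fields and needs no padding.
  struct RowNew {
    uint8_t Log2SEW;
    uint8_t LMUL;
    uint16_t Pseudo;                           // target opcode
  };                                           // sizeof == 4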
All together this reduces the size of llc binary by ~28K.
I'm considering adding similar tables for isel of non-segment
load/store as well to cut down the size of the isel table and
probably improve our isel performance. Those tables would need to be
indexed from intrinsics, IR loads/stores, gathers/scatters, and
RISCVISD opcodes. So having a table that can be indexed without using
intrinsic ID is more flexible.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D96894
This table is queried in RISCVMCInstLower without knowing
whether the instruction is a vector pseudo. Due to the way the
binary search works, we have to do log2(tablesize) checks just
to determine a non-vector instruction isn't in the table.
Conveniently, all the vector pseudos are pretty tightly
packed within the internal instruction enum. By enabling the
PrimaryKeyEarlyOut, tablegen will emit a check against the
beginning and end of the table before doing the binary search.
This gives a quick early out on the search for the majority
of non-vector instructions.
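In effect the generated lookup gains a range check ahead of the
search; a self-contained sketch (not the actual TableGen output):

  #include <algorithm>
  #include <vector>

  struct Entry { unsigned MCOpcode; unsigned Data; };

  // With PrimaryKeyEarlyOut, a cheap bounds check rejects most
  // non-vector opcodes before paying log2(N) comparisons.
  const Entry *lookup(const std::vector<Entry> &Table, unsigned Op) {
    if (Table.empty() || Op < Table.front().MCOpcode ||
        Op > Table.back().MCOpcode)
      return nullptr;                          // quick early out
    auto It = std::lower_bound(
        Table.begin(), Table.end(), Op,
        [](const Entry &E, unsigned O) { return E.MCOpcode < O; });
    return (It != Table.end() && It->MCOpcode == Op) ? &*It : nullptr;
  }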
Differential Revision: https://reviews.llvm.org/D97016
Found a problem in indirect call promotion in the sample loader pass. Currently
if an indirect call is promoted for a target, and if the parent function is
inlined into some other function, the indirect call can be promoted for the
same target again. That is redundant, which can harm performance and can cause
excessive compile time in some extreme cases.
The patch fixes the issue. If a target is promoted for an indirect call, the
patch will write ICP metadata with the target call count being set to 0.
In the later ICP in sample profile loader, if it sees a target has 0 count
for an indirect call, it knows the target has been promoted and won't do
indirect call promotion for the indirect call.
The fix brings 0.1~0.2% performance on our search benchmark.
Differential Revision: https://reviews.llvm.org/D96806
This patch provides two major changes:
1. Add getRelocationInfo to check if a constant will have static, dynamic, or
no relocations. (Also rename the original needsRelocation to needsDynamicRelocation.)
2. Only allow a constant with no relocations (static or dynamic) to be placed
in a mergeable section.
This will allow unused symbols that contain static relocations and happen to
fit in mergeable constant sections (.rodata.cstN) to instead be placed in
unique-named sections if -fdata-sections is used and subsequently garbage collected
by --gc-sections.
See https://lists.llvm.org/pipermail/llvm-dev/2021-February/148281.html.
Differential Revision: https://reviews.llvm.org/D95960
With CSSPGO all indirect call targets are counted towards the original indirect call site in the profile, including both inlined and non-inlined targets. Therefore there is no need to look for callee entry counts. This also fixes the issue where the callee entry count doesn't match the callsite count due to the nature of CS sampling.
I'm also cleaning up the original code that called `findIndirectCallFunctionSamples` just to compute the sum, the return value of which was discarded.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D96990
We currently always store absolute filenames in coverage mapping. This
is problematic for several reasons. It poses a problem for distributed
compilation as source location might vary across machines. We are also
duplicating the path prefix potentially wasting space.
This change modifies how we store filenames in coverage mapping. Rather
than absolute paths, it stores the compilation directory and file paths
as given to the compiler, either relative or absolute. Later when
reading the coverage mapping information, we recombine relative paths
with the working directory. This approach is similar to handling
of DW_AT_comp_dir in DWARF.
Finally, we also provide a new option, -fprofile-compilation-dir akin
to -fdebug-compilation-dir which can be used to manually override the
compilation directory which is useful in distributed compilation cases.
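The reader-side recombination is essentially (a sketch; llvm::sys::path
is the real utility, the helper name is mine):

  #include "llvm/ADT/SmallString.h"
  #include "llvm/Support/Path.h"
  #include <string>
  using namespace llvm;

  // Relative filenames are rejoined with the stored compilation
  // directory; absolute ones pass through unchanged.
  static std::string remapFilename(StringRef CompilationDir,
                                   StringRef Filename) {
    if (sys::path::is_absolute(Filename))
      return Filename.str();
    SmallString<256> Path(CompilationDir);
    sys::path::append(Path, Filename);
    return Path.str().str();
  }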
Differential Revision: https://reviews.llvm.org/D95753
AMDGPU currently has a lot of pre-processing code to pre-split
argument types into 32-bit pieces before passing it to the generic
code in handleAssignments. This is a bit sloppy and also requires some
overly fancy iterator work when building the calls. It's better if all
argument marshalling code is handled directly in
handleAssignments. This handles more situations like decomposing large
element vectors into sub-element sized pieces.
This should mostly be NFC, but does change the generated code by
shifting where the initial argument packing instructions are placed. I
think this is nicer looking, since it now emits the packing code
directly after the relevant copies, rather than after the copies for
the remaining arguments.
This doubles down on gfx6/gfx7 using the gfx8+ ABI for 16-bit
types. This is ultimately the better option, but incompatible with the
DAG. Fixing this requires more work, especially for f16.
We can always look through single-argument (LCSSA) phi nodes when
performing alias analysis. getUnderlyingObject() already does this,
but stripPointerCastsAndInvariantGroups() does not. We still look
through these phi nodes with the usual aliasPhi() logic, but
sometimes get sub-optimal results due to the restrictions on value
equivalence when looking through arbitrary phi nodes. I think it's
generally beneficial to keep the underlying object logic and the
pointer cast stripping logic in sync, insofar as it is possible.
With this patch we get marginally better results:
aa.NumMayAlias | 5010069 | 5009861
aa.NumMustAlias | 347518 | 347674
aa.NumNoAlias | 27201336 | 27201528
...
licm.NumPromoted | 1293 | 1296
I've renamed the relevant strip method to stripPointerCastsForAliasAnalysis(),
as we're past the point where we can explicitly spell out everything
that's getting stripped.
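The added look-through is essentially (a sketch mirroring what
getUnderlyingObject already does):

  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Single-argument (LCSSA) phis just forward their unique incoming
  // value, so the strip loop can safely walk through them.
  static const Value *lookThroughLCSSAPhis(const Value *V) {
    while (auto *PN = dyn_cast<PHINode>(V)) {
      if (PN->getNumIncomingValues() != 1)
        break;
      V = PN->getIncomingValue(0);
    }
    return V;
  }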
Differential Revision: https://reviews.llvm.org/D96668
If extload is legal, the following transform
(zext (select c, load1, load2)) -> (select c, zextload1, zextload2)
can save one ext instruction.
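Source-level shape of the pattern, for illustration:

  #include <cstdint>

  // Today this selects between two i8 loads and then zexts once; with
  // the fold, each load becomes a zero-extending load and the select
  // operates on the wide values, so the separate zext disappears.
  uint32_t pick(int c, const uint8_t *p, const uint8_t *q) {
    return (uint32_t)(c ? *p : *q);
  }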
Differential Revision: https://reviews.llvm.org/D95086
This patch adds functionality to compare for the equality between `InterfaceFile`s based on attributes specific to linking.
Reviewed By: cishida, steven_wu
Differential Revision: https://reviews.llvm.org/D96629
Fixes assert in: https://bugs.llvm.org/show_bug.cgi?id=49036
getWasmSection creates sections if they don't exist, but doesn't add them to the Symbols table. This may cause problems in subsequent calls to getOrCreateSymbol, which checks this table, then calls createSymbol assuming the symbol doesn't exist, which then checks UsedNames and finds out it does exist, causing an assert on trying to rename a non-temp symbol.
I also tried fixing the somewhat unintuitive forced suffixing (adding `0`), but it turns out that WasmObjectWriter currently assumes these section symbols are unique, so that may have to be a separate fix: https://bugs.llvm.org/show_bug.cgi?id=49252
Also worth noting is that getWasmSection calling createSymbol may not be correct to start with, given that createSymbol seems to assume it is creating non-section symbols. But again, for a future fix.
Related: where some of this was introduced: 8d396acac3
Differential Revision: https://reviews.llvm.org/D96473
This avoids tedious repetition and matches what we do for the
ValueTypeByHwMode uses.
Reviewed By: craig.topper, luismarques
Differential Revision: https://reviews.llvm.org/D96649
This fixes https://bugs.llvm.org/show_bug.cgi?id=49185
When `NDEBUG` is not set, `LPMUpdater` checks if the added loops have the same parent loop as the current one in `addSiblingLoops`.
If multiple loop passes are executed through `LoopPassManager`, `U.ParentL` will be the same across all passes.
However, the parent loop might change after running a loop pass, resulting in assertion failures in subsequent passes.
This patch resets `U.ParentL` after running individual loop passes in `LoopPassManager`.
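Paraphrased shape of the fix (the surrounding loop-pass iteration and
variable names are assumed; ParentL is the field named above):

  // A loop pass may reparent L, so refresh the updater's recorded
  // parent before the next pass's addSiblingLoops assertions run.
  for (auto &Pass : LoopPasses) {
    PreservedAnalyses PassPA = Pass->run(L, AM, AR, Updater);
    PA.intersect(std::move(PassPA));
  #ifndef NDEBUG
    Updater.ParentL = L.getParentLoop(); // drop the stale parent pointer
  #endif
  }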
Reviewed By: asbirlea, ychen
Differential Revision: https://reviews.llvm.org/D96727
Usually `EH_LABEL`s are placed:
- Before an `invoke` (which becomes a call in the backend)
- After an `invoke`
- At the start of an EH pad
I don't know exactly why, but I noticed there are cases with multiple
`EH_LABEL` instructions, not a single one, at the beginning of an EH
pad. In that case the `global.set` instruction placed to restore
`__stack_pointer` ended up between two `EH_LABEL` instructions before
`CATCH`. It should come after the `EH_LABEL`s and `CATCH`. This CL
fixes that case.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D96970
This uses the division-by-constant optimization to use MULHU/MULHS.
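For reference, the scalar form of the trick (a worked example of mine,
not from the patch):

  #include <cstdint>

  // x / 5 via multiply-high: 0xCCCCCCCD == ceil(2^34 / 5), so
  // (x * 0xCCCCCCCD) >> 34 == x / 5 for all 32-bit x. The vector form
  // replaces the division with VMULHU (multiply-high) plus a shift.
  uint32_t div5(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
  }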
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D96934
I've now hit several cases where a mistake in the regalloc main loop caused corrupt live intervals that didn't get caught until either the next verify or during post-optimization. The latter case is rather confusing and tends to lead one down false trails, so let's catch corruption before that.
Due to vXi64 on RV32, I've directly emitted this using _VL ISD
opcodes. If it weren't for that, we could just use fixed vector
BUILD_VECTOR and VSELECT and let those each be legalized.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96910
These should be NOPs so we can just replace with the input. This
matches what SVE does with isel patterns for all permutations.
Custom isel saves us from having to list all permutations for
all LMULs.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96921
Adjust generateFMAsInMachineCombiner to return false when SVE is present,
so that fmul+fadd is combined into fma during ISel instead. Also add new pseudo instructions
so as to select the most appropriate of FMLA/FMAD depending on register
allocation.
Depends on D96599
Differential Revision: https://reviews.llvm.org/D96424
CheckInteger uses an int64_t encoded using a variable width encoding
that is optimized for encoding a number with a lot of leading zeros.
Negative numbers have no leading zeros, so they use the largest
encoding, requiring 9 bytes.
I believe it's most likely that we want to check for positive and negative
numbers near 0. -1 is quite common due to its use in the 'not'
idiom.
To optimize for this, we can borrow an idea from the bitcode format
and move the sign bit to bit 0 with the magnitude stored in the
upper bits. This will drastically increase the number of leading
zeros for small magnitudes. Then we can run this value through
VBR encoding.
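One well-known form of that transform is the zigzag mapping; the exact
variant used may differ (a sketch):

  #include <cstdint>

  // Sign in bit 0, magnitude above it: 0 -> 0, -1 -> 1, 1 -> 2,
  // -2 -> 3, ... Small-magnitude negatives now have many leading
  // zeros and VBR-encode in one or two bytes.
  uint64_t encodeForVBR(int64_t V) {
    return (uint64_t(V) << 1) ^ uint64_t(V >> 63);
  }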
This gives a small reduction in the table size on all in tree
targets except VE, where the size increased by about 300 bytes due
to intrinsic IDs now requiring 3 bytes instead of 2. Since the
intrinsic enum space is shared by all targets, this is an unfortunate
consequence of where VE is currently located in the range.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D96317
isFMAFasterThanFMulAndFAdd should return true for FP16 types when
HasFullFP16 is present, since we have the instructions to handle it for
both SVE and NEON. (SVE patterns and tests will follow).
Differential Revision: https://reviews.llvm.org/D96599
This patch simply implements the documented UB of the current nofree attributes as specified. It doesn't try to be fancy about inference (yet), it just implements the cases already specified and inferred.
Note: When this lands, it may expose miscompiles. If so, please revert and provide a test case. It's likely the bug is in the existing inference code and without a relatively complete test case, it will be hard to debug.
Differential Revision: https://reviews.llvm.org/D96349
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
Differential Revision: https://reviews.llvm.org/D95885
This patch generates the vinsw, vinsd, vinsblx, vinshlx, vinswlx, vinsdlx,
vinsbrx, vinshrx, vinswrx and vinsdrx instructions for vector insertion on P10.
Differential Revision: https://reviews.llvm.org/D94454