As commented by @craig.topper on rG1ba5c550d418, we can't guarantee that we'll be extending zero bits, just the sign bit. So, revert to the old code for the zero_extend_vector_inreg cases.
This adds an extra pattern for inserting an f16 into an odd vector lane
via a VINS. If the dual-insert-lane pattern does not happen to apply,
this can help with some simple cases.
Differential Revision: https://reviews.llvm.org/D95471
The reason for generating the `mv a0, a0` instruction is that when the stack object offset is too large to fit in a signed 12-bit immediate, the eliminateFrameIndex function
creates a virtual register, which the register scavenger then needs to scavenge. If the machine instruction that contains the stack object has the ADDI opcode (the ADDI
was generated for a FrameIndex node), and that instruction's destination register happens to be the same as the register produced by the register scavenger, then the
`mv a0, a0` is generated. So, to eliminate this instruction, the eliminateFrameIndex function should not create a virtual register when the instruction's opcode is ADDI.
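The shape of the change, as a simplified sketch (the surrounding offset-materialization code, variable names, and exact conditions are assumptions, not the verbatim patch):
```
// In RISCVRegisterInfo::eliminateFrameIndex: the offset does not fit in
// ADDI's signed 12-bit immediate, so a scratch register is needed to
// materialize it.
Register ScratchReg;
if (MI.getOpcode() == RISCV::ADDI)
  // Reuse the ADDI's own destination register instead of creating a
  // virtual register for the scavenger, avoiding the redundant
  // `mv a0, a0` copy.
  ScratchReg = MI.getOperand(0).getReg();
else
  ScratchReg = MRI.createVirtualRegister(&RISCV::GPRRegClass);
TII->movImm(MBB, II, DL, ScratchReg, Offset);
```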
Differential Revision: https://reviews.llvm.org/D92479
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.
Suggested by @craig.topper on PR49658
If the mul had two users, one of which was a sext.w, the mul
would also be selected to a MULW before our pattern runs. This
causes the ANDs to now be used by both the already selected MULW
and the mul we still need to select. They are unneeded on the MULW
since MULW only reads the lower bits, so they get selected to
SLLI+SRLI for the MULW use. The use for the
(mul (and X, 0xffffffff), (and Y, 0xffffffff)) manages to reuse
the SLLI.
The end result is increased register pressure and no improvement
to how soon we can start the MULW.
This optimization is trying to save the SRLI instructions needed to
implement the ANDs. If we have zext.w we won't save anything.
And because we don't check that the multiply is the only user of the
ANDs, we might even increase the instruction count.
This pattern computes the full 64-bit product of a 32x32 unsigned
multiply. This requires two pairs of SLLI+SRLI to zero the
upper 32 bits of the inputs.
We can do better than this by using two SLLIs to move the lower
bits of each input to the upper bits, then using MULHU to compute
the high half of the full 64x64 product. Since we put 32 zeros in the
lower bits of each input, we know the 128-bit product will have zeros in
the lower 64 bits. So the upper 64 bits, which MULHU computes, contain
the original 64-bit product we were after.
The same trick would work for (mul (sext_inreg X, i32), (sext_inreg Y, i32))
using MULHS, but sext_inreg is sext.w, which is already one instruction, so we
wouldn't save anything.
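As a plain-arithmetic sanity check of the trick (standalone C++; `unsigned __int128` stands in for the 128-bit product that MULHU sees, and the test constants are arbitrary):
```
#include <cassert>
#include <cstdint>

// Baseline: zero the upper 32 bits of each input (one SLLI+SRLI pair per
// input), then multiply.
uint64_t mul32_masked(uint64_t x, uint64_t y) {
  return (x & 0xffffffff) * (y & 0xffffffff);
}

// Trick: shift the low 32 bits of each input into the high half (one SLLI
// per input), then take the high 64 bits of the 128-bit product, which is
// what MULHU computes. The 64 zero bits we shifted in force the low 64
// bits of the product to zero, so the high half is exactly the 64-bit
// product of the two 32-bit values.
uint64_t mul32_mulhu(uint64_t x, uint64_t y) {
  unsigned __int128 p =
      (unsigned __int128)(x << 32) * (unsigned __int128)(y << 32);
  return (uint64_t)(p >> 64);
}

int main() {
  assert(mul32_masked(0xdeadbeefcafef00dULL, 0x123456789abcdef0ULL) ==
         mul32_mulhu(0xdeadbeefcafef00dULL, 0x123456789abcdef0ULL));
  return 0;
}
```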
Differential Revision: https://reviews.llvm.org/D99026
This is a similarity visualization tool that accepts a Module and
passes it to the IRSimilarityIdentifier. The resulting SimilarityGroups
are output in a JSON file.
Tests are found in test/tools/llvm-sim and check for a missing file,
a bad module, and that the JSON is created correctly.
Reviewers: paquette, jroelofs, MaskRay
Recommit of: 15645d044bcfe2a0f63156048b302f997a717688 to fix linking
errors.
Differential Revision: https://reviews.llvm.org/D86974
BUILD_SHARED_LIBS builds each LLVM component as a shared library,
which can reduce the size a lot.
Normally, a binary uses $ORIGIN/../lib in its rpath to load the
component libraries; unfortunately, $ORIGIN is not supported by the AIX linker.
For now, we hardcode the build lib and install lib paths in the rpath
to enable BUILD_SHARED_LIBS builds.
We understand that this is not a perfect solution;
we can update this when we find a better one.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D98901
This makes the settings available for use in other passes by housing
them within the Support lib, but NFC otherwise.
See D98898 for the proposed usage in SimplifyCFG
(where this change was originally included).
Differential Revision: https://reviews.llvm.org/D98945
Also:
- Fix a bug that crept in when fixing a buildbot failure in
f7be9db622
- Use mlsize_t for cstr_to_string as that is what
caml_alloc_string specifies.
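For context, a sketch of the helper in question (simplified; the body is reconstructed from the description, not the exact binding code):
```
#include <caml/alloc.h>
#include <caml/mlvalues.h>
#include <string.h>

/* Copy Len bytes of a C string into a fresh OCaml string.
   caml_alloc_string takes an mlsize_t, so Len is declared as mlsize_t
   rather than an int-like type. */
value cstr_to_string(const char *Str, mlsize_t Len) {
  value String = caml_alloc_string(Len);
  if (Str)
    memcpy((char *)String_val(String), Str, Len);
  return String;
}
```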
Differential Revision: https://reviews.llvm.org/D98851
Now that intrinsic name mangling can cope with unnamed types, the custom name mangling in PredicateInfo (introduced by D49126) can be removed.
(See D91250, D48541)
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D91661
The BB in which we initialize ldtilecfg is special. We don't need to check
whether its predecessor BBs need to insert ldtilecfg for calls.
We reuse the flag HasCallBeforeAMX, so that the predecessors won't be
added to CfgNeedInsert.
This case happens only when the entry BB is in a loop. We need to hoist
the first tile config point out of the loop in the future.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D98845
There are some instances where we produce constants of type MVT::i64
unconditionally in the target DAG combines. This is not actually
valid in 32-bit mode.
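The shape of the fix, as a hypothetical sketch (illustrative combine code, not the actual patch):
```
// Before: unconditionally builds an i64 constant, which is not valid
// when compiling for 32-bit mode.
SDValue Mask = DAG.getConstant(Imm, DL, MVT::i64);

// After: use the type of the node being combined instead.
SDValue Mask = DAG.getConstant(Imm, DL, N->getValueType(0));
```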
Previously only immediate shifts were in WriteShift; register
shifts were grouped with IALU. It seems likely that immediate shifts
would be as fast as or faster than register shifts, and that immediate
shifts wouldn't be any faster than IALU. So if either kind deserved
its own group, it should be register shifts, not immediate shifts.
Rather than try to flip them, let's just add more granularity
and give each kind its own class. I've used new names for both to
make them unambiguous and to force any downstream implementations
to put correct information in their scheduler models.
Reviewed By: evandro
Differential Revision: https://reviews.llvm.org/D98911
The pass no longer handles skips. It now removes unnecessary
unconditional branches and lowers early termination branches.
Hence rename it to SILateBranchLowering.
Move code to handle returns to epilog from SIPreEmitPeephole
into SILateBranchLowering. This means SIPreEmitPeephole only
contains optional optimisations, and all required transforms
are in SILateBranchLowering.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D98915
SIRemoveShortExecBranches is an optimisation, so it fits well in the
context of SIPreEmitPeephole.
Test changes relate to early termination from kills which have now
been lowered prior to considering branches for removal.
As these use s_cbranch the execz skips are now retained instead.
Currently either behaviour is valid as kill with EXEC=0 is a nop;
however, if early termination is used differently in future then
the new behaviour is the correct one.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D98917
Add code so duplicate index register changes can be removed from
inside bundles.
Reviewed By: rampitec, foad
Differential Revision: https://reviews.llvm.org/D98940
The TargetMachine uses the triple to determine endianness. Just
use that logic rather than replicating it in PPCSubtarget.
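A hypothetical sketch of the direction (assumes the subtarget holds a TargetMachine reference named TM; not the verbatim patch):
```
// Defer to the triple-based endianness logic instead of replicating it:
bool PPCSubtarget::isLittleEndian() const {
  return TM.getTargetTriple().isLittleEndian();
}
```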
Differential Revision: https://reviews.llvm.org/D98674
- D96404 defaulted to libunwind which isn't provided by NDK r21
(or r22), so specify -rtlib=libgcc on non-arm32.
- D97993 means that we need to use --gcc-toolchain instead of -B
to let the driver find libgcc.
Issuing a lookup for an empty symbol set is legal, but can actually result in
unrelated work being done if there was a work queue left over from the previous
lookup. We can avoid doing this unrelated work (reducing stack depth and
interleaving of debugging output) by not issuing these no-op lookups in the
first place.
All loop passes should preserve all analyses in LoopAnalysisResults. Add
checks for those when checking is enabled (which it is by default when
expensive checks are on).
Note that due to PR44815, we don't check LAR's ScalarEvolution.
Apparently calling SE.verify() can change its results.
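Roughly, the checks look like this (a simplified sketch; member names follow LoopStandardAnalysisResults, and the exact guard is an assumption):
```
#ifndef NDEBUG
  // All analyses in LoopAnalysisResults (LAR) must still be valid after
  // each loop pass runs.
  LAR.DT.verify();
  LAR.LI.verify(LAR.DT);
  // Deliberately skipped: LAR.SE.verify() can change SE's results (PR44815).
  if (LAR.MSSA && VerifyMemorySSA)
    LAR.MSSA->verifyMemorySSA();
#endif
```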
This is a reland of https://reviews.llvm.org/D98820 which was reverted
due to unacceptably large compile time regressions in normal debug
builds.
There is a bunch of similar bitfield extraction code throughout the
*ISelDAGToDAG files. E.g., ARMISelDAGToDAG, AArch64ISelDAGToDAG, and
AMDGPUISelDAGToDAG all contain code that matches a bitfield extract
from an and + right shift.
Rather than duplicating code in the same way, this adds two opcodes:
- G_UBFX (unsigned bitfield extract)
- G_SBFX (signed bitfield extract)
They work like this:
```
%x = G_UBFX %y, %lsb, %width
```
Where `lsb` and `width` are
- The least-significant bit of the extraction
- The width of the extraction
This will extract `width` bits from `%y`, starting at `lsb`. G_UBFX zero-extends
the result, while G_SBFX sign-extends the result.
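A standalone C++ sketch of the intended semantics (assuming a 64-bit value and 0 < width, lsb + width <= 64):
```
#include <cassert>
#include <cstdint>

// G_UBFX: extract `width` bits of `y` starting at `lsb`, zero-extended.
uint64_t ubfx(uint64_t y, unsigned lsb, unsigned width) {
  uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
  return (y >> lsb) & mask;
}

// G_SBFX: the same extraction, but sign-extended from bit `width - 1`.
int64_t sbfx(uint64_t y, unsigned lsb, unsigned width) {
  uint64_t v = ubfx(y, lsb, width);
  uint64_t sign = 1ULL << (width - 1);
  return (int64_t)((v ^ sign) - sign);
}

int main() {
  assert(ubfx(0xF0, 4, 4) == 0xF); // bits [7:4]
  assert(sbfx(0xF0, 4, 4) == -1);  // 0xF sign-extended from 4 bits
  return 0;
}
```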
This should allow us to use the combiner to match the bitfield extraction
patterns rather than duplicating pattern-matching code in each target.
Differential Revision: https://reviews.llvm.org/D98464
This reverts commit 94c269baf58330a5e303a4f86f64681f2f7a858b.
Still causes too large a compile-time regression in normal debug
builds. Will put it under expensive checks instead.
Split from D91844.
Fix a potential null-pointer dereference of the return value of the function `ModuleLazyLoaderCache::operator()` in the file llvm/tools/llvm-link/llvm-link.cpp. According to a report from my static analyzer, the std::function variable `ModuleLazyLoaderCache::createLazyModule` points to the function `loadFile`, which may return `nullptr` on error, and the pointer is dereferenced without a check.
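The guard, as a simplified sketch (member names and error-handling details are assumptions, not the verbatim patch):
```
// In ModuleLazyLoaderCache::operator(): createLazyModule wraps loadFile,
// which returns nullptr on error, so check before dereferencing.
auto &Module = ModuleMap[Identifier];
if (!Module) {
  Module = createLazyModule(argv0, Identifier);
  if (!Module)
    report_fatal_error(Twine("Failed to load module: ") + Identifier);
}
return *Module;
```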
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D97258
All loop passes should preserve all analyses in LoopAnalysisResults. Add
checks for those.
Note that due to PR44815, we don't check LAR's ScalarEvolution.
Apparently calling SE.verify() can change its results.
Only verify MSSA when VerifyMemorySSA is set, since it is normally very expensive.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98820