This modified patch avoids redirecting the unit in which a subprogram is
created if type units are enabled -- DIEs were getting children allocated
from different units memory pools. Original commit message:
[DWARF] Create subprogram's DIE in DISubprogram's unit
This is a fix for PR48790. Over in D70350, subprogram DIEs were permitted
to be shared between CUs. However, the creation of a subprogram DIE can be
triggered early, from other CUs. The subprogram definition is then created
in one CU, and when the function is actually emitted children are attached
to the subprogram that expect to be in another CU. This breaks internal CU
references in the children.
Fix this by redirecting the creation of subprogram DIEs in
getOrCreateContextDIE to the CU specified by it's DISubprogram definition.
This ensures that the subprogram DIE is always created in the correct CU.
Differential Revision: https://reviews.llvm.org/D94976
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat a f16 vector as an i8.
Add f16 handling, treating them like an i16.
Differential Revision: https://reviews.llvm.org/D95446
This patch implements generation of remaining header search arguments.
It's done manually in C++ as opposed to TableGen, because we need the flexibility and don't anticipate reuse.
This patch also tests the generation of header search options via a round-trip. This way, the code gets exercised whenever Clang is built and tested in asserts mode. All `check-clang` tests pass.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D94472
As noted in https://reviews.llvm.org/D93459, the formatting of
multi-line descriptions of clEnumValN and the likes is unfavorable.
Thus this patch adds support for correctly indenting these.
Reviewed By: serge-sans-paille
Differential Revision: https://reviews.llvm.org/D93494
When SGPRs are spilled to VGPRs, they can overwrite any lane. We need
to preserve the value of inactive lanes in function calls, so we save
the register even if it is marked as caller saved.
Also, teach buildPrologSpill to work when no registers are free like in
CodeGen/AMDGPU/pei-scavenge-vgpr-spill.mir and update the comment on
findScratchNonCalleeSaveRegister as it is not used anymore to realign
the stack pointer since D95865.
Differential Revision: https://reviews.llvm.org/D95946
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]
Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.
Differential Revision: https://reviews.llvm.org/D93556
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.
Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.
Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.
Differential Revision: https://reviews.llvm.org/D94110
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]
Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.
Differential Revision: https://reviews.llvm.org/D93556
The collapseLoops method implements a transformations facilitating the implementation of the collapse-clause. It takes a list of loops from a loop nest and reduces it to a single loop that can be used by other methods that are implemented on just a single loop, such as createStaticWorkshareLoop.
This patch shares some changes with D92974 (such as adding some getters to CanonicalLoopNest), used by both patches.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93268
This change implements profile generation infra for pseudo probe in llvm-profgen. During virtual unwinding, the raw profile is extracted into range counter and branch counter and aggregated to sample counter map indexed by the call stack context. This change introduces the last step and produces the eventual profile. Specifically, the body of function sample is recorded by going through each probe among the range and callsite target sample is recorded by extracting the callsite probe from branch's source.
Please refer https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.
**Implementation**
- Extended `PseudoProbeProfileGenerator` for pseudo probe based profile generation.
- `populateBodySamplesWithProbes` reading range counter is responsible for recording function body samples and inferring caller's body samples.
- `populateBoundarySamplesWithProbes` reading branch counter is responsible for recording call site target samples.
- Each sample is recorded with its calling context(named `ContextId`). Remind that the probe based context key doesn't include the leaf frame probe info, so the `ContextId` string is created from two part: one from the probe stack strings' concatenation and other one from the leaf frame probe.
- Added regression test
Test Plan:
ninja & ninja check-llvm
Differential Revision: https://reviews.llvm.org/D92998
Similar to the G_PTR_ADD + G_LOAD twiddling we do in `preISelLower`.
The imported patterns expect scalars only, so they can't handle things like
```
G_STORE %ptr1, %ptr2
```
To get around this, use s64 instead.
(This probably makes a good portion of the manual selection code for G_STORE
dead.)
This is a 0.2% geomean code size improvement on CTMark at -Os.
(Best is consumer-typeset @ -0.7%)
Differential Revision: https://reviews.llvm.org/D95908
When we have a zeroext parameter coming in on the stack, build
```
%x = G_LOAD ...
%x_assert_zext = G_ASSERT_ZEXT %x, narrow_size
%trunc = G_TRUNC %x_assert_zext
```
Rather than just loading into the truncated type.
This allows us to optimize cases like this: https://godbolt.org/z/vfjhW8
Differential Revision: https://reviews.llvm.org/D95805
This turns on the new pass manager by default for the optimization pipeline in
Clang and ThinLTO in various LLD backends. This also makes uses of `opt
-instcombine` use the new pass manager (unless specifically opted out).
This does not affect the backend target-dependent codegen pipeline.
If this causes regressions, you can opt out of the new pass manager
either via the -DENABLE_EXPERIMENTAL_NEW_PASS_MANAGER=OFF CMake flag
while building LLVM, or via various compiler flags, e.g.
-flegacy-pass-manager for Clang or -Wl,--lto-legacy-pass-manager for
ELF LLD. Please file bugs for any regressions.
Major differences:
* The inliner works slightly differently
* -O1 does some amount of inlining
* LCSSA and LoopSimplify are run before all loop passes
* Loop unswitching is implemented slightly differently
* A new SpeculateAroundPHIs pass is added to the pipeline
https://lists.llvm.org/pipermail/llvm-dev/2021-January/148098.html
Reviewed By: asbirlea, ychen, MaskRay, echristo
Differential Revision: https://reviews.llvm.org/D95380
This patch detaches SampleProfileLoader from class
SampleCoverageTracker. We plan to move SampleProfileLoader
to a template class. This would remain SampleCoverageTracker
as a class.
Also make callsiteIsHot() as a file static function.
Differential Revision: https://reviews.llvm.org/D95823
These two cases have identical implementations other than an
unreachable part of `G_ADD` that checks if the scalar we're narrowing
is a vector. Combining them to avoid unnecessary divergence.
This was only adding undef to the use if the copy itself had a
subregister index. It did not consider the subrange liveness if the
use had a subreg index to begin with.
If we had a pair of copies inside a loop which introduced new liveness
to a subregister which was undef before the loop, we would have a
dummy phi-only segment remaining across the loop body. Later, this
false segment would confuse RenameIndependentSubregs causing it to
introduce IMPLICIT_DEFs with broken value numbering.
It seems always adding the lanes to ShrinkMask is OK, so any
conditions should be purely a compile time filter.
If sext_inreg is supported, we will turn this into sext_inreg. That
will then remove it if there are enough sign bits. But if sext_inreg
isn't supported, we can still remove the shift pair based on sign
bits.
Split from D95890.
This patch updates the induction value creation to use VPValues of
recipes to map the created values. This should bring is one step closer
to being able to optimize induction recipes directly in VPlan.
Currently widenIntOrFpInduction also generates vector values for a cast
of the induction, if it exists. Make this explicit by adding the cast
instruction to the values defined by the recipe.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D92284
Discussed in this thread:
https://lists.llvm.org/pipermail/llvm-dev/2021-January/148139.html
DwarfDebug::collectEntityInfo accidentally distinguishes between variable
locations that never have a location specified, and variable locations that
have an empty location specified. The latter leads to the creation of an
empty variable referring to the abstract origin.
Fix this by seeking a non-empty location before producing a concrete
entity, to guarantee a DW_AT_location will be produced. Other loops in
collectEntityInfo and endFunctionImpl take care of examining the
retainedNodes collection and ensuring optimised-out variables are created.
Differential Revision: https://reviews.llvm.org/D95617
On z/OS, other error messages are not matched correctly in lit tests.
```
EDC5121I Invalid argument.
EDC5111I Permission denied.
```
This patch adds a lit substitution to fix it.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D95808
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are. This will
allow moving the special input registers out of the normal argument
range.
Extends D95779 to permit insertion into float/doubles vectors while avoiding a lot of aliased memory traffic.
The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.
SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.
Differential Revision: https://reviews.llvm.org/D95866
This reverts commits 62af0305b7cc..677a3529d3e6 from D93708.
They cause failures in the sanitizer builds because of uninitialized
values.
A fix is in D95878, but it might take some time until this is pushed,
so reverting the changes for now.
This patch adds a cost model for SK_Broadcast in
AArch64TTIImpl::getShuffleCost with scalable vector.
Without this patch, the scalable vector type relies on BasicTTIImpl cost
implementation and assert.
Differential Revision: https://reviews.llvm.org/D95598
This patch adds constructors to VPIteration as a cleaner way of
initialising the struct and replaces existing constructions of
the form:
{Part, Lane}
with
VPIteration(Part, Lane)
I have also added a default constructor, which is used by VPlan.cpp
when deciding whether to replicate a block or not.
This refactoring will be required in a later patch that adds more
members and functions to VPIteration.
Differential Revision: https://reviews.llvm.org/D95676