==================
Vectorization Plan
==================

.. contents::
   :local:

Abstract
========
The vectorization transformation can be rather complicated, involving several
potential alternatives, especially for outer-loops [1]_ but also possibly for
innermost loops. These alternatives may have significant performance impact,
both positive and negative. A cost model is therefore employed to identify the
best alternative, including the alternative of avoiding any transformation
altogether.

The Vectorization Plan is an explicit model for describing vectorization
candidates. It serves both for optimizing candidates, including estimating
their cost reliably, and for performing their final translation into IR. This
facilitates dealing with multiple vectorization candidates.

High-level Design
=================

Vectorization Workflow
----------------------
VPlan-based vectorization involves three major steps, taking a "scenario-based
approach" to vectorization planning:

1. Legal Step: check if a loop can be legally vectorized; encode constraints and
   artifacts if so.
2. Plan Step:

   a. Build initial VPlans following the constraints and decisions taken by
      Legal Step 1, and compute their cost.
   b. Apply optimizations to the VPlans, possibly forking additional VPlans.
      Prune sub-optimal VPlans having relatively high cost.
3. Execute Step: materialize the best VPlan. Note that this is the only step
   that modifies the IR.

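The three steps can be illustrated with a small, self-contained C++ sketch.
Everything in it is hypothetical: ``ToyLoop``, ``ToyPlan``, the helper functions
and the cost formula are invented for illustration and are not LLVM APIs. The
sketch only shows the shape of the workflow: legality first, then building and
pruning candidates, and a single final step that is allowed to mutate the loop.

.. code-block:: c++

   #include <cassert>
   #include <vector>

   // Toy stand-ins for illustration only; none of these names are LLVM APIs.
   struct ToyLoop {
     bool HasUnsafeDependence;   // pretend result of an earlier analysis
     unsigned ScalarCostPerIter; // made-up cost of one scalar iteration
     unsigned AppliedVF = 1;     // written only by the Execute step
   };

   struct ToyPlan {
     unsigned VF;   // vectorization factor modelled by this candidate
     unsigned Cost; // estimated cost per original iteration, scaled by 1000
   };

   // Step 1 (Legal): decide whether vectorization is allowed at all.
   bool isLegal(const ToyLoop &L) { return !L.HasUnsafeDependence; }

   // Step 2.a (Plan): build one candidate per VF and estimate its cost.
   // The overhead term VF * VF / 2 is an arbitrary toy cost model.
   std::vector<ToyPlan> buildPlans(const ToyLoop &L,
                                   const std::vector<unsigned> &VFs) {
     std::vector<ToyPlan> Plans;
     for (unsigned VF : VFs) {
       assert(VF > 1 && "a vector candidate needs VF > 1");
       unsigned PerVectorIter = L.ScalarCostPerIter + VF * VF / 2;
       Plans.push_back({VF, PerVectorIter * 1000 / VF});
     }
     return Plans;
   }

   // Step 2.b (Plan): prune candidates that do not beat the scalar loop and
   // return the cheapest survivor, or nullptr if none remains.
   const ToyPlan *selectBest(const ToyLoop &L,
                             const std::vector<ToyPlan> &Plans) {
     const ToyPlan *Best = nullptr;
     unsigned BestCost = L.ScalarCostPerIter * 1000; // cost of not vectorizing
     for (const ToyPlan &P : Plans)
       if (P.Cost < BestCost) {
         Best = &P;
         BestCost = P.Cost;
       }
     return Best;
   }

   // Step 3 (Execute): the only function that takes a mutable loop.
   void execute(const ToyPlan &P, ToyLoop &L) { L.AppliedVF = P.VF; }

   // Drives the three steps and returns the VF in effect (1 means untouched).
   unsigned vectorize(ToyLoop &L, const std::vector<unsigned> &VFs) {
     if (!isLegal(L))
       return 1;
     std::vector<ToyPlan> Plans = buildPlans(L, VFs);
     if (const ToyPlan *Best = selectBest(L, Plans))
       execute(*Best, L);
     return L.AppliedVF;
   }

Under this toy cost model, a loop with scalar cost 8 and candidate VFs 2, 4, 8
and 16 ends up with VF 4, while a loop with scalar cost 1 is left untouched
because no candidate beats the scalar version.
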
Design Guidelines
-----------------
In what follows, the term "input IR" refers to code that is fed into the
vectorizer whereas the term "output IR" refers to code that is generated by the
vectorizer. The output IR contains code that has been vectorized or "widened"
according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
according to an Unroll Factor (UF).
The design of VPlan follows several high-level guidelines:

1. Analysis-like: building and manipulating VPlans must not modify the input IR.
   In particular, if the best option is not to vectorize at all, the
   vectorization process terminates before reaching Step 3, and compilation
   should proceed as if VPlans had not been built.

2. Align Cost & Execute: each VPlan must support both estimating the cost and
   generating the output IR code, such that the cost estimation evaluates the
   to-be-generated code reliably.

3. Support vectorizing additional constructs:

   a. Outer-loop vectorization. In particular, VPlan must be able to model the
      control-flow of the output IR which may include multiple basic-blocks and
      nested loops.
   b. SLP vectorization.
   c. Combinations of the above, including nested vectorization: vectorizing
      both an inner loop and an outer-loop at the same time (each with its own
      VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
      inside [4]_, (re)vectorizing input IR containing vector code.
   d. Function vectorization [2]_.

4. Support multiple candidates efficiently. In particular, similar candidates
   related to a range of possible VF's and UF's must be represented efficiently.
   Potential versioning needs to be supported efficiently.

5. Support vectorizing idioms, such as interleaved groups of strided loads or
   stores. This is achieved by modeling a sequence of output instructions using
   a "Recipe", which is responsible for computing its cost and generating its
   code.

6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
   such regions may need to be, for example, predicated and linearized, or
   replicated VF*UF times to handle scalarized and predicated instructions.
   Inner loops are also modelled as SESE regions.

7. Support instruction-level analysis and transformation, as part of Planning
   Step 2.b: During vectorization instructions may need to be traversed, moved,
   replaced by other instructions, or created. For example, vector idiom
   detection and formation involves searching for and optimizing instruction
   patterns.

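Guidelines 2 and 5 can be made concrete with a minimal C++ sketch of a Recipe
that owns both its cost estimate and its code generation. The class names
(``ToyRecipe``, ``ToyWidenRecipe``, ``ToyReplicateRecipe``), the textual
"instructions" and the one-unit-per-instruction cost are assumptions made for
this example; the actual recipe classes in LLVM have a different interface.

.. code-block:: c++

   #include <cassert>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical interface: one object answers both "what does it cost?"
   // and "what does it generate?", so the two cannot silently diverge.
   struct ToyRecipe {
     virtual ~ToyRecipe() = default;
     // Estimated number of output instructions for a given VF.
     virtual unsigned cost(unsigned VF) const = 0;
     // Appends textual stand-ins for the generated instructions to Out.
     virtual void execute(unsigned VF, std::vector<std::string> &Out) const = 0;
   };

   // Widens one scalar ingredient into a single vector instruction.
   struct ToyWidenRecipe : ToyRecipe {
     std::string Ingredient;
     explicit ToyWidenRecipe(std::string I) : Ingredient(std::move(I)) {}
     unsigned cost(unsigned) const override { return 1; }
     void execute(unsigned VF, std::vector<std::string> &Out) const override {
       assert(VF >= 1 && "VF must be positive");
       Out.push_back(Ingredient + " <" + std::to_string(VF) + " x i32>");
     }
   };

   // Replicates one scalar ingredient once per vector lane.
   struct ToyReplicateRecipe : ToyRecipe {
     std::string Ingredient;
     explicit ToyReplicateRecipe(std::string I) : Ingredient(std::move(I)) {}
     unsigned cost(unsigned VF) const override { return VF; }
     void execute(unsigned VF, std::vector<std::string> &Out) const override {
       assert(VF >= 1 && "VF must be positive");
       for (unsigned Lane = 0; Lane < VF; ++Lane)
         Out.push_back(Ingredient + " i32 ; lane " + std::to_string(Lane));
     }
   };

   // Checks the "Align Cost & Execute" guideline on the toy model: the
   // estimate must equal the number of instructions actually emitted.
   bool costMatchesOutput(const ToyRecipe &R, unsigned VF) {
     std::vector<std::string> Out;
     R.execute(VF, Out);
     return R.cost(VF) == Out.size();
   }

In this sketch a widened ``add`` at VF 4 costs one unit and emits one
instruction, whereas a replicated ``store`` at VF 4 costs four units and emits
four instructions, one per lane.
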
Definitions
===========
The low-level design of VPlan comprises the following classes.

:LoopVectorizationPlanner:
  A LoopVectorizationPlanner is designed to handle the vectorization of a loop
  or a loop nest. It can construct, optimize and discard one or more VPlans,
  each VPlan modelling a distinct way to vectorize the loop or the loop nest.
  Once the best VPlan is determined, including the best VF and UF, this VPlan
  drives the generation of output IR.

:VPlan:
  A model of a vectorized candidate for a given input IR loop or loop nest. This
  candidate is represented using a Hierarchical CFG. VPlan supports estimating
  the cost and driving the generation of the output IR code it represents.

:Hierarchical CFG:
  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
  Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
  cross-Tile edges are lifted to connect Tiles instead of the original
  basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
  Region and Block are used rather than Tile [5]_ to avoid confusion with loop
  tiling.

:VPBlockBase:
  The building block of the Hierarchical CFG. A pure-virtual base-class of
  VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
  control-flow relations with other VPBlocks. Note that in contrast to the IR
  BasicBlock, a VPBlockBase models its control-flow successors and predecessors
  directly, rather than through a Terminator branch or through predecessor
  branches that "use" the VPBlockBase.

:VPBasicBlock:
  VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
  Hierarchical CFG. It represents a sequence of output IR instructions that will
  appear consecutively in an output IR basic-block. The instructions of this
  basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
  sequence of zero or more VPRecipes that model the cost and generation of the
  output IR instructions.

:VPRegionBlock:
  VPRegionBlock is a subclass of VPBlockBase. It models a collection of
  VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
  CFG. A VPRegionBlock may indicate that its contents are to be replicated a
  constant number of times when output IR is generated, effectively representing
  a loop with constant trip-count that will be completely unrolled. This is used
  to support scalarized and predicated instructions with a single model for
  multiple candidate VF's and UF's.

:VPRecipeBase:
  A pure-virtual base class modeling a sequence of one or more output IR
  instructions, possibly based on one or more input IR instructions. These
  input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
  may specify how its ingredients are to be transformed to produce the output IR
  instructions; e.g., cloned once, replicated multiple times or widened
  according to selected VF.

:VPValue:
  The base of VPlan's def-use relations class hierarchy. When instantiated, it
  models a constant or a live-in Value in VPlan. It has users, which are of type
  VPUser, but no operands.

:VPUser:
  A VPUser represents an entity that uses a number of VPValues as operands.
  VPUser is similar in some aspects to LLVM's User class.

:VPDef:
  A VPDef represents an entity that defines zero, one or multiple VPValues.
  It is used to model the fact that recipes in VPlan can define multiple
  VPValues.

:VPInstruction:
  A VPInstruction is both a VPRecipe and a VPUser. It models a single
  VPlan-level instruction to be generated if the VPlan is executed, including
  its opcode and possibly additional characteristics. It is the basis for
  writing instruction-level analyses and optimizations in VPlan, as creating,
  replacing or moving VPInstructions record both def-use and scheduling
  decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
  operations that enrich the Vectorizer's semantics.

:VPTransformState:
  Stores information used for generating output IR, passed from
  LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
  additional information down to VPBlocks and VPRecipes.

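The block classes above can be pictured with a small C++ sketch of a
hierarchical CFG. The types ``ToyBlock``, ``ToyBasicBlock`` and ``ToyRegion``,
the ``connect`` helper and the ``size`` query are hypothetical and greatly
simplified; they only mirror the ideas that blocks store their successors and
predecessors directly, that a region encapsulates a single-entry single-exit
subgraph, and that a replicating region stands for a constant number of copies
of its contents.

.. code-block:: c++

   #include <cassert>
   #include <set>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical base of the hierarchy: edges are stored on the block itself,
   // not derived from terminator instructions.
   struct ToyBlock {
     std::vector<ToyBlock *> Succs, Preds;
     virtual ~ToyBlock() = default;
     // Number of recipes this block stands for when replicating regions are
     // expanded Lanes times.
     virtual unsigned size(unsigned Lanes) const = 0;
   };

   void connect(ToyBlock &From, ToyBlock &To) {
     From.Succs.push_back(&To);
     To.Preds.push_back(&From);
   }

   // Leaf of the hierarchy: a straight-line sequence of recipe names.
   struct ToyBasicBlock : ToyBlock {
     std::vector<std::string> Recipes;
     explicit ToyBasicBlock(std::vector<std::string> R)
         : Recipes(std::move(R)) {}
     unsigned size(unsigned) const override {
       return static_cast<unsigned>(Recipes.size());
     }
   };

   // Single-entry single-exit subgraph, optionally replicated per lane.
   struct ToyRegion : ToyBlock {
     ToyBlock *Entry;
     ToyBlock *Exit;
     bool IsReplicator;
     ToyRegion(ToyBlock &En, ToyBlock &Ex, bool Replicate)
         : Entry(&En), Exit(&Ex), IsReplicator(Replicate) {}
     unsigned size(unsigned Lanes) const override {
       assert(Lanes >= 1 && "need at least one lane");
       unsigned Sum = 0;
       std::set<const ToyBlock *> Seen;
       std::vector<const ToyBlock *> Work{Entry};
       while (!Work.empty()) {
         const ToyBlock *B = Work.back();
         Work.pop_back();
         if (!Seen.insert(B).second)
           continue; // already counted
         Sum += B->size(Lanes);
         if (B == Exit)
           continue; // do not walk past the region's single exit
         for (const ToyBlock *S : B->Succs)
           Work.push_back(S);
       }
       return IsReplicator ? Sum * Lanes : Sum;
     }
   };

For example, a replicating if-then region with three one-recipe blocks stands
for twelve recipes at four lanes, and an enclosing non-replicating region adds
only the recipes of the blocks before and after it.
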
The Planning Process and VPlan Roadmap
======================================

Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
VPlan is used to record the final vectorization decisions, and to execute them:
the Hierarchical CFG models the planned control-flow, and Recipes capture
decisions taken inside basic-blocks. Next, VPlan will also be used as the basis
for taking these decisions, effectively turning them into a series of
VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
itself, including cost-based analyses for making these decisions, to fully
support compositional and iterative decision making.

Some decisions are local to an instruction in the loop, such as whether to widen
it into a vector instruction or replicate it, keeping the generated instructions
in place. Other decisions, however, involve moving instructions, replacing them
with other instructions, and/or introducing new instructions. For example, a
cast may sink past a later instruction and be widened to handle first-order
recurrence; an interleave group of strided gathers or scatters may effectively
move to one place where they are replaced with shuffles and a common wide vector
load or store; new instructions may be introduced to compute masks, shuffle the
elements of vectors, and pack scalar values into vectors or vice-versa.

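The interleave-group case can be illustrated with a short C++ sketch that
computes the shuffle indices needed to split one wide load back into the
members of a group. The functions ``deinterleaveMask`` and ``shuffle`` are
hypothetical helpers written for this example, and plain ``std::vector`` stands
in for vector registers; this is not how the Loop Vectorizer is implemented.

.. code-block:: c++

   #include <cassert>
   #include <vector>

   // Returns the shuffle indices that select member Member of an interleave
   // group with Factor members out of a wide vector of VF * Factor elements.
   std::vector<unsigned> deinterleaveMask(unsigned Member, unsigned Factor,
                                          unsigned VF) {
     assert(Member < Factor && "member index out of range");
     std::vector<unsigned> Mask;
     for (unsigned Lane = 0; Lane < VF; ++Lane)
       Mask.push_back(Member + Lane * Factor);
     return Mask;
   }

   // Toy single-operand shuffle: picks Wide[Mask[i]] for every mask entry.
   std::vector<int> shuffle(const std::vector<int> &Wide,
                            const std::vector<unsigned> &Mask) {
     std::vector<int> Out;
     for (unsigned Idx : Mask) {
       assert(Idx < Wide.size() && "mask index out of range");
       Out.push_back(Wide[Idx]);
     }
     return Out;
   }

With a factor of 2 and a VF of 4, the two strided accesses ``A[2*i]`` and
``A[2*i+1]`` are served by one wide load of eight elements followed by two
shuffles, using the masks ``{0, 2, 4, 6}`` and ``{1, 3, 5, 7}``.
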
In order for VPlan to support making instruction-level decisions and analyses,
it needs to model the relevant instructions along with their def/use relations.
This too follows a staged approach: first, the new instructions that compute
masks are modeled as VPInstructions, along with their induced def/use subgraph.
This effectively models masks in VPlan, facilitating VPlan-based predication.
Next, the logic embedded within each Recipe for generating its instructions at
VPlan execution time will instead take part in the planning process by modeling
them as VPInstructions. Finally, only logic that applies to instructions as a
group will remain in Recipes, such as interleave groups and potentially other
idiom groups having synergistic cost.

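A minimal C++ sketch of such def/use relations is given here, loosely following
the roles of VPValue, VPUser and VPDef from the Definitions section. The types
``ToyValue`` and ``ToyRecipe`` and both traversal helpers are hypothetical; in
particular, the sketch folds the "user" and "def" roles into one struct for
brevity, which is a simplification and not the actual class layout.

.. code-block:: c++

   #include <cassert>
   #include <string>
   #include <vector>

   struct ToyRecipe;

   // A value with users but no operands of its own.
   struct ToyValue {
     std::string Name;
     ToyRecipe *Def = nullptr;       // defining recipe; null for live-ins
     std::vector<ToyRecipe *> Users; // recipes using this value as an operand
   };

   // Plays both roles: it uses operands and defines zero or more values.
   struct ToyRecipe {
     std::vector<ToyValue *> Operands; // the "user" role
     std::vector<ToyValue *> Defined;  // the "def" role
     void addOperand(ToyValue &V) {
       Operands.push_back(&V);
       V.Users.push_back(this);
     }
     void addResult(ToyValue &V) {
       assert(V.Def == nullptr && "value already has a defining recipe");
       V.Def = this;
       Defined.push_back(&V);
     }
   };

   // Walks the def-use chain upwards from V and records the names of the
   // live-ins it reaches (a live-in reached twice is recorded twice).
   void collectLiveIns(const ToyValue &V, std::vector<std::string> &Out) {
     if (V.Def == nullptr) {
       Out.push_back(V.Name);
       return;
     }
     for (const ToyValue *Op : V.Def->Operands)
       collectLiveIns(*Op, Out);
   }

   // Walks the def-use chain downwards: the users of a recipe are the users
   // of every value it defines.
   std::vector<const ToyRecipe *> usersOf(const ToyRecipe &R) {
     std::vector<const ToyRecipe *> Out;
     for (const ToyValue *V : R.Defined)
       for (const ToyRecipe *U : V->Users)
         Out.push_back(U);
     return Out;
   }

In this model a store-like recipe simply defines no values, a typical recipe
defines one, and a recipe that produces several results lists each of them in
``Defined``.
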
Related LLVM components
-----------------------
1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
   tree, where TSLP [3]_ adds Plan Step 2.b.

2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
   Polly [7]_.

3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
   the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.

References
----------
.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
    Nuzman and Ayal Zaks, PACT 2008.

.. [2] "Proposal for function vectorization and loop vectorization with function
    calls", Xinmin Tian, `cfe-dev
    <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_,
    March 2, 2016.
    See also `review <https://reviews.llvm.org/D22792>`_.

.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
    Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.

.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
    overhead", Hao Zhou and Jingling Xue, CGO 2016.

.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
    Brian Koblenz, PLDI 1991.

.. [6] "Structural analysis: A new approach to flow analysis in optimizing
    compilers", M. Sharir, Journal of Computer Languages, Jan. 1980.

.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
    thesis, 2011.

.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
    European LLVM Developers' Meeting 2017.

.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
    Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.