1
0
mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-10-26 22:42:46 +02:00
llvm-mirror/lib/Transforms/Scalar/SimplifyCFGPass.cpp

213 lines
7.7 KiB
C++
Raw Normal View History

//===- SimplifyCFGPass.cpp - CFG Simplification Pass ----------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
2007-12-03 20:43:18 +01:00
// This file implements dead code elimination and basic block merging, along
// with a collection of other peephole control flow optimizations. For example:
//
// * Removes basic blocks with no predecessors.
// * Merges a basic block into its predecessor if there is only one and the
// predecessor only has one successor.
// * Eliminates PHI nodes for basic blocks with a single predecessor.
// * Eliminates a basic block that only contains an unconditional branch.
2007-12-03 20:43:18 +01:00
// * Changes invoke instructions to nounwind functions to be calls.
// * Change things like "if (x) if (y)" into "if (x&y)".
// * etc..
//
//===----------------------------------------------------------------------===//
#include "llvm/Transforms/Scalar.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"
Make use of @llvm.assume in ValueTracking (computeKnownBits, etc.) This change, which allows @llvm.assume to be used from within computeKnownBits (and other associated functions in ValueTracking), adds some (optional) parameters to computeKnownBits and friends. These functions now (optionally) take a "context" instruction pointer, an AssumptionTracker pointer, and also a DomTree pointer, and most of the changes are just to pass this new information when it is easily available from InstSimplify, InstCombine, etc. As explained below, the significant conceptual change is that known properties of a value might depend on the control-flow location of the use (because we care that the @llvm.assume dominates the use because assumptions have control-flow dependencies). This means that, when we ask if bits are known in a value, we might get different answers for different uses. The significant changes are all in ValueTracking. Two main changes: First, as with the rest of the code, new parameters need to be passed around. To make this easier, I grouped them into a structure, and I made internal static versions of the relevant functions that take this structure as a parameter. The new code does as you might expect, it looks for @llvm.assume calls that make use of the value we're trying to learn something about (often indirectly), attempts to pattern match that expression, and uses the result if successful. By making use of the AssumptionTracker, the process of finding @llvm.assume calls is not expensive. Part of the structure being passed around inside ValueTracking is a set of already-considered @llvm.assume calls. This is to prevent a query using, for example, the assume(a == b), to recurse on itself. The context and DT params are used to find applicable assumptions. An assumption needs to dominate the context instruction, or come after it deterministically. In this latter case we only handle the specific case where both the assumption and the context instruction are in the same block, and we need to exclude assumptions from being used to simplify their own ephemeral values (those which contribute only to the assumption) because otherwise the assumption would prove its feeding comparison trivial and would be removed. This commit adds the plumbing and the logic for a simple masked-bit propagation (just enough to write a regression test). Future commits add more patterns (and, correspondingly, more regression tests). llvm-svn: 217342
2014-09-07 20:57:58 +02:00
#include "llvm/Analysis/AssumptionTracker.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;
#define DEBUG_TYPE "simplifycfg"
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
static cl::opt<unsigned>
UserBonusInstThreshold("bonus-inst-threshold", cl::Hidden, cl::init(1),
cl::desc("Control the number of bonus instructions (default = 1)"));
STATISTIC(NumSimpl, "Number of blocks simplified");
namespace {
struct CFGSimplifyPass : public FunctionPass {
static char ID; // Pass identification, replacement for typeid
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
unsigned BonusInstThreshold;
CFGSimplifyPass(int T = -1) : FunctionPass(ID) {
BonusInstThreshold = (T == -1) ? UserBonusInstThreshold : unsigned(T);
initializeCFGSimplifyPassPass(*PassRegistry::getPassRegistry());
}
bool runOnFunction(Function &F) override;
void getAnalysisUsage(AnalysisUsage &AU) const override {
Make use of @llvm.assume in ValueTracking (computeKnownBits, etc.) This change, which allows @llvm.assume to be used from within computeKnownBits (and other associated functions in ValueTracking), adds some (optional) parameters to computeKnownBits and friends. These functions now (optionally) take a "context" instruction pointer, an AssumptionTracker pointer, and also a DomTree pointer, and most of the changes are just to pass this new information when it is easily available from InstSimplify, InstCombine, etc. As explained below, the significant conceptual change is that known properties of a value might depend on the control-flow location of the use (because we care that the @llvm.assume dominates the use because assumptions have control-flow dependencies). This means that, when we ask if bits are known in a value, we might get different answers for different uses. The significant changes are all in ValueTracking. Two main changes: First, as with the rest of the code, new parameters need to be passed around. To make this easier, I grouped them into a structure, and I made internal static versions of the relevant functions that take this structure as a parameter. The new code does as you might expect, it looks for @llvm.assume calls that make use of the value we're trying to learn something about (often indirectly), attempts to pattern match that expression, and uses the result if successful. By making use of the AssumptionTracker, the process of finding @llvm.assume calls is not expensive. Part of the structure being passed around inside ValueTracking is a set of already-considered @llvm.assume calls. This is to prevent a query using, for example, the assume(a == b), to recurse on itself. The context and DT params are used to find applicable assumptions. An assumption needs to dominate the context instruction, or come after it deterministically. In this latter case we only handle the specific case where both the assumption and the context instruction are in the same block, and we need to exclude assumptions from being used to simplify their own ephemeral values (those which contribute only to the assumption) because otherwise the assumption would prove its feeding comparison trivial and would be removed. This commit adds the plumbing and the logic for a simple masked-bit propagation (just enough to write a regression test). Future commits add more patterns (and, correspondingly, more regression tests). llvm-svn: 217342
2014-09-07 20:57:58 +02:00
AU.addRequired<AssumptionTracker>();
AU.addRequired<TargetTransformInfo>();
}
};
}
char CFGSimplifyPass::ID = 0;
INITIALIZE_PASS_BEGIN(CFGSimplifyPass, "simplifycfg", "Simplify the CFG", false,
false)
INITIALIZE_AG_DEPENDENCY(TargetTransformInfo)
Make use of @llvm.assume in ValueTracking (computeKnownBits, etc.) This change, which allows @llvm.assume to be used from within computeKnownBits (and other associated functions in ValueTracking), adds some (optional) parameters to computeKnownBits and friends. These functions now (optionally) take a "context" instruction pointer, an AssumptionTracker pointer, and also a DomTree pointer, and most of the changes are just to pass this new information when it is easily available from InstSimplify, InstCombine, etc. As explained below, the significant conceptual change is that known properties of a value might depend on the control-flow location of the use (because we care that the @llvm.assume dominates the use because assumptions have control-flow dependencies). This means that, when we ask if bits are known in a value, we might get different answers for different uses. The significant changes are all in ValueTracking. Two main changes: First, as with the rest of the code, new parameters need to be passed around. To make this easier, I grouped them into a structure, and I made internal static versions of the relevant functions that take this structure as a parameter. The new code does as you might expect, it looks for @llvm.assume calls that make use of the value we're trying to learn something about (often indirectly), attempts to pattern match that expression, and uses the result if successful. By making use of the AssumptionTracker, the process of finding @llvm.assume calls is not expensive. Part of the structure being passed around inside ValueTracking is a set of already-considered @llvm.assume calls. This is to prevent a query using, for example, the assume(a == b), to recurse on itself. The context and DT params are used to find applicable assumptions. An assumption needs to dominate the context instruction, or come after it deterministically. In this latter case we only handle the specific case where both the assumption and the context instruction are in the same block, and we need to exclude assumptions from being used to simplify their own ephemeral values (those which contribute only to the assumption) because otherwise the assumption would prove its feeding comparison trivial and would be removed. This commit adds the plumbing and the logic for a simple masked-bit propagation (just enough to write a regression test). Future commits add more patterns (and, correspondingly, more regression tests). llvm-svn: 217342
2014-09-07 20:57:58 +02:00
INITIALIZE_PASS_DEPENDENCY(AssumptionTracker)
INITIALIZE_PASS_END(CFGSimplifyPass, "simplifycfg", "Simplify the CFG", false,
false)
// Public interface to the CFGSimplification pass
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
FunctionPass *llvm::createCFGSimplificationPass(int Threshold) {
return new CFGSimplifyPass(Threshold);
}
/// mergeEmptyReturnBlocks - If we have more than one empty (other than phi
/// node) return blocks, merge them together to promote recursive block merging.
static bool mergeEmptyReturnBlocks(Function &F) {
bool Changed = false;
2012-07-24 12:51:42 +02:00
BasicBlock *RetBlock = nullptr;
2012-07-24 12:51:42 +02:00
// Scan all the blocks in the function, looking for empty return blocks.
for (Function::iterator BBI = F.begin(), E = F.end(); BBI != E; ) {
BasicBlock &BB = *BBI++;
2012-07-24 12:51:42 +02:00
// Only look at return blocks.
ReturnInst *Ret = dyn_cast<ReturnInst>(BB.getTerminator());
if (!Ret) continue;
2012-07-24 12:51:42 +02:00
// Only look at the block if it is empty or the only other thing in it is a
// single PHI node that is the operand to the return.
if (Ret != &BB.front()) {
// Check for something else in the block.
BasicBlock::iterator I = Ret;
--I;
// Skip over debug info.
while (isa<DbgInfoIntrinsic>(I) && I != BB.begin())
--I;
if (!isa<DbgInfoIntrinsic>(I) &&
(!isa<PHINode>(I) || I != BB.begin() ||
Ret->getNumOperands() == 0 ||
Ret->getOperand(0) != I))
continue;
}
// If this is the first returning block, remember it and keep going.
if (!RetBlock) {
RetBlock = &BB;
continue;
}
2012-07-24 12:51:42 +02:00
// Otherwise, we found a duplicate return block. Merge the two.
Changed = true;
2012-07-24 12:51:42 +02:00
// Case when there is no input to the return or when the returned values
// agree is trivial. Note that they can't agree if there are phis in the
// blocks.
if (Ret->getNumOperands() == 0 ||
2012-07-24 12:51:42 +02:00
Ret->getOperand(0) ==
cast<ReturnInst>(RetBlock->getTerminator())->getOperand(0)) {
BB.replaceAllUsesWith(RetBlock);
BB.eraseFromParent();
continue;
}
2012-07-24 12:51:42 +02:00
// If the canonical return block has no PHI node, create one now.
PHINode *RetBlockPHI = dyn_cast<PHINode>(RetBlock->begin());
if (!RetBlockPHI) {
Value *InVal = cast<ReturnInst>(RetBlock->getTerminator())->getOperand(0);
pred_iterator PB = pred_begin(RetBlock), PE = pred_end(RetBlock);
RetBlockPHI = PHINode::Create(Ret->getOperand(0)->getType(),
std::distance(PB, PE), "merge",
&RetBlock->front());
2012-07-24 12:51:42 +02:00
for (pred_iterator PI = PB; PI != PE; ++PI)
RetBlockPHI->addIncoming(InVal, *PI);
RetBlock->getTerminator()->setOperand(0, RetBlockPHI);
}
2012-07-24 12:51:42 +02:00
// Turn BB into a block that just unconditionally branches to the return
// block. This handles the case when the two return blocks have a common
// predecessor but that return different things.
RetBlockPHI->addIncoming(Ret->getOperand(0), &BB);
BB.getTerminator()->eraseFromParent();
BranchInst::Create(RetBlock, &BB);
}
2012-07-24 12:51:42 +02:00
return Changed;
}
/// iterativelySimplifyCFG - Call SimplifyCFG on all the blocks in the function,
/// iterating until no more changes are made.
static bool iterativelySimplifyCFG(Function &F, const TargetTransformInfo &TTI,
Make use of @llvm.assume in ValueTracking (computeKnownBits, etc.) This change, which allows @llvm.assume to be used from within computeKnownBits (and other associated functions in ValueTracking), adds some (optional) parameters to computeKnownBits and friends. These functions now (optionally) take a "context" instruction pointer, an AssumptionTracker pointer, and also a DomTree pointer, and most of the changes are just to pass this new information when it is easily available from InstSimplify, InstCombine, etc. As explained below, the significant conceptual change is that known properties of a value might depend on the control-flow location of the use (because we care that the @llvm.assume dominates the use because assumptions have control-flow dependencies). This means that, when we ask if bits are known in a value, we might get different answers for different uses. The significant changes are all in ValueTracking. Two main changes: First, as with the rest of the code, new parameters need to be passed around. To make this easier, I grouped them into a structure, and I made internal static versions of the relevant functions that take this structure as a parameter. The new code does as you might expect, it looks for @llvm.assume calls that make use of the value we're trying to learn something about (often indirectly), attempts to pattern match that expression, and uses the result if successful. By making use of the AssumptionTracker, the process of finding @llvm.assume calls is not expensive. Part of the structure being passed around inside ValueTracking is a set of already-considered @llvm.assume calls. This is to prevent a query using, for example, the assume(a == b), to recurse on itself. The context and DT params are used to find applicable assumptions. An assumption needs to dominate the context instruction, or come after it deterministically. In this latter case we only handle the specific case where both the assumption and the context instruction are in the same block, and we need to exclude assumptions from being used to simplify their own ephemeral values (those which contribute only to the assumption) because otherwise the assumption would prove its feeding comparison trivial and would be removed. This commit adds the plumbing and the logic for a simple masked-bit propagation (just enough to write a regression test). Future commits add more patterns (and, correspondingly, more regression tests). llvm-svn: 217342
2014-09-07 20:57:58 +02:00
const DataLayout *DL,
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
AssumptionTracker *AT,
unsigned BonusInstThreshold) {
bool Changed = false;
bool LocalChange = true;
while (LocalChange) {
LocalChange = false;
2012-07-24 12:51:42 +02:00
// Loop over all of the basic blocks and remove them if they are unneeded...
//
for (Function::iterator BBIt = F.begin(); BBIt != F.end(); ) {
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
if (SimplifyCFG(BBIt++, TTI, BonusInstThreshold, DL, AT)) {
LocalChange = true;
++NumSimpl;
}
}
Changed |= LocalChange;
}
return Changed;
}
// It is possible that we may require multiple passes over the code to fully
// simplify the CFG.
//
bool CFGSimplifyPass::runOnFunction(Function &F) {
if (skipOptnoneFunction(F))
return false;
Make use of @llvm.assume in ValueTracking (computeKnownBits, etc.) This change, which allows @llvm.assume to be used from within computeKnownBits (and other associated functions in ValueTracking), adds some (optional) parameters to computeKnownBits and friends. These functions now (optionally) take a "context" instruction pointer, an AssumptionTracker pointer, and also a DomTree pointer, and most of the changes are just to pass this new information when it is easily available from InstSimplify, InstCombine, etc. As explained below, the significant conceptual change is that known properties of a value might depend on the control-flow location of the use (because we care that the @llvm.assume dominates the use because assumptions have control-flow dependencies). This means that, when we ask if bits are known in a value, we might get different answers for different uses. The significant changes are all in ValueTracking. Two main changes: First, as with the rest of the code, new parameters need to be passed around. To make this easier, I grouped them into a structure, and I made internal static versions of the relevant functions that take this structure as a parameter. The new code does as you might expect, it looks for @llvm.assume calls that make use of the value we're trying to learn something about (often indirectly), attempts to pattern match that expression, and uses the result if successful. By making use of the AssumptionTracker, the process of finding @llvm.assume calls is not expensive. Part of the structure being passed around inside ValueTracking is a set of already-considered @llvm.assume calls. This is to prevent a query using, for example, the assume(a == b), to recurse on itself. The context and DT params are used to find applicable assumptions. An assumption needs to dominate the context instruction, or come after it deterministically. In this latter case we only handle the specific case where both the assumption and the context instruction are in the same block, and we need to exclude assumptions from being used to simplify their own ephemeral values (those which contribute only to the assumption) because otherwise the assumption would prove its feeding comparison trivial and would be removed. This commit adds the plumbing and the logic for a simple masked-bit propagation (just enough to write a regression test). Future commits add more patterns (and, correspondingly, more regression tests). llvm-svn: 217342
2014-09-07 20:57:58 +02:00
AssumptionTracker *AT = &getAnalysis<AssumptionTracker>();
const TargetTransformInfo &TTI = getAnalysis<TargetTransformInfo>();
DataLayoutPass *DLP = getAnalysisIfAvailable<DataLayoutPass>();
const DataLayout *DL = DLP ? &DLP->getDataLayout() : nullptr;
bool EverChanged = removeUnreachableBlocks(F);
EverChanged |= mergeEmptyReturnBlocks(F);
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
EverChanged |= iterativelySimplifyCFG(F, TTI, DL, AT, BonusInstThreshold);
// If neither pass changed anything, we're done.
if (!EverChanged) return false;
// iterativelySimplifyCFG can (rarely) make some loops dead. If this happens,
// removeUnreachableBlocks is needed to nuke them, which means we should
// iterate between the two optimizations. We structure the code like this to
// avoid reruning iterativelySimplifyCFG if the second pass of
// removeUnreachableBlocks doesn't do anything.
if (!removeUnreachableBlocks(F))
return true;
do {
[SimplifyCFG] threshold for folding branches with common destination Summary: This patch adds a threshold that controls the number of bonus instructions allowed for folding branches with common destination. The original code allows at most one bonus instruction. With this patch, users can customize the threshold to allow multiple bonus instructions. The default threshold is still 1, so that the code behaves the same as before when users do not specify this threshold. The motivation of this change is that tuning this threshold significantly (up to 25%) improves the performance of some CUDA programs in our internal code base. In general, branch instructions are very expensive for GPU programs. Therefore, it is sometimes worth trading more arithmetic computation for a more straightened control flow. Here's a reduced example: __global__ void foo(int a, int b, int c, int d, int e, int n, const int *input, int *output) { int sum = 0; for (int i = 0; i < n; ++i) sum += (((i ^ a) > b) && (((i | c ) ^ d) > e)) ? 0 : input[i]; *output = sum; } The select statement in the loop body translates to two branch instructions "if ((i ^ a) > b)" and "if (((i | c) ^ d) > e)" which share a common destination. With the default threshold, SimplifyCFG is unable to fold them, because computing the condition of the second branch "(i | c) ^ d > e" requires two bonus instructions. With the threshold increased, SimplifyCFG can fold the two branches so that the loop body contains only one branch, making the code conceptually look like: sum += (((i ^ a) > b) & (((i | c ) ^ d) > e)) ? 0 : input[i]; Increasing the threshold significantly improves the performance of this particular example. In the configuration where both conditions are guaranteed to be true, increasing the threshold from 1 to 2 improves the performance by 18.24%. Even in the configuration where the first condition is false and the second condition is true, which favors shortcuts, increasing the threshold from 1 to 2 still improves the performance by 4.35%. We are still looking for a good threshold and maybe a better cost model than just counting the number of bonus instructions. However, according to the above numbers, we think it is at least worth adding a threshold to enable more experiments and tuning. Let me know what you think. Thanks! Test Plan: Added one test case to check the threshold is in effect Reviewers: nadav, eliben, meheff, resistor, hfinkel Reviewed By: hfinkel Subscribers: hfinkel, llvm-commits Differential Revision: http://reviews.llvm.org/D5529 llvm-svn: 218711
2014-10-01 00:23:38 +02:00
EverChanged = iterativelySimplifyCFG(F, TTI, DL, AT, BonusInstThreshold);
EverChanged |= removeUnreachableBlocks(F);
} while (EverChanged);
return true;
}