[llvm-exegesis] Loop unrolling for loop snippet repetitor mode

I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.

Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error:           ''
info:            ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...

```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.

Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?

That's not great.
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

So if we merge the previous two approaches, duplicating this single-instruction snippet 1000x
(loop-body-size / instruction count in snippet) and running a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to ~5.98 instructions/cycle (1 / 0.167087), which finally tells us that this idiom is zero-cycle!

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D102522
Roman Lebedev 2021-05-25 11:48:43 +03:00
parent a85df00b51
commit 5d534d8259
8 changed files with 69 additions and 27 deletions

View File

@ -189,7 +189,8 @@ OPTIONS
`latency` mode can make use of either RDTSC or LBR.
`latency[LBR]` is only available on X86 (at least `Skylake`).
To run in `latency` mode, a positive value must be specified for `x86-lbr-sample-period` and `--repetition-mode=loop`.
To run in `latency` mode, a positive value must be specified
for `x86-lbr-sample-period` and `--repetition-mode=loop`.
In `analysis` mode, you also need to specify at least one of the
`-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`.
@ -202,23 +203,36 @@ OPTIONS
On choosing the "right" sampling period, a small value is preferred, but throttling
could occur if the sampling is too frequent. A prime number should be used to
avoid consistently skipping certain blocks.
.. option:: -repetition-mode=[duplicate|loop|min]
Specify the repetition mode. `duplicate` will create a large, straight line
basic block with `num-repetitions` copies of the snippet. `loop` will wrap
the snippet in a loop which will be run `num-repetitions` times. The `loop`
mode tends to better hide the effects of the CPU frontend on architectures
basic block with `num-repetitions` instructions (repeating the snippet
`num-repetitions`/`snippet size` times). `loop` will, optionally, duplicate the
snippet until the loop body contains at least `loop-body-size` instructions,
and then wrap the result in a loop which will execute `num-repetitions`
instructions (thus, again, repeating the snippet
`num-repetitions`/`snippet size` times). The `loop` mode, especially with loop
unrolling, tends to better hide the effects of the CPU frontend on architectures
that cache decoded instructions, but consumes a register for counting
iterations. If performing an analysis over many opcodes, it may be best
to instead use the `min` mode, which will run each other mode, and produce
the minimal measured result.
iterations. If performing an analysis over many opcodes, it may be best to
instead use the `min` mode, which will run each other mode,
and produce the minimal measured result.
.. option:: -num-repetitions=<Number of repetitions>
Specify the number of repetitions of the asm snippet.
Specify the target number of executed instructions. Note that the actual
repetition count of the snippet will be `num-repetitions`/`snippet size`.
Higher values lead to more accurate measurements but lengthen the benchmark.
.. option:: -loop-body-size=<Preferred loop body size>
Only effective for `-repetition-mode=[loop|min]`.
Instead of looping over the snippet directly, first duplicate it so that the
loop body contains at least this many instructions. This potentially results
in the loop body being cached in the CPU Op Cache / Loop Cache,
which may have higher throughput than the CPU decoders.
.. option:: -max-configs-per-opcode=<value>
Specify the maximum configurations that can be generated for each opcode.

View File

@ -67,7 +67,7 @@ struct InstructionBenchmark {
const MCInst &keyInstruction() const { return Key.Instructions[0]; }
// The number of instructions inside the repeated snippet. For example, if a
// snippet of 3 instructions is repeated 4 times, this is 12.
int NumRepetitions = 0;
unsigned NumRepetitions = 0;
enum RepetitionModeE { Duplicate, Loop, AggregateMin };
// Note that measurements are per instruction.
std::vector<BenchmarkMeasure> Measurements;

View File

@ -133,7 +133,7 @@ private:
} // namespace
Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
const BenchmarkCode &BC, unsigned NumRepetitions,
const BenchmarkCode &BC, unsigned NumRepetitions, unsigned LoopBodySize,
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const {
InstructionBenchmark InstrBenchmark;
@ -168,14 +168,16 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
// Assemble at least kMinInstructionsForSnippet instructions by repeating
// the snippet for debug/analysis. This is so that the user clearly
// understands that the inside instructions are repeated.
constexpr const int kMinInstructionsForSnippet = 16;
const int MinInstructionsForSnippet = 4 * Instructions.size();
const int LoopBodySizeForSnippet = 2 * Instructions.size();
{
SmallString<0> Buffer;
raw_svector_ostream OS(Buffer);
if (Error E = assembleToStream(
State.getExegesisTarget(), State.createTargetMachine(),
BC.LiveIns, BC.Key.RegisterInitialValues,
Repetitor->Repeat(Instructions, kMinInstructionsForSnippet),
Repetitor->Repeat(Instructions, MinInstructionsForSnippet,
LoopBodySizeForSnippet),
OS)) {
return std::move(E);
}
@ -187,8 +189,8 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
// Assemble enough repetitions of the snippet to execute NumRepetitions
// instructions for measurements.
const auto Filler =
Repetitor->Repeat(Instructions, InstrBenchmark.NumRepetitions);
const auto Filler = Repetitor->Repeat(
Instructions, InstrBenchmark.NumRepetitions, LoopBodySize);
object::OwningBinary<object::ObjectFile> ObjectFile;
if (DumpObjectToDisk) {

View File

@ -41,6 +41,7 @@ public:
Expected<InstructionBenchmark>
runConfiguration(const BenchmarkCode &Configuration, unsigned NumRepetitions,
unsigned LoopBodySize,
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const;

View File

@ -11,6 +11,7 @@
#include "SnippetRepetitor.h"
#include "Target.h"
#include "llvm/ADT/Sequence.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
@ -24,8 +25,8 @@ public:
// Repeats the snippet until there are at least MinInstructions in the
// resulting code.
FillFunction Repeat(ArrayRef<MCInst> Instructions,
unsigned MinInstructions) const override {
FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
unsigned LoopBodySize) const override {
return [Instructions, MinInstructions](FunctionFiller &Filler) {
auto Entry = Filler.getEntry();
if (!Instructions.empty()) {
@ -53,17 +54,26 @@ public:
State.getTargetMachine().getTargetTriple())) {}
// Loop over the (possibly unrolled) snippet
// ceil(MinInstructions / (LoopUnrollFactor * Instructions.size())) times.
FillFunction Repeat(ArrayRef<MCInst> Instructions,
unsigned MinInstructions) const override {
return [this, Instructions, MinInstructions](FunctionFiller &Filler) {
FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
unsigned LoopBodySize) const override {
return [this, Instructions, MinInstructions,
LoopBodySize](FunctionFiller &Filler) {
const auto &ET = State.getExegesisTarget();
auto Entry = Filler.getEntry();
auto Loop = Filler.addBasicBlock();
auto Exit = Filler.addBasicBlock();
const unsigned LoopUnrollFactor =
LoopBodySize <= Instructions.size()
? 1
: divideCeil(LoopBodySize, Instructions.size());
assert(LoopUnrollFactor >= 1 && "Should end up with at least 1 snippet.");
// Set loop counter to the right value:
const APInt LoopCount(32, (MinInstructions + Instructions.size() - 1) /
Instructions.size());
const APInt LoopCount(
32,
divideCeil(MinInstructions, LoopUnrollFactor * Instructions.size()));
assert(LoopCount.uge(1) && "Trip count should be at least 1.");
for (const MCInst &Inst :
ET.setRegTo(State.getSubtargetInfo(), LoopCounter, LoopCount))
Entry.addInstruction(Inst);
@ -78,7 +88,10 @@ public:
Loop.MBB->addLiveIn(Reg);
for (const auto &LiveIn : Entry.MBB->liveins())
Loop.MBB->addLiveIn(LiveIn);
Loop.addInstructions(Instructions);
for (auto _ : seq(0U, LoopUnrollFactor)) {
(void)_;
Loop.addInstructions(Instructions);
}
ET.decrementLoopCounterAndJump(*Loop.MBB, *Loop.MBB,
State.getInstrInfo());

View File

@ -39,7 +39,8 @@ public:
// Returns a functor that repeats `Instructions` so that the function executes
// at least `MinInstructions` instructions.
virtual FillFunction Repeat(ArrayRef<MCInst> Instructions,
unsigned MinInstructions) const = 0;
unsigned MinInstructions,
unsigned LoopBodySize) const = 0;
explicit SnippetRepetitor(const LLVMState &State) : State(State) {}

View File

@ -116,6 +116,13 @@ static cl::opt<unsigned>
cl::desc("number of times to repeat the asm snippet"),
cl::cat(BenchmarkOptions), cl::init(10000));
static cl::opt<unsigned>
LoopBodySize("loop-body-size",
cl::desc("when repeating the instruction snippet by looping "
"over it, duplicate the snippet until the loop body "
"contains at least this many instructions"),
cl::cat(BenchmarkOptions), cl::init(0));
static cl::opt<unsigned> MaxConfigsPerOpcode(
"max-configs-per-opcode",
cl::desc(
@ -365,7 +372,7 @@ void benchmarkMain() {
for (const BenchmarkCode &Conf : Configurations) {
InstructionBenchmark Result = ExitOnErr(Runner->runConfiguration(
Conf, NumRepetitions, Repetitors, DumpObjectToDisk));
Conf, NumRepetitions, LoopBodySize, Repetitors, DumpObjectToDisk));
ExitOnFileError(BenchmarkFile, Result.writeYaml(State, BenchmarkFile));
}
exegesis::pfm::pfmTerminate();

View File

@ -42,11 +42,13 @@ protected:
const auto Repetitor = SnippetRepetitor::Create(RepetitionMode, State);
const std::vector<MCInst> Instructions = {MCInstBuilder(X86::NOOP)};
FunctionFiller Sink(*MF, {X86::EAX});
const auto Fill = Repetitor->Repeat(Instructions, kMinInstructions);
const auto Fill =
Repetitor->Repeat(Instructions, kMinInstructions, kLoopBodySize);
Fill(Sink);
}
static constexpr const unsigned kMinInstructions = 3;
static constexpr const unsigned kLoopBodySize = 5;
std::unique_ptr<LLVMTargetMachine> TM;
std::unique_ptr<LLVMContext> Context;
@ -78,7 +80,9 @@ TEST_F(X86SnippetRepetitorTest, Loop) {
ASSERT_EQ(MF->getNumBlockIDs(), 3u);
const auto &LoopBlock = *MF->getBlockNumbered(1);
EXPECT_THAT(LoopBlock.instrs(),
ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
HasOpcode(X86::JCC_1)));
EXPECT_THAT(LoopBlock.liveins(),
UnorderedElementsAre(