mirror of
https://github.com/RPCS3/llvm-mirror.git
synced 2024-11-25 20:23:11 +01:00
5d534d8259
I really needed this, like, factually, yesterday, when verifying dependency breaking idioms for the AMD Zen 3 scheduler model.

Consider the following example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error: ''
info: ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...
```

What does it tell us? So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle? That doesn't seem right. That's even less than there are pipes supporting this type of op.

Now, second example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

Now that's just worse. Due to the looping, the throughput completely plummeted, and now we can only do a single instruction/cycle!? That's not great.

And final example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

So if we merge the previous two approaches, duplicate this single-instruction snippet 1000x (loop-body-size/instruction count in snippet), and run a loop with 1000 iterations over that duplicated/unrolled snippet, the measured throughput goes through the roof, up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D102522
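The instructions/cycle figures quoted above are simply the reciprocal of each reported `inverse_throughput` value. A minimal Python sketch of that arithmetic follows. The helper names are hypothetical and not part of llvm-exegesis, and the unroll and iteration counts are computed the way the message describes them (loop-body-size divided by the snippet's instruction count, and num-repetitions divided by loop-body-size), which is an assumption about the tool's internals rather than a statement of them.

```python
def instructions_per_cycle(inverse_throughput: float) -> float:
    """Convert cycles-per-instruction (inverse throughput) to instructions/cycle."""
    if inverse_throughput <= 0:
        raise ValueError("inverse throughput must be positive")
    return 1.0 / inverse_throughput


def loop_shape(num_repetitions: int, loop_body_size: int, snippet_instrs: int):
    """Return (snippet copies per loop body, loop iterations).

    Hypothetical helper: mirrors the description in the commit message,
    not the actual llvm-exegesis implementation.
    """
    copies = loop_body_size // snippet_instrs
    iterations = num_repetitions // loop_body_size
    return copies, iterations


# Measured values copied from the three runs quoted in the commit message.
measurements = {
    "duplicate": 0.31025,
    "loop": 1.00011,
    "loop, loop-body-size=1000": 0.167087,
}

for mode, value in measurements.items():
    print(f"{mode}: {instructions_per_cycle(value):.2f} instructions/cycle")

print(loop_shape(num_repetitions=1_000_000, loop_body_size=1000, snippet_instrs=1))
```

The third run works out to just under 6 instructions/cycle, consistent with the "up to 5.9" figure in the message.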
||
---|---|---|
bugpoint | ||
bugpoint-passes | ||
dsymutil | ||
gold | ||
llc | ||
lli | ||
llvm-ar | ||
llvm-as | ||
llvm-as-fuzzer | ||
llvm-bcanalyzer | ||
llvm-c-test | ||
llvm-cat | ||
llvm-cfi-verify | ||
llvm-config | ||
llvm-cov | ||
llvm-cvtres | ||
llvm-cxxdump | ||
llvm-cxxfilt | ||
llvm-cxxmap | ||
llvm-diff | ||
llvm-dis | ||
llvm-dwarfdump | ||
llvm-dwp | ||
llvm-elfabi | ||
llvm-exegesis | ||
llvm-extract | ||
llvm-go | ||
llvm-gsymutil | ||
llvm-ifs | ||
llvm-isel-fuzzer | ||
llvm-itanium-demangle-fuzzer | ||
llvm-jitlink | ||
llvm-jitlistener | ||
llvm-libtool-darwin | ||
llvm-link | ||
llvm-lipo | ||
llvm-lto | ||
llvm-lto2 | ||
llvm-mc | ||
llvm-mc-assemble-fuzzer | ||
llvm-mc-disassemble-fuzzer | ||
llvm-mca | ||
llvm-microsoft-demangle-fuzzer | ||
llvm-ml | ||
llvm-modextract | ||
llvm-mt | ||
llvm-nm | ||
llvm-objcopy | ||
llvm-objdump | ||
llvm-opt-fuzzer | ||
llvm-opt-report | ||
llvm-pdbutil | ||
llvm-profdata | ||
llvm-profgen | ||
llvm-rc | ||
llvm-readobj | ||
llvm-reduce | ||
llvm-rtdyld | ||
llvm-rust-demangle-fuzzer | ||
llvm-shlib | ||
llvm-size | ||
llvm-special-case-list-fuzzer | ||
llvm-split | ||
llvm-stress | ||
llvm-strings | ||
llvm-symbolizer | ||
llvm-undname | ||
llvm-xray | ||
llvm-yaml-numeric-parser-fuzzer | ||
llvm-yaml-parser-fuzzer | ||
lto | ||
msbuild | ||
obj2yaml | ||
opt | ||
opt-viewer | ||
remarks-shlib | ||
sancov | ||
sanstats | ||
split-file | ||
verify-uselistorder | ||
vfabi-demangle-fuzzer | ||
xcode-toolchain | ||
yaml2obj | ||
CMakeLists.txt |