mirror of
https://github.com/RPCS3/llvm-mirror.git
synced 2025-01-31 20:51:52 +01:00
17 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Roman Lebedev
|
db63a47135 |
[InstCombine] Take 3: Perform trivial PHI CSE
The original take 1 was 6102310d814ad73eab60a88b21dd70874f7a056f, which taught InstSimplify to do that, which seemed better at the time, since we got EarlyCSE support for free.

However, it was proven that we can not do that there: the simplified-to PHI would not be reachable from the original PHI, and that is not something InstSimplify is allowed to do, as noted in the commit ed90f15efb40d26b5d3ead3bb8e9e284218e0186 that reverted it:
> It appears to cause compilation non-determinism and caused stage3 mismatches.

Then there was take 2, 3e69871ab5a66fb55913a2a2f5e7f5b42899a4c9, which was InstCombine-specific, but it again showed stage2-stage3 differences, and was reverted in bdaa3f86a040b138c58de41d73d35b76fdec1380. This is quite alarming.

Here, let's try to change how we find an existing PHI candidate: due to the worklist order, and the way PHI nodes are inserted (a new one may be inserted as the first one, or maybe not), let's look at *all* PHI nodes in the block.

Effects on vanilla llvm test-suite + RawSpeed:
```
| statistic name                                     | baseline  | proposed  |      Δ |        % |    \|%\| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| asm-printer.EmittedInsts                           | 7942329   | 7942457   |    128 |    0.00% |    0.00% |
| assembler.ObjectBytes                              | 254295632 | 254312480 |  16848 |    0.01% |    0.01% |
| correlated-value-propagation.NumPhis               | 18412     | 18347     |    -65 |   -0.35% |    0.35% |
| early-cse.NumCSE                                   | 2183283   | 2183267   |    -16 |    0.00% |    0.00% |
| early-cse.NumSimplify                              | 550105    | 541842    |  -8263 |   -1.50% |    1.50% |
| instcombine.NumAggregateReconstructionsSimplified  | 73        | 4506      |   4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined                            | 3640311   | 3644419   |   4108 |    0.11% |    0.11% |
| instcombine.NumDeadInst                            | 1778204   | 1783205   |   5001 |    0.28% |    0.28% |
| instcombine.NumPHICSEs                             | 0         | 22490     |  22490 |    0.00% |    0.00% |
| instcombine.NumWorklistIterations                  | 2023272   | 2024400   |   1128 |    0.06% |    0.06% |
| instcount.NumCallInst                              | 1758395   | 1758802   |    407 |    0.02% |    0.02% |
| instcount.NumInvokeInst                            | 59478     | 59502     |     24 |    0.04% |    0.04% |
| instcount.NumPHIInst                               | 330557    | 330545    |    -12 |    0.00% |    0.00% |
| instcount.TotalBlocks                              | 1077138   | 1077220   |     82 |    0.01% |    0.01% |
| instcount.TotalFuncs                               | 101442    | 101441    |     -1 |    0.00% |    0.00% |
| instcount.TotalInsts                               | 8831946   | 8832606   |    660 |    0.01% |    0.01% |
| simplifycfg.NumHoistCommonCode                     | 24186     | 24187     |      1 |    0.00% |    0.00% |
| simplifycfg.NumInvokes                             | 4300      | 4410      |    110 |    2.56% |    2.56% |
| simplifycfg.NumSimpl                               | 1019813   | 999767    | -20046 |   -1.97% |    1.97% |
```
So it fires 22490 times, which is less than the ~24k that take 1 did, but more than what take 2 did (22228 times).

It allows foldAggregateConstructionIntoAggregateReuse() to actually work after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG would have done this PHI CSE, of all places. Additionally, it allows some more `invoke`->`call` folds to happen (+110, +2.56%).

All in all, expectedly, this catches fewer things overall, but all the motivational cases are still caught, so all good. |
||
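The idea of scanning every PHI node in the block for an equivalent one can be sketched outside of LLVM. The following is a minimal toy model, not the actual InstCombine code: `ToyPhi`, `sameIncoming`, and `findEquivalentPhi` are hypothetical names, values and blocks are plain strings, and the order-insensitive comparison is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a PHI node: a name plus (predecessor block, incoming value)
// pairs. Hypothetical type, not an LLVM class.
struct ToyPhi {
  std::string Name;
  std::vector<std::pair<std::string, std::string>> Incoming;
};

// Two PHIs of the same block are treated as equivalent when they carry the
// same multiset of (predecessor, value) pairs, regardless of operand order.
bool sameIncoming(const ToyPhi &A, const ToyPhi &B) {
  auto SA = A.Incoming, SB = B.Incoming;
  std::sort(SA.begin(), SA.end());
  std::sort(SB.begin(), SB.end());
  return SA == SB;
}

// Look at *all* PHIs of the block, not only those before Idx, and return the
// index of the first other PHI equivalent to BlockPhis[Idx], or -1 if none.
int findEquivalentPhi(const std::vector<ToyPhi> &BlockPhis, std::size_t Idx) {
  assert(Idx < BlockPhis.size() && "index out of range");
  for (std::size_t I = 0; I < BlockPhis.size(); ++I)
    if (I != Idx && sameIncoming(BlockPhis[I], BlockPhis[Idx]))
      return static_cast<int>(I);
  return -1;
}
```

Because the scan does not depend on where in the block the candidate sits, the result for a given PHI is the same whether a duplicate was inserted before or after it.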
Roman Lebedev
|
96e421f745 |
Revert "[InstCombine] Take 2: Perform trivial PHI CSE"
While the original variant, which did this in InstSimplify, (rightfully) raised questions and was ultimately found to be the culprit of a stage2-stage3 mismatch, it was expected that the InstCombine-based implementation would be fine. But apparently it's not, as http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24095/steps/compare-compilers/logs/stdio suggests. That in turn suggests that somewhere in InstCombine there is a loop over a nondeterministically ordered container, which causes different worklist ordering. This reverts commit 3e69871ab5a66fb55913a2a2f5e7f5b42899a4c9. |
||
Roman Lebedev
|
2bf651df9f |
[InstCombine] Take 2: Perform trivial PHI CSE
The original take was 6102310d814ad73eab60a88b21dd70874f7a056f, which taught InstSimplify to do that, which seemed better at time, since we got EarlyCSE support for free. However, it was proven that we can not do that there, the simplified-to PHI would not be reachable from the original PHI, and that is not something InstSimplify is allowed to do, as noted in the commit ed90f15efb40d26b5d3ead3bb8e9e284218e0186 that reverted it : > It appears to cause compilation non-determinism and caused stage3 mismatches. However InstCombine already does many different optimizations, so it should be a safe place to do it here. Note that we still can't just compare incoming values ranges, because there is no guarantee that these PHI's we'd simplify to were already re-visited and sorted. However coming up with a test is problematic. Effects on vanilla llvm test-suite + RawSpeed: ``` | statistic name | baseline | proposed | Δ | % | |%| | |----------------------------------------------------|-----------|-----------|-------:|---------:|---------:| | instcombine.NumPHICSEs | 0 | 22228 | 22228 | 0.00% | 0.00% | | asm-printer.EmittedInsts | 7942329 | 7942456 | 127 | 0.00% | 0.00% | | assembler.ObjectBytes | 254295632 | 254313792 | 18160 | 0.01% | 0.01% | | early-cse.NumCSE | 2183283 | 2183272 | -11 | 0.00% | 0.00% | | early-cse.NumSimplify | 550105 | 541842 | -8263 | -1.50% | 1.50% | | instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% | | instcombine.NumCombined | 3640311 | 3666911 | 26600 | 0.73% | 0.73% | | instcombine.NumDeadInst | 1778204 | 1783318 | 5114 | 0.29% | 0.29% | | instcount.NumCallInst | 1758395 | 1758804 | 409 | 0.02% | 0.02% | | instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% | | instcount.NumPHIInst | 330557 | 330549 | -8 | 0.00% | 0.00% | | instcount.TotalBlocks | 1077138 | 1077221 | 83 | 0.01% | 0.01% | | instcount.TotalFuncs | 101442 | 101441 | -1 | 0.00% | 0.00% | | instcount.TotalInsts | 8831946 | 8832611 | 665 
| 0.01% | 0.01% | | simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% | | simplifycfg.NumSimpl | 1019813 | 999740 | -20073 | -1.97% | 1.97% | ``` So it fires ~22k times, which is less than ~24k the take 1 did. It allows foldAggregateConstructionIntoAggregateReuse() to actually work after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG would have done this PHI CSE, of all places. Additionally, allows some more `invoke`->`call` folds to happen (+110, +2.56%). All in all, expectedly, this catches less things overall, but all the motivational cases are still caught, so all good. |
||
Owen Anderson
|
df34423d50 |
Revert "[InstSimplify][EarlyCSE] Try to CSE PHI nodes in the same basic block"
This reverts commit 6102310d814ad73eab60a88b21dd70874f7a056f. It appears to cause compilation non-determinism and caused stage3 mismatches. |
||
Roman Lebedev
|
2088bfe3c4 |
[InstSimplify][EarlyCSE] Try to CSE PHI nodes in the same basic block
Apparently, we don't do this, neither in EarlyCSE, nor in InstSimplify, nor in (old) GVN, but do in NewGVN and SimplifyCFG of all places.. While i could teach EarlyCSE how to hash PHI nodes, we can't really do much (anything?) even if we find two identical PHI nodes in different basic blocks, same-BB case is the interesting one, and if we teach InstSimplify about it (which is what i wanted originally, https://reviews.llvm.org/D86530), we get EarlyCSE support for free. So i would think this is pretty uncontroversial. On vanilla llvm test-suite + RawSpeed, this has the following effects: ``` | statistic name | baseline | proposed | Δ | % | \|%\| | |----------------------------------------------------|-----------|-----------|-------:|---------:|---------:| | instsimplify.NumPHICSE | 0 | 23779 | 23779 | 0.00% | 0.00% | | asm-printer.EmittedInsts | 7942328 | 7942392 | 64 | 0.00% | 0.00% | | assembler.ObjectBytes | 273069192 | 273084704 | 15512 | 0.01% | 0.01% | | correlated-value-propagation.NumPhis | 18412 | 18539 | 127 | 0.69% | 0.69% | | early-cse.NumCSE | 2183283 | 2183227 | -56 | 0.00% | 0.00% | | early-cse.NumSimplify | 550105 | 542090 | -8015 | -1.46% | 1.46% | | instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% | | instcombine.NumCombined | 3640264 | 3664769 | 24505 | 0.67% | 0.67% | | instcombine.NumDeadInst | 1778193 | 1783183 | 4990 | 0.28% | 0.28% | | instcount.NumCallInst | 1758401 | 1758799 | 398 | 0.02% | 0.02% | | instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% | | instcount.NumPHIInst | 330557 | 330533 | -24 | -0.01% | 0.01% | | instcount.TotalInsts | 8831952 | 8832286 | 334 | 0.00% | 0.00% | | simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% | | simplifycfg.NumSimpl | 1019808 | 999607 | -20201 | -1.98% | 1.98% | ``` I.e. it fires ~24k times, causes +110 (+2.56%) more `invoke` -> `call` transforms, and counter-intuitively results in *more* instructions total. 
That being said, the PHI count doesn't decrease that much, and looking at some examples, it seems at least some of them were previously getting PHI CSE'd in SimplifyCFG of all places.. I'm adjusting `Instruction::isIdenticalToWhenDefined()` at the same time. As a comment in `InstCombinerImpl::visitPHINode()` already stated, there are no guarantees on the ordering of the operands of a PHI node, so if we just naively compare them, we may false-negatively say that the nodes are not equal when the only difference is operand order, which is especially important since the fold is in InstSimplify, so we can't rely on InstCombine sorting them beforehand. Fixing this for the general case is costly (geomean +0.02%), and does not appear to catch anything in test-suite, but for the same-BB case, it's trivial, so let's fix at least that. As per http://llvm-compile-time-tracker.com/compare.php?from=04879086b44348cad600a0a1ccbe1f7776cc3cf9&to=82bdedb888b945df1e9f130dd3ac4dd3c96e2925&stat=instructions this appears to cause geomean +0.03% compile time increase (regression), but geomean -0.01%..-0.04% code size decrease (improvement). |
||
Roman Lebedev
|
b6a0e78067 |
[Value][InstCombine] Fix one-use checks in PHI-of-op -> Op-of-PHI[s] transforms to be one-user checks
As the FIXME said, they really should be checking for a single user, not a single use, so let's do that. It is not *that* unusual to have the same value as an incoming value in a PHI node more than once, not unlike how a PHI may have the same incoming basic block more than once. There isn't a nice way to do that: Value::users() isn't uniquified, and Value only tracks its uses, not its users, so the check is potentially costly, since it may involve traversing the entire use list of a value. |
||
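The difference between a one-use check and a one-user check can be illustrated with a toy use list. This is only a sketch under stated assumptions: `UseList`, `hasOneUse`, and `hasOneUser` here are hypothetical free functions over a vector of user IDs, not the LLVM `Value` API.

```cpp
#include <vector>

// Toy use list: one entry per use, holding the ID of the user that owns it.
// A PHI that takes the same value from two predecessors appears twice.
using UseList = std::vector<int>;

// Exactly one use.
bool hasOneUse(const UseList &Uses) { return Uses.size() == 1; }

// At least one use, and every use belongs to the same user. The list is not
// uniquified, so in the worst case the whole list has to be walked.
bool hasOneUser(const UseList &Uses) {
  if (Uses.empty())
    return false;
  for (int User : Uses)
    if (User != Uses.front())
      return false;
  return true;
}
```

With the use list `{7, 7}` (a single PHI with ID 7 using the value twice), `hasOneUse` is false while `hasOneUser` is true, which is the case the one-use check used to reject.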
Roman Lebedev
|
d6cd909866 |
[InstCombine] PHI-of-extractvalues -> extractvalue-of-PHI, aka invokes are bad
While since D86306 we do it's sibling fold for `insertvalue`, we should also do this for `extractvalue`'s. And unlike that one, the results here are, quite honestly, shocking, as it can be observed here on vanilla llvm test-suite + RawSpeed results: ``` | statistic name | baseline | proposed | Δ | % | |%| | |----------------------------------------------------|-----------|-----------|--------:|--------:|-------:| | asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% | | assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% | | early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% | | early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% | | instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% | | instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% | | instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% | | instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% | | instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% | | instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% | | instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% | | instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% | | instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% | | instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% | | instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% | | instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% | | instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% | | instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% | | instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% | | instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% | | instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% | | instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% | | 
simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% | | simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% | | simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% | | simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% | | simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% | | simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% | ``` ... which tells us that this new fold fires whopping 38k times, increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time), decreasing total instruction count by -0.64% (-56454), and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less) and `extractvalue`'s (-74.83%, i.e. four times less). This causes geomean -0.01% binary size decrease http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform `InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold, which is supposed to be dealing with such aggregate reconstructions, fires a lot less now. There are two reasons why: 1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes. We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun. 2. But then, EarlyCSE not only manages to fold such redundant PHI's, it also sees that the extract-insert chain recreates the original aggregate, and replaces it with the original aggregate. The take-aways are 1. 
We maybe should do most trivial, same-BB PHI CSE in InstCombine 2. I need to check if what other patterns remain, and how they can be resolved. (i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away) This is a reland of the original commit fcb51d8c2460faa23b71e06abb7e826243887dd6, because originally i forgot to ensure that the base aggregate types match. Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D86530 |
||
Roman Lebedev
|
a2f15b5ff5 |
Revert "[InstCombine] PHI-of-extractvalues -> extractvalue-of-PHI, aka invokes are bad"
This reverts commit fcb51d8c2460faa23b71e06abb7e826243887dd6. As buildbots report, there's apparently some missing check to ensure that the types of incoming values match the type of PHI. Let's revert for a moment. |
||
Roman Lebedev
|
5ec7b497bf |
[InstCombine] PHI-of-extractvalues -> extractvalue-of-PHI, aka invokes are bad
While since D86306 we do it's sibling fold for `insertvalue`, we should also do this for `extractvalue`'s. And unlike that one, the results here are, quite honestly, shocking, as it can be observed here on vanilla llvm test-suite + RawSpeed results: ``` | statistic name | baseline | proposed | Δ | % | |%| | |----------------------------------------------------|-----------|-----------|--------:|--------:|-------:| | asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% | | assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% | | early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% | | early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% | | instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% | | instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% | | instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% | | instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% | | instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% | | instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% | | instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% | | instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% | | instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% | | instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% | | instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% | | instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% | | instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% | | instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% | | instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% | | instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% | | instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% | | instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% | | 
simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% | | simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% | | simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% | | simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% | | simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% | | simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% | ``` ... which tells us that this new fold fires whopping 38k times, increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time), decreasing total instruction count by -0.64% (-56454), and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less) and `extractvalue`'s (-74.83%, i.e. four times less). This causes geomean -0.01% binary size decrease http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform `InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold, which is supposed to be dealing with such aggregate reconstructions, fires a lot less now. There are two reasons why: 1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes. We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun. 2. But then, EarlyCSE not only manages to fold such redundant PHI's, it also sees that the extract-insert chain recreates the original aggregate, and replaces it with the original aggregate. The take-aways are 1. 
We maybe should do most trivial, same-BB PHI CSE in InstCombine 2. I need to check if what other patterns remain, and how they can be resolved. (i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away) Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D86530 |
||
Roman Lebedev
|
ed8ecc651f |
[InstCombine] PHI-of-insertvalues -> insertvalue-of-PHI's
As per the statistics, this happens exceedingly rarely, but i have seen it in exactly the situations the Phi-aware aggregate reconstruction would have handled, eventually, and allowed invoke -> call fold later on.

So while this might be something that other fold will have to learn about, i believe we should be doing this transform in general.

Here, we are okay with adding two PHI's to get both the base aggregate, and the inserted value. I'm not sure it makes much sense to restrict it to a single phi (to just the inserted value?), because originally we'd be receiving the final aggregate already.

llvm test-suite + RawSpeed:
```
| statistic name                             | baseline  | proposed  |    Δ |      % | \|%\| |
|--------------------------------------------|-----------|-----------|-----:|-------:|------:|
| instcombine.NumPHIsOfInsertValues          | 0         | 12        |   12 |  0.00% | 0.00% |
| asm-printer.EmittedInsts                   | 8926643   | 8926595   |  -48 |  0.00% | 0.00% |
| instcombine.NumCombined                    | 3846614   | 3846640   |   26 |  0.00% | 0.00% |
| instcombine.NumConstProp                   | 24302     | 24293     |   -9 | -0.04% | 0.04% |
| instcombine.NumDeadInst                    | 1620140   | 1620112   |  -28 |  0.00% | 0.00% |
| instcount.NumBrInst                        | 898466    | 898464    |   -2 |  0.00% | 0.00% |
| instcount.NumCallInst                      | 1760819   | 1760875   |   56 |  0.00% | 0.00% |
| instcount.NumExtractValueInst              | 45659     | 45649     |  -10 | -0.02% | 0.02% |
| instcount.NumInsertValueInst               | 4991      | 4981      |  -10 | -0.20% | 0.20% |
| instcount.NumIntToPtrInst                  | 27084     | 27087     |    3 |  0.01% | 0.01% |
| instcount.NumPHIInst                       | 371435    | 371429    |   -6 |  0.00% | 0.00% |
| instcount.NumStoreInst                     | 906011    | 906019    |    8 |  0.00% | 0.00% |
| instcount.TotalBlocks                      | 1105520   | 1105518   |   -2 |  0.00% | 0.00% |
| instcount.TotalInsts                       | 9795737   | 9795776   |   39 |  0.00% | 0.00% |
| simplifycfg.NumInvokes                     | 2784      | 2786      |    2 |  0.07% | 0.07% |
| simplifycfg.NumSimpl                       | 1001840   | 1001850   |   10 |  0.00% | 0.00% |
| simplifycfg.NumSinkCommonInstrs            | 15174     | 15170     |   -4 | -0.03% | 0.03% |
```
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86306 |
||
Roman Lebedev
|
c0a69dfec4 |
[NFC][InstCombine] Tests for PHI-of-insertvalue's
Currently we don't do anything about these, neither in InstCombine, nor in SimplifyCFG's sinking. These happen exceedingly rarely, but i've seen them in the cases where PHI-aware aggregate reconstruction would have fired if not for them. |
||
Roman Lebedev
|
2083389218 |
[InstCombine] PHI-aware aggregate reconstruction: properly handle duplicate predecessors
While it may seem like we can just "deduplicate" the case where some basic block happens to be a predecessor more than once, which happens for e.g. switches, that is not the correct thing to do. We must actually add a PHI operand for each predecessor. This was initially reported to me by David Major as a clang crash during a gecko build for android. |
||
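The rule that a PHI needs one operand per predecessor edge, even when the same block is a predecessor several times, can be sketched with a toy model. The names below (`Incoming`, `buildPhiOperands`) are hypothetical, and blocks and values are plain strings rather than LLVM objects.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One (predecessor block, incoming value) pair per PHI operand.
using Incoming = std::vector<std::pair<std::string, std::string>>;

// PredEdges lists the predecessor of every incoming edge, so a block that
// reaches the merge point twice (e.g. via two switch cases) is listed twice.
// The result has exactly one operand per edge; nothing is deduplicated.
Incoming buildPhiOperands(const std::vector<std::string> &PredEdges,
                          const std::map<std::string, std::string> &ValueForPred) {
  Incoming Ops;
  for (const std::string &Pred : PredEdges) {
    auto It = ValueForPred.find(Pred);
    assert(It != ValueForPred.end() && "every predecessor needs a value");
    Ops.emplace_back(Pred, It->second);
  }
  return Ops;
}
```

For the edges `sw, sw, bb1` this yields three operands, two of them identical, instead of the two operands a deduplicating loop would produce.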
Roman Lebedev
|
c576a73397 |
[InstCombine] PHI-aware aggregate reconstruction: correctly detect "use" basic block
While the original implementation added in D85787 / ae7f08812e0995481eb345cecc5dd4529829ba44 is not incorrect, it is known to be suboptimal. In particular, while it is not incorrect to use the basic block in which the original `insertvalue` instruction is located as the merge point, that is not necessarily optimal, as `@test6` shows. We should look at all the AggElts, and, if they are all defined in the same basic block, then that is the basic block we should use. On the RawSpeed library, this catches +4% (+50) more cases. On the vanilla LLVM test-suite, this catches +12% (+92) more cases. |
||
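The "use" block selection described above amounts to checking whether all aggregate elements are defined in one block. A minimal sketch, assuming each element is represented only by the name of its defining block; `pickUseBlock` is a hypothetical helper, and what the real fold does when the blocks differ is not modeled here (the sketch just reports failure).

```cpp
#include <string>
#include <vector>

// EltDefBlocks[I] names the basic block that defines element I of the
// aggregate. If all elements share one defining block, store it in UseBB and
// return true; otherwise leave UseBB untouched and return false.
bool pickUseBlock(const std::vector<std::string> &EltDefBlocks,
                  std::string &UseBB) {
  if (EltDefBlocks.empty())
    return false;
  for (const std::string &BB : EltDefBlocks)
    if (BB != EltDefBlocks.front())
      return false;
  UseBB = EltDefBlocks.front();
  return true;
}
```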
Roman Lebedev
|
39d81cb3fd |
[NFC][InstCombine] PHI-aware aggregate reconstruction: insert PHI node manually
This is NFC at the moment, because right now we always insert the PHI into the same basic block in which the original `insertvalue` instruction is, but that will change. Also, fixes addition of the suffix to the value names. |
||
Roman Lebedev
|
a0a05b7ce0 |
[NFC][InstCombine] Add more tests for aggregate reconstruction w/ PHI handling
Even without handling several layers of PHI nodes, we can handle more cases, as `@test6` shows. |
||
Roman Lebedev
|
e44f2bc0d5 |
[InstCombine] Aggregate reconstruction simplification (PR47060)
This pattern happens in clang C++ exception lowering code, on unwind branch. We end up having a `landingpad` block after each `invoke`, where RAII cleanup is performed, and the elements of an aggregate `{i8*, i32}` holding exception info are `extractvalue`'d, and we then branch to common block that takes extracted `i8*` and `i32` elements (via `phi` nodes), form a new aggregate, and finally `resume`'s the exception. The problem is that, if the cleanup block is effectively empty, it shouldn't be there, there shouldn't be that `landingpad` and `resume`, said `invoke` should be a `call`. Indeed, we do that simplification in e.g. SimplifyCFG `SimplifyCFGOpt::simplifyResume()`. But the thing is, all this extra `extractvalue` + `phi` + `insertvalue` cruft, while it is pointless, does not look like "empty cleanup block". So the `SimplifyCFGOpt::simplifyResume()` fails, and the exception is has higher cost than it could have on unwind branch :S This doesn't happen *that* often, but it will basically happen once per C++ function with complex CFG that called more than one other function that isn't known to be `nounwind`. I think, this is a missing fold in InstCombine, so i've implemented it. I think, the algorithm/implementation is rather self-explanatory: 1. Find a chain of `insertvalue`'s that fully tell us the initializer of the aggregate. 2. For each element, try to find from which aggregate it was extracted. If it was extracted from the aggregate with identical type, from identical element index, great. 3. If all elements were found to have been extracted from the same aggregate, then we can just use said original source aggregate directly, instead of re-creating it. 4. If we fail to find said aggregate when looking only in the current block, we need be PHI-aware - we might have different source aggregate when coming from each predecessor. I'm not sure if this already handles everything, and there are some FIXME's, i'll deal with all that later in followups. 
I'd be fine with going with post-commit review here code-wise, but just in case there are thoughts, i'm posting this. On RawSpeed, for example, this has the following effect: ``` | statistic name | baseline | proposed | Δ | % | abs(%) | |---------------------------------------------------|---------:|---------:|------:|--------:|-------:| | instcombine.NumAggregateReconstructionsSimplified | 0 | 1253 | 1253 | 0.00% | 0.00% | | simplifycfg.NumInvokes | 948 | 1355 | 407 | 42.93% | 42.93% | | instcount.NumInsertValueInst | 4382 | 3210 | -1172 | -26.75% | 26.75% | | simplifycfg.NumSinkCommonCode | 574 | 458 | -116 | -20.21% | 20.21% | | simplifycfg.NumSinkCommonInstrs | 1154 | 921 | -233 | -20.19% | 20.19% | | instcount.NumExtractValueInst | 29017 | 26397 | -2620 | -9.03% | 9.03% | | instcombine.NumDeadInst | 166618 | 174705 | 8087 | 4.85% | 4.85% | | instcount.NumPHIInst | 51526 | 50678 | -848 | -1.65% | 1.65% | | instcount.NumLandingPadInst | 20865 | 20609 | -256 | -1.23% | 1.23% | | instcount.NumInvokeInst | 34023 | 33675 | -348 | -1.02% | 1.02% | | simplifycfg.NumSimpl | 113634 | 114708 | 1074 | 0.95% | 0.95% | | instcombine.NumSunkInst | 15030 | 14930 | -100 | -0.67% | 0.67% | | instcount.TotalBlocks | 219544 | 219024 | -520 | -0.24% | 0.24% | | instcombine.NumCombined | 644562 | 645805 | 1243 | 0.19% | 0.19% | | instcount.TotalInsts | 2139506 | 2135377 | -4129 | -0.19% | 0.19% | | instcount.NumBrInst | 156988 | 156821 | -167 | -0.11% | 0.11% | | instcount.NumCallInst | 1206144 | 1207076 | 932 | 0.08% | 0.08% | | instcount.NumResumeInst | 5193 | 5190 | -3 | -0.06% | 0.06% | | asm-printer.EmittedInsts | 948580 | 948299 | -281 | -0.03% | 0.03% | | instcount.TotalFuncs | 11509 | 11507 | -2 | -0.02% | 0.02% | | inline.NumDeleted | 97595 | 97597 | 2 | 0.00% | 0.00% | | inline.NumInlined | 210514 | 210522 | 8 | 0.00% | 0.00% | ``` So we manage to increase the amount of `invoke` -> `call` conversions in SimplifyCFG by almost a half, and there is a very apparent decrease 
in instruction and basic block count. On vanilla llvm-test-suite: ``` | statistic name | baseline | proposed | Δ | % | abs(%) | |---------------------------------------------------|---------:|---------:|------:|--------:|-------:| | instcombine.NumAggregateReconstructionsSimplified | 0 | 744 | 744 | 0.00% | 0.00% | | instcount.NumInsertValueInst | 2705 | 2053 | -652 | -24.10% | 24.10% | | simplifycfg.NumInvokes | 1212 | 1424 | 212 | 17.49% | 17.49% | | instcount.NumExtractValueInst | 21681 | 20139 | -1542 | -7.11% | 7.11% | | simplifycfg.NumSinkCommonInstrs | 14575 | 14361 | -214 | -1.47% | 1.47% | | simplifycfg.NumSinkCommonCode | 6815 | 6743 | -72 | -1.06% | 1.06% | | instcount.NumLandingPadInst | 14851 | 14712 | -139 | -0.94% | 0.94% | | instcount.NumInvokeInst | 27510 | 27332 | -178 | -0.65% | 0.65% | | instcombine.NumDeadInst | 1438173 | 1443371 | 5198 | 0.36% | 0.36% | | instcount.NumResumeInst | 2880 | 2872 | -8 | -0.28% | 0.28% | | instcombine.NumSunkInst | 55187 | 55076 | -111 | -0.20% | 0.20% | | instcount.NumPHIInst | 321366 | 320916 | -450 | -0.14% | 0.14% | | instcount.TotalBlocks | 886816 | 886493 | -323 | -0.04% | 0.04% | | instcount.TotalInsts | 7663845 | 7661108 | -2737 | -0.04% | 0.04% | | simplifycfg.NumSimpl | 886791 | 887171 | 380 | 0.04% | 0.04% | | instcount.NumCallInst | 553552 | 553733 | 181 | 0.03% | 0.03% | | instcombine.NumCombined | 3200512 | 3201202 | 690 | 0.02% | 0.02% | | instcount.NumBrInst | 741794 | 741656 | -138 | -0.02% | 0.02% | | simplifycfg.NumHoistCommonInstrs | 14443 | 14445 | 2 | 0.01% | 0.01% | | asm-printer.EmittedInsts | 7978085 | 7977916 | -169 | 0.00% | 0.00% | | inline.NumDeleted | 73188 | 73189 | 1 | 0.00% | 0.00% | | inline.NumInlined | 291959 | 291968 | 9 | 0.00% | 0.00% | ``` Roughly similar effect, less instructions and blocks total. See also: rGe492f0e03b01a5e4ec4b6333abb02d303c3e479e. 
Compile-time wise, this appears to be roughly geomean-neutral: http://llvm-compile-time-tracker.com/compare.php?from=39617aaed95ac00957979bc1525598c1be80e85e&to=b59866cf30420da8f8e3ca239ed3bec577b23387&stat=instructions And this is a win size-wize in general: http://llvm-compile-time-tracker.com/compare.php?from=39617aaed95ac00957979bc1525598c1be80e85e&to=b59866cf30420da8f8e3ca239ed3bec577b23387&stat=size-text See https://bugs.llvm.org/show_bug.cgi?id=47060 Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D85787 |
||
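Steps 1 to 3 of the algorithm above (the single-block case, without PHI handling) can be sketched as a small standalone model. This is an illustration under assumptions, not the InstCombine implementation: `ElementOrigin` and `findReusableSource` are hypothetical names, aggregates are identified by strings, and the type-identity check is folded into the `Known` flag.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Origin of one element of the rebuilt aggregate. Known is false when the
// element was not extracted from an aggregate of identical type.
struct ElementOrigin {
  bool Known;
  std::string SourceAgg; // aggregate the element was extracted from
  std::size_t Index;     // element index it was extracted from
};

// Elts[I] describes element I of the rebuilt aggregate, as recovered from the
// insertvalue chain. If every element I was extracted from index I of one and
// the same source aggregate, store that aggregate in Source and return true.
bool findReusableSource(const std::vector<ElementOrigin> &Elts,
                        std::string &Source) {
  if (Elts.empty())
    return false;
  for (std::size_t I = 0; I < Elts.size(); ++I) {
    const ElementOrigin &E = Elts[I];
    if (!E.Known || E.Index != I)
      return false; // unknown origin, or extracted from a different index
    if (E.SourceAgg != Elts.front().SourceAgg)
      return false; // elements come from different aggregates
  }
  Source = Elts.front().SourceAgg;
  return true;
}
```

When the function succeeds, the rebuilt aggregate can be replaced by the source aggregate directly; the PHI-aware variant from step 4 would run the same search once per predecessor.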
Roman Lebedev
|
415270824a |
[NFC][InstCombine] Add tests for PHI merging/aggregate reconstruction (PR47060)
We should be able to see that the new aggregate we have produced is identical to the source aggregate from which we've extracted the elements that we used to form a new aggregate. This happens (a lot) in clang C++ exception code on unwind branch. |