llvm-mirror/test/Transforms/Inline/inline-hot-callsite.ll

; This tests that a hot callsite gets the (higher) inlinehint-threshold even without
; without inline hints and gets inlined because the cost is less than
; inlinehint-threshold. A cold callee with identical body does not get inlined because
; cost exceeds the inline-threshold

; RUN: opt < %s -inline -inline-threshold=0 -hot-callsite-threshold=100 -S | FileCheck %s
; RUN: opt < %s -passes='require<profile-summary>,cgscc(inline)' -inline-threshold=0 -hot-callsite-threshold=100 -S | FileCheck %s

; Run this with the default O2 pipeline to test that profile summary analysis
; is available during inlining.
; RUN: opt < %s -passes='default<O2>' -inline-threshold=0 -hot-callsite-threshold=100 -S | FileCheck %s

define i32 @callee1(i32 %x) {
  %x1 = add i32 %x, 1
  %x2 = add i32 %x1, 1
  %x3 = add i32 %x2, 1
  call void @extern()
  call void @extern()
  ret i32 %x3
}

define i32 @callee2(i32 %x) {
; CHECK-LABEL: @callee2(
  %x1 = add i32 %x, 1
  %x2 = add i32 %x1, 1
  %x3 = add i32 %x2, 1
  call void @extern()
  call void @extern()
  ret i32 %x3
}

define i32 @caller2(i32 %y1) {
; CHECK-LABEL: @caller2(
; CHECK: call i32 @callee2
; CHECK-NOT: call i32 @callee1
; CHECK: ret i32 %x3.i
  %y2 = call i32 @callee2(i32 %y1), !prof !22
  %y3 = call i32 @callee1(i32 %y2), !prof !21
  ret i32 %y3
}

declare void @extern()

!llvm.module.flags = !{!1}
!21 = !{!"branch_weights", i64 300}
!22 = !{!"branch_weights", i64 1}

!1 = !{i32 1, !"ProfileSummary", !2}
!2 = !{!3, !4, !5, !6, !7, !8, !9, !10}
!3 = !{!"ProfileFormat", !"SampleProfile"}
!4 = !{!"TotalCount", i64 10000}
!5 = !{!"MaxCount", i64 1000}
!6 = !{!"MaxInternalCount", i64 1}
!7 = !{!"MaxFunctionCount", i64 1000}
!8 = !{!"NumCounts", i64 3}
!9 = !{!"NumFunctions", i64 3}
!10 = !{!"DetailedSummary", !11}
!11 = !{!12, !13, !14}
!12 = !{i32 10000, i64 100, i32 1}
!13 = !{i32 999000, i64 100, i32 1}
!14 = !{i32 999999, i64 1, i32 2}
Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`; This tests that a hot callsite gets the (higher) inlinehint-threshold even without`
			`; without inline hints and gets inlined because the cost is less than`
			`; inlinehint-threshold. A cold callee with identical body does not get inlined because`
			`; cost exceeds the inline-threshold`

[PM] Add ProfileSummaryAnalysis as a required pass in the new pipeline. Differential revision: https://reviews.llvm.org/D32768 llvm-svn: 302170 2017-05-04 16:58:45 +00:00			`; RUN: opt < %s -inline -inline-threshold=0 -hot-callsite-threshold=100 -S \| FileCheck %s`
			`; RUN: opt < %s -passes='require<profile-summary>,cgscc(inline)' -inline-threshold=0 -hot-callsite-threshold=100 -S \| FileCheck %s`

			`; Run this with the default O2 pipeline to test that profile summary analysis`
			`; is available during inlining.`
			`; RUN: opt < %s -passes='default<O2>' -inline-threshold=0 -hot-callsite-threshold=100 -S \| FileCheck %s`

Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`define i32 @callee1(i32 %x) {`
			`%x1 = add i32 %x, 1`
			`%x2 = add i32 %x1, 1`
			`%x3 = add i32 %x2, 1`
[InlineCost] Remove skew when calculating call costs When calculating the cost of a call instruction we were applying a heuristic penalty as well as the cost of the instruction itself. However, when calculating the benefit from inlining we weren't discounting the equivalent penalty for the call instruction that would be removed! This caused skew in the calculation and meant we wouldn't inline in the following, trivial case: int g() { h(); } int f() { g(); } llvm-svn: 286814 2016-11-14 11:14:41 +00:00			`call void @extern()`
[PM] Add ProfileSummaryAnalysis as a required pass in the new pipeline. Differential revision: https://reviews.llvm.org/D32768 llvm-svn: 302170 2017-05-04 16:58:45 +00:00			`call void @extern()`
Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`ret i32 %x3`
			`}`

			`define i32 @callee2(i32 %x) {`
			`; CHECK-LABEL: @callee2(`
			`%x1 = add i32 %x, 1`
			`%x2 = add i32 %x1, 1`
			`%x3 = add i32 %x2, 1`
[InlineCost] Remove skew when calculating call costs When calculating the cost of a call instruction we were applying a heuristic penalty as well as the cost of the instruction itself. However, when calculating the benefit from inlining we weren't discounting the equivalent penalty for the call instruction that would be removed! This caused skew in the calculation and meant we wouldn't inline in the following, trivial case: int g() { h(); } int f() { g(); } llvm-svn: 286814 2016-11-14 11:14:41 +00:00			`call void @extern()`
[PM] Add ProfileSummaryAnalysis as a required pass in the new pipeline. Differential revision: https://reviews.llvm.org/D32768 llvm-svn: 302170 2017-05-04 16:58:45 +00:00			`call void @extern()`
Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`ret i32 %x3`
			`}`

			`define i32 @caller2(i32 %y1) {`
			`; CHECK-LABEL: @caller2(`
			`; CHECK: call i32 @callee2`
			`; CHECK-NOT: call i32 @callee1`
			`; CHECK: ret i32 %x3.i`
			`%y2 = call i32 @callee2(i32 %y1), !prof !22`
			`%y3 = call i32 @callee1(i32 %y2), !prof !21`
			`ret i32 %y3`
			`}`

[InlineCost] Remove skew when calculating call costs When calculating the cost of a call instruction we were applying a heuristic penalty as well as the cost of the instruction itself. However, when calculating the benefit from inlining we weren't discounting the equivalent penalty for the call instruction that would be removed! This caused skew in the calculation and meant we wouldn't inline in the following, trivial case: int g() { h(); } int f() { g(); } llvm-svn: 286814 2016-11-14 11:14:41 +00:00			`declare void @extern()`

Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`!llvm.module.flags = !{!1}`
			`!21 = !{!"branch_weights", i64 300}`
			`!22 = !{!"branch_weights", i64 1}`

			`!1 = !{i32 1, !"ProfileSummary", !2}`
			`!2 = !{!3, !4, !5, !6, !7, !8, !9, !10}`
Improve PGO support for the new inliner This adds the following to the new PM based inliner in PGO mode: * Use block frequency analysis to derive callsite's profile count and use that to adjust thresholds of hot and cold callsites. * Incrementally update the BFI of the caller after a callee gets inlined into it. This incremental update is only within an invocation of the run method - BFI is not preserved across calls to run. Update the function entry count of the callee after inlining it into a caller. * I've tuned the thresholds for the hot and cold callsites using a hacked up version of the old inliner that explicitly computes BFI on a set of internal benchmarks and spec. Once the new PM based pipeline stabilizes (IIRC Chandler mentioned there are known issues) I'll benchmark this again and adjust the thresholds if required. Inliner PGO support. Differential revision: https://reviews.llvm.org/D28331 llvm-svn: 292666 2017-01-20 22:44:04 +00:00			`!3 = !{!"ProfileFormat", !"SampleProfile"}`
Implement callsite-hotness based inline cost for Sample-based PGO Summary: For sample-based PGO, using BFI to calculate callsite count is sometime not accurate. This is because with sampling based approach, if a callsite resides in a hot loop deeply nested in a bunch of cold branches, the callsite's BFI frequency would be inaccurately calculated due to lack of samples in the cold branch. E.g. if (A1 && A2 && A3 && ..... && A10) { for (i=0; i < 100000000; i++) { callsite(); } } Assume that A1 to A100 are all 100% taken, and callsite has 1000 samples and thus is considerred hot. Because the loop's trip count is huge, it's normal that all branches outside the loop has no sample at all. As a result, we can only use static branch probability to derive the the frequency of the loop header. Assuming that static heuristic thinks each branch is 50% taken, then the count calculated from BFI will be 1/(2^10) of the actual value. In order to get more accurate callsite count, we directly annotate the weight on the call instruction, and directly use it when checking callsite hotness. Note that this mechanism can also be shared by instrumentation based callsite hotness analysis. The side benefit is that it breaks the dependency from Inliner to BFI as call count is embedded in the IR. Reviewers: davidxl, eraman, dnovillo Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D22118 llvm-svn: 275073 2016-07-11 16:48:54 +00:00			`!4 = !{!"TotalCount", i64 10000}`
			`!5 = !{!"MaxCount", i64 1000}`
			`!6 = !{!"MaxInternalCount", i64 1}`
			`!7 = !{!"MaxFunctionCount", i64 1000}`
			`!8 = !{!"NumCounts", i64 3}`
			`!9 = !{!"NumFunctions", i64 3}`
			`!10 = !{!"DetailedSummary", !11}`
			`!11 = !{!12, !13, !14}`
			`!12 = !{i32 10000, i64 100, i32 1}`
			`!13 = !{i32 999000, i64 100, i32 1}`
			`!14 = !{i32 999999, i64 1, i32 2}`