block. The primary motivation for doing this is that we can now unroll nested loops.
This makes a pretty big difference in some cases. For example, in 183.equake,
we are now beating the native compiler with the CBE, and we are a lot closer
with LLC.
I'm now going to play around a bit with the unroll factor and see what effect
it really has.
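As an illustrative sketch (my example, not a testcase from the tree), consider
a nest where the inner loop has a constant trip count:

int test() {
  int i, j, total = 0;
  for (i = 0; i != 100; ++i)
    for (j = 0; j != 4; ++j)    /* constant trip count: fully unrollable */
      total += i * j;
  return total;
}

Once the inner loop is fully unrolled and its blocks are merged away, the outer
loop body becomes a single basic block, so the unroller can consider it too.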
llvm-svn: 13034
limited. Even in its extremely simple state (it can only *fully* unroll single
basic block loops that execute a constant number of times), it already helps improve
performance a LOT on some benchmarks, particularly with the native code generators.
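For reference, a minimal sketch of the only shape the pass handles today (again
my example, not from the tree): the body is one basic block and the trip count
is a compile-time constant.

int test() {
  int i, sum = 0;
  for (i = 0; i != 4; ++i)      /* one-block body, constant trip count */
    sum += i;
  return sum;                   /* fully unrolled: sum = 0 + 1 + 2 + 3 */
}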
llvm-svn: 13028
operations. This allows us to compile this testcase:
#include <stdio.h>
int main() {
  int h = 1;
  do h = 3 * h + 1; while (h <= 256);
  printf("%d\n", h);
  return 0;
}
into this:
int %main() {
entry:
        call void %__main( )
        %tmp.6 = call int (sbyte*, ...)* %printf( sbyte* getelementptr ([4 x sbyte]* %.str_1, long 0, long 0), int 364 ) ; <int> [#uses=0]
        ret int 0
}
This testcase was taken directly from 256.bzip2, believe it or not.
This code is not as general as I would like. Next up is to refactor it
a bit to handle more cases.
llvm-svn: 13019
even if the loop uses expressions that we can't compute in closed form.
This allows us to calculate that this function always returns 55:
#include <math.h>
int test() {
  double X;
  int Count = 0;
  for (X = 100; X > 1; X = sqrt(X), ++Count)
    /*empty*/;
  return Count;
}
And allows us to compute trip counts for loops like:
  int h = 1;
  do h = 3 * h + 1; while (h <= 256);
(which occurs in bzip2), and for this function, which occurs after inlining
and other optimizations:
int popcount()
{
  int x = 666;
  int result = 0;
  while (x != 0) {
    result = result + (x & 0x1);
    x = x >> 1;
  }
  return result;
}
We still cannot compute the exit values of result or h in the two loops above,
which means we cannot delete the loops, but we are getting closer. Being able to
compute a constant trip count for these two loops will allow us to unroll them
completely though.
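The brute-force evaluation itself is conceptually simple; here is a sketch of
the idea in C (the name and the iteration cap are illustrative assumptions,
not the actual implementation):

int bruteForceTripCount() {
  int h = 1, trips = 0;         /* constant initial value from the loop */
  do {
    h = 3 * h + 1;              /* constant-fold one iteration of the body */
    ++trips;
    if (trips > 10000)
      return -1;                /* give up: trip count not computable */
  } while (h <= 256);           /* the loop's exit test */
  return trips;                 /* 5 for the bzip2 loop above */
}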
llvm-svn: 13017
that does not dominate all of its users, but is in the same basic block as
its users. This class of error is what caused the mysterious CBE-only
failures last night.
llvm-svn: 12979
Basically we were using SimplifyCFG as a huge sledgehammer for a simple
optimization. Because SimplifyCFG does so many things, we can't use it
for this purpose.
llvm-svn: 12977
Instead of producing code like this:
Loop:
        X = phi 0, X2
        ...
        X2 = X + 1
        if (X != N-1) goto Loop
We now generate code that looks like this:
Loop:
        X = phi 0, X2
        ...
        X2 = X + 1
        if (X2 != N) goto Loop
This has two big advantages:
1. The trip count of the loop is now explicit in the code, allowing
the direct implementation of Loop::getTripCount()
2. This reduces register pressure in the loop, and allows X and X2 to be
put into the same register.
As a consequence of the second point, the code we generate for loops went
from:
.LBB2: # no_exit.1
        ...
        mov %EDI, %ESI
        inc %EDI
        cmp %ESI, 2
        mov %ESI, %EDI
        jne .LBB2 # PC rel: no_exit.1
To:
.LBB2: # no_exit.1
        ...
        inc %ESI
        cmp %ESI, 3
        jne .LBB2 # PC rel: no_exit.1
... which has two fewer moves, and uses one less register.
llvm-svn: 12961
at the bottom of the loop instead of the top. This reduces the number of
overlapping live ranges a lot, for example, eliminating a spill in an important
loop in 183.equake with linear scan.
I still need to make the exit comparison of the loop use the post-incremented
version of this variable, but this is an easy first step.
llvm-svn: 12952
even when the "optimization" I added before is turned off. It generates this
extremely pointless code:
test:
        fld QWORD PTR [%ESP + 4]
        mov %AL, 0
        test %AL, %AL
        fcmove %ST(0), %ST(0)
        ret
Good thing the optimizer will have removed this before code generation
anyway. :)
llvm-svn: 12939
Fix several bugs in the intrinsics:
1. Make sure to copy the input registers before the instructions that use them
2. Make sure to copy the value returned by 'in' out of EAX into the register
it is supposed to be in.
This fixes assertions when using in/out and linear scan.
llvm-svn: 12896
SCC passes much more useful. In particular, this should fix the incredibly
stupid missed inlining opportunities that the inliner suffered from.
llvm-svn: 12860
of the fucom[p][p] instructions. This allows us to code generate this function:

bool %test(double %X, double %Y) {
        %C = setlt double %Y, %X
        ret bool %C
}
... into:
test:
        fld QWORD PTR [%ESP + 4]
        fld QWORD PTR [%ESP + 12]
        fucomip %ST(1)
        fstp %ST(0)
        setb %AL
        movsx %EAX, %AL
        ret
where before we generated:
test:
        fld QWORD PTR [%ESP + 4]
        fld QWORD PTR [%ESP + 12]
        fucompp
**      fnstsw
**      sahf
        setb %AL
        movsx %EAX, %AL
        ret
The two marked instructions (which are the ones eliminated) are very bad,
because they serialize execution of the processor. The new fucomi-family
instructions are only available on the PPro and later, but since we already
use cmovs we aren't losing any portability.
I retained the old code for the day when we decide we want to support back
to the 386.
llvm-svn: 12852
If the source of the cast is a load, we can just use the source memory location,
without having to create a temporary stack slot entry.
Before, we code generated this:
double %int(int* %P) {
        %V = load int* %P
        %V2 = cast int %V to double
        ret double %V2
}
into:
int:
        sub %ESP, 4
        mov %EAX, DWORD PTR [%ESP + 8]
        mov %EAX, DWORD PTR [%EAX]
        mov DWORD PTR [%ESP], %EAX
        fild DWORD PTR [%ESP]
        add %ESP, 4
        ret
Now we produce this:
int:
        mov %EAX, DWORD PTR [%ESP + 4]
        fild DWORD PTR [%EAX]
        ret
... which is nicer.
llvm-svn: 12846
for mul and div.
Instead of generating this:
test_divr:
        fld QWORD PTR [%ESP + 4]
        fld QWORD PTR [.CPItest_divr_0]
        fdivrp %ST(1)
        ret
We now generate this:
test_divr:
        fld QWORD PTR [%ESP + 4]
        fdivr QWORD PTR [.CPItest_divr_0]
        ret
This code desperately needs refactoring, which will come in the next
patch.
llvm-svn: 12841
instructions use. This doesn't change any functionality except that long
constant expressions of these operations will now magically start working.
llvm-svn: 12840
        fld QWORD PTR [%ESP + 4]
        fadd QWORD PTR [.CPItest_add_0]
instead of:
        fld QWORD PTR [%ESP + 4]
        fld QWORD PTR [.CPItest_add_0]
        faddp %ST(1)
I also intend to do this for mul & div, but it appears that I have to
refactor a bit of code before I can do so.
This is tested by: test/Regression/CodeGen/X86/fp_constant_op.llx
llvm-svn: 12839
1. If an incoming argument is dead, don't load it from the stack
2. Do not code gen noop copies at all (i.e., cast int -> uint), not even to
   a move. This should reduce register pressure for allocators that are
   unable to coalesce away these copies in some cases; see the example below.
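For example (an illustration of mine, not from the patch), the cast here
changes only the type, not the bits, so no move is emitted for it at all:

unsigned test(int X) {
  return (unsigned)X;           /* int -> uint: a noop copy */
}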
llvm-svn: 12835