by splicing function bodies from the src module to the destination module.
This speeds up linking quite a bit, e.g. gccld time on 176.gcc from 26s -> 20s
when forming the .rbc file, with a profile build. One of the really strange
but cool effects of this patch is that it speeds up the optimizers as well,
from 12s -> 10.7s, presumably because of better locality???
In any case, this is just a first step. We can trivially get rid of the
LocalMap now and do other simplifications.
llvm-svn: 17893
* Make the numVbrBytes function more efficient and better documented
* Fix a bug in name truncation
* Add comments before functions
* Get rid of functions that are now inlined into the header
* Do not have Archive doing symbol table printing
* Put assert comments into the assert so they print out
* Make sure foreign symbol tables are written
llvm-svn: 17884
* Make sure we write out the foreign symbol table if we read one
* Make the padding calculation more efficient and avoid Solaris warnings
llvm-svn: 17883
* get rid of (void) construct in function declarations
* make toString a const member
* add a default implementation of toString for Win32
llvm-svn: 17873
* Clean up the StatusInfo constructor to construct all members and give
them reasonable values.
* Get rid of the Vector typedef and make the interface to
getDirectoryContent use a std::set instead of a std::vector so the dir
content is sorted.
* Make the getStatusInfo method const and not return a useless boolean.
llvm-svn: 17872
* Get rid of "emitMaybePCRelativeValue", either we want to emit a PC relative
value or not: drop the maybe BS. As it turns out, the only places where
the bool was a variable coming in, the bool was a dynamic constant.
llvm-svn: 17867
immediately instead of lazily.
In this program, for example:
int main() {
printf("hello world\n");
printf("hello world\n");
printf("hello world\n");
printf("hello world\n");
}
We used to have to go through compilation callback 4 times (once for each
call to printf), now we don't go to it at all.
Thanks to Misha for noticing this, and for adding the initial ghost linkage
patches.
llvm-svn: 17864
1. Speedup getValueState by having it not consider Arguments. It's better
to just add them before we start SCCP'ing.
2. SCCP can delete the contents of dead blocks. No really, it's ok! This
reduces the size of the IR for subsequent passes, even though
simplifycfg would do the same job. In practice, simplifycfg does not
run until much later than sccp in gccas
llvm-svn: 17820
class. The only changes are minor:
* Do not try to SCCP instructions that return void in the rewrite loop.
This is silly and foolhardy, wasting a map lookup and adding an entry
to the map which is never used.
* If we decide something has an undefined value, rewrite it to undef,
potentially leading to further simplifications.
llvm-svn: 17816
value. This allows us to turn more globals into constants and eliminate them.
This patch implements GlobalOpt/load-store-global.llx.
Note that this patch speeds up 255.vortex from:
Output/255.vortex.out-cbe.time:program 7.640000
Output/255.vortex.out-llc.time:program 9.810000
to:
Output/255.vortex.out-cbe.time:program 7.250000
Output/255.vortex.out-llc.time:program 9.490000
Which isn't bad at all!
llvm-svn: 17746
If this happens, detect it early instead of relying on instcombine to notice
it later. This can be a big speedup, because PHI nodes can have many
incoming values.
llvm-svn: 17741
%X = alloca ...
%Y = alloca ...
X == Y
into false. This allows us to simplify some stuff in eon (and probably
many other C++ programs) where operator= was checking for self assignment.
Folding this allows us to SROA several additional structs.
llvm-svn: 17735
constant value. This makes the return value dead and allows for
simplification in the caller.
This implements IPConstantProp/return-constant.ll
This triggers several dozen times throughout SPEC.
llvm-svn: 17730
of the array is just two. This occurs 8 times in gcc, 6 times in crafty, and
12 times in 099.go.
This implements ScalarRepl/sroa_two.ll
llvm-svn: 17727
argument pointers. This is only valid to do if the function already
unconditionally loaded an argument or if the pointer passed in is known
to be valid. Make sure to do the required checks.
This fixed ArgumentPromotion/control-flow.ll and the Burg program.
llvm-svn: 17718
two or three, open code the equivalent operation which is faster on athlon
and P4 (by a substantial margin).
For example, instead of compiling this:
long long X2(long long Y) { return Y << 2; }
to:
X3_2:
movl 4(%esp), %eax
movl 8(%esp), %edx
shldl $2, %eax, %edx
shll $2, %eax
ret
Compile it to:
X2:
movl 4(%esp), %eax
movl 8(%esp), %ecx
movl %eax, %edx
shrl $30, %edx
leal (%edx,%ecx,4), %edx
shll $2, %eax
ret
Likewise, for << 3, compile to:
X3:
movl 4(%esp), %eax
movl 8(%esp), %ecx
movl %eax, %edx
shrl $29, %edx
leal (%edx,%ecx,8), %edx
shll $3, %eax
ret
This matches icc, except that icc open codes the shifts as adds on the P4.
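The arithmetic behind the open-coded sequence can be modeled on two 32-bit halves. This is only a sketch of what the listings compute, not the code generator itself; `Halves` and `shlSmall` are names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 64-bit value held in two 32-bit registers.
struct Halves {
  uint32_t lo; // low word (%eax in the listings above)
  uint32_t hi; // high word (%ecx on input, %edx on output)
};

// Shift left by a small constant without a double-shift (shld):
// the bits that leave the low word are extracted with a right shift,
// then added to the scaled high word (the lea does this for 2 and 3).
inline Halves shlSmall(Halves v, unsigned amt) {
  assert(amt >= 1 && amt <= 31 && "sketch only covers in-range shifts");
  uint32_t carry = v.lo >> (32 - amt); // shrl $(32-amt), %edx
  uint32_t hi = carry + (v.hi << amt); // leal (%edx,%ecx,2^amt), %edx
  uint32_t lo = v.lo << amt;           // shll $amt, %eax
  return {lo, hi};
}
```

The add in the lea is safe because `carry` only occupies the low `amt` bits, which are zero in the scaled high word.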
llvm-svn: 17707
for (X * C1) + (X * C2) (where * can be mul or shl), allowing us to fold:
Y+Y+Y+Y+Y+Y+Y+Y
into
%tmp.8 = shl long %Y, ubyte 3 ; <long> [#uses=1]
instead of
%tmp.4 = shl long %Y, ubyte 2 ; <long> [#uses=1]
%tmp.12 = shl long %Y, ubyte 2 ; <long> [#uses=1]
%tmp.8 = add long %tmp.4, %tmp.12 ; <long> [#uses=1]
This implements add.ll:test25
Also add support for (X*C1)-(X*C2) -> X*(C1-C2), implementing sub.ll:test18
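As a rough illustration of the fold (not the instcombine code), each term can be reduced to the multiplier it applies to X, with a shift by k counting as a multiply by 2^k. The `ScaledX` type and helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for "X * C" or "X << C" on a common operand X.
struct ScaledX {
  bool isShift;    // true: X << amount, false: X * amount
  uint64_t amount; // shift count or multiplier
};

// Multiplier that the term applies to X.
inline uint64_t multiplierOf(ScaledX t) {
  assert((!t.isShift || t.amount < 64) && "shift count out of range");
  return t.isShift ? (uint64_t{1} << t.amount) : t.amount;
}

// (X * C1) + (X * C2)  ==>  X * (C1 + C2)
inline uint64_t foldAdd(ScaledX a, ScaledX b) {
  return multiplierOf(a) + multiplierOf(b);
}

// (X * C1) - (X * C2)  ==>  X * (C1 - C2), in wrapping arithmetic
inline uint64_t foldSub(ScaledX a, ScaledX b) {
  return multiplierOf(a) - multiplierOf(b);
}
```

With two `shl ..., 2` terms the combined multiplier is 8, which is the single `shl ..., 3` shown above.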
llvm-svn: 17704
This allows the elimination of a bunch of global pool descriptor args from
programs being pool allocated (and is also generally useful!)
llvm-svn: 17657
bzip2: block size 9 -> 5, reduces memory by 400Kbytes, doesn't affect speed
or compression ratio on all but the largest bytecode files (>1MB)
zip: level 9 -> 6, this speeds up compression time by ~30% but only
degrades the compressed size by a few bytes per megabyte. Those few
bytes aren't worth the effort.
llvm-svn: 17647
int test(int x) { return 32768 - x; }
Fixed by teaching the function that checks a constant's validity to be used
as an immediate argument about subtract-from instructions.
llvm-svn: 17476
This method is really a gross hack, but at least we can make it work on
the targets we support right now.
This bug fix stops a crash in a testcase reduced from 176.gcc
llvm-svn: 17443
1. Calls to external global VARIABLES should not be treated as a call to an
external function
2. Efficiently deleting an element from a vector by using std::swap with
the back, then pop_back is NOT a good way to keep the vector sorted.
3. Our hope of having stuff get deleted by making them redundant just won't
work. In particular, if we have three calls in sequence that should be
merged (A, B, C), first we unify B into A. To be sure that they appeared
identical (so B would be erased) we set B = A. On the next step, we
unified C into A and set C = A. Unfortunately, this is no guarantee that
C = B, so we would fail to delete the dead call. Switch to a more
explicit scheme.
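Point 2 is easy to reproduce in isolation. A minimal sketch, using a plain `std::vector<int>` rather than the actual data structure from this patch (the helper names are invented):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// O(1) removal: the last element is moved into the hole, so any
// sorted order the vector had is lost.
inline void eraseUnordered(std::vector<int> &v, std::size_t idx) {
  std::swap(v[idx], v.back());
  v.pop_back();
}

// O(n) removal: later elements slide down, so sorted order is kept.
inline void eraseOrdered(std::vector<int> &v, std::size_t idx) {
  v.erase(v.begin() + static_cast<std::ptrdiff_t>(idx));
}
```

Removing index 1 from {1, 2, 3, 4, 5} gives {1, 5, 3, 4} with the first helper and {1, 3, 4, 5} with the second.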
llvm-svn: 17357
* change some uses of NH.getNode() in a bool context to use !NH.isNull()
* Fix a bunch of places where we depended on the (undefined) order of
evaluation of arguments to function calls to ensure that getNode() was
called before getOffset(). In practice, this was NOT happening.
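A small sketch of the ordering problem and the fix. The `Handle` type below is hypothetical and assumes, for illustration only, that getNode() updates state that getOffset() then reads; the real class is not shown in this log.

```cpp
// Hypothetical handle: getNode() resolves forwarding as a side effect,
// and getOffset() is only meaningful after that has happened.
struct Handle {
  int forwardedOffset = 8;
  int offset = 0;
  bool resolved = false;

  int getNode() {
    if (!resolved) {
      offset = forwardedOffset;
      resolved = true;
    }
    return 1; // stand-in for a node pointer
  }
  int getOffset() const { return offset; }
};

inline int use(int node, int offset) { return node * 100 + offset; }

// Fragile: use(h.getNode(), h.getOffset()) leaves the order of the two
// calls up to the compiler, so getOffset() may run first.
// Fixed: name the values so the calls are sequenced explicitly.
inline int useSequenced(Handle &h) {
  int node = h.getNode();
  int off = h.getOffset();
  return use(node, off);
}
```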
llvm-svn: 17354
Correct the dependency of the Lexer.o file on the constructed
llvmAsmParser.h header file. It is not the Lexer.cpp file that depends on
the header, it's the output of compiling Lexer.cpp, Lexer.o
llvm-svn: 17289
by the recently committed rlwimi.ll test file. Also commit initial code
for bitfield extract, although it is turned off until fully debugged.
llvm-svn: 17207
* Convert register numbers from their opcode value to the real value, e.g.
PPC::R1 => 1 and PPC::F1 => 1
* Add correct handling of loading of global values which are PC-relative --
implement ha16() and lo16()
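The log does not define ha16() and lo16(); the sketch below assumes the conventional PowerPC meaning, where lo16 is the low halfword and ha16 is the high halfword adjusted so that adding the sign-extended low halfword rebuilds the address. `rebuild` is a made-up helper that models a lis followed by a load with a 16-bit displacement.

```cpp
#include <cstdint>

// Low 16 bits of the address.
inline uint16_t lo16(uint32_t addr) {
  return static_cast<uint16_t>(addr & 0xFFFF);
}

// High 16 bits, bumped by one when bit 15 is set, to compensate for the
// sign extension of the low half when it is used as a displacement.
inline uint16_t ha16(uint32_t addr) {
  return static_cast<uint16_t>(((addr >> 16) + ((addr >> 15) & 1)) & 0xFFFF);
}

// Model of "lis rX, ha16(addr)" plus a signed 16-bit displacement.
inline uint32_t rebuild(uint32_t addr) {
  uint32_t high = static_cast<uint32_t>(ha16(addr)) << 16;
  int32_t low = lo16(addr);
  if (low >= 0x8000)
    low -= 0x10000; // sign-extend the 16-bit displacement
  return high + static_cast<uint32_t>(low);
}
```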
llvm-svn: 17190
be listed second as that is how the instructions are usually created (and is the
correct asm syntax) so that it's assembled correctly from its constituents
llvm-svn: 17183
The decimal value given in the manual (8 or 9) really needs to be multiplied by
a factor of 32 because of the group of 5 zero bits after the register code.
llvm-svn: 17182
as the shift amount operand to a shift instruction. This was causing us to
emit unnecessary clear operations for code such as:
int foo(int x) { return 1 << x; }
llvm-svn: 17175
including registers, constants, and partial support for global addresses
* The JIT is disabled by default to allow building llvm-gcc, which wants to test
running programs during configure
llvm-svn: 17149
Instead of unconditionally copying all phi node values into temporaries for
all successor blocks, generate code that will determine what successor
block will be called and then copy only those phi node values needed by
the successor block.
This seems to cut down namd execution time from being 8% higher than GCC to
4% higher than GCC.
llvm-svn: 17144
- Support added for functions, basic blocks, constant pool, constants,
registers, and some basic support for globals, all untested
* Turn assert()s into abort()s so that unimplemented functions fail in release
llvm-svn: 17143
loops. This optimization is not turned on by default yet, but may be run
with the opt tool's -loop-reduce flag. There are many FIXMEs listed in the
code that will make it far more applicable to a wide range of code, but you
have to start somewhere :)
This limited version currently triggers on the following tests in the
MultiSource directory:
pcompress2: 7 times
cfrac: 5 times
anagram: 2 times
ks: 6 times
yacr2: 2 times
llvm-svn: 17134
change hacks off 10K of bytecode from perlbmk (.5%) even though the front-end
is not generating them yet and we are not optimizing the resultant code.
This isn't too bad.
llvm-svn: 17111
exercise that I'm not interested in tackling right now. Just punt and treat them
like unwind's.
This 'fixes' test/Regression/Transforms/ADCE/unreachable-function.ll
llvm-svn: 17106
unnecessary. This allows us to delete several hundred phi nodes of the
form PHI(x,x,x,undef) from 253.perlbmk and probably other programs as well.
This implements Mem2Reg/UndefValuesMerge.ll
llvm-svn: 17098
double %test(uint %X) {
%tmp.1 = cast uint %X to double ; <double> [#uses=1]
ret double %tmp.1
}
into:
test:
sub %ESP, 8
mov %EAX, DWORD PTR [%ESP + 12]
mov %ECX, 0
mov DWORD PTR [%ESP], %EAX
mov DWORD PTR [%ESP + 4], %ECX
fild QWORD PTR [%ESP]
add %ESP, 8
ret
... which basically zero extends to 8 bytes, then does an fild for an
8-byte signed int.
Now we generate this:
test:
sub %ESP, 4
mov %EAX, DWORD PTR [%ESP + 8]
mov DWORD PTR [%ESP], %EAX
fild DWORD PTR [%ESP]
shr %EAX, 31
fadd DWORD PTR [.CPItest_0 + 4*%EAX]
add %ESP, 4
ret
.section .rodata
.align 4
.CPItest_0:
.quad 5728578726015270912
This does a 32-bit signed integer load, then adds in an offset if the sign
bit of the integer was set.
It turns out that this is substantially faster than the preceding sequence.
Consider this testcase:
unsigned a[2]={1,2};
volatile double G;
void main() {
int i;
for (i=0; i<100000000; ++i )
G += a[i&1];
}
On zion (a P4 Xeon, 3GHz), this patch speeds up the testcase from 2.140s
to 0.94s.
On apoc, an athlon MP 2100+, this patch speeds up the testcase from 1.72s
to 1.34s.
Note that the program takes 2.5s/1.97s on zion/apoc with GCC 3.3 -O3
-fomit-frame-pointer.
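The new sequence can be mirrored in C++ as a rough sketch. The `.quad 5728578726015270912` constant is 0x4F80000000000000, which read as two little-endian 32-bit floats is {0.0, 2^32}; `uintToDouble` and `kFixup` are names invented here.

```cpp
#include <cstdint>

// Convert unsigned 32-bit to double via a signed conversion plus fixup.
inline double uintToDouble(uint32_t x) {
  // Two-entry table indexed by the sign bit: add 0 or 2^32.
  static const float kFixup[2] = {0.0f, 4294967296.0f};

  // Reinterpret the bits as a signed value (what fild DWORD PTR sees).
  int32_t asSigned =
      x < 0x80000000u
          ? static_cast<int32_t>(x)
          : static_cast<int32_t>(static_cast<int64_t>(x) - 4294967296LL);

  double d = asSigned;        // fild DWORD PTR [%ESP]
  return d + kFixup[x >> 31]; // fadd DWORD PTR [.CPItest_0 + 4*%EAX]
}
```

Every 32-bit integer and 2^32 itself are exactly representable in a double, so the fixup add introduces no rounding.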
llvm-svn: 17083
%X = and Y, constantint
%Z = setcc %X, 0
instead of emitting:
and %EAX, 3
test %EAX, %EAX
je .LBBfoo2_2 # UnifiedReturnBlock
We now emit:
test %EAX, 3
je .LBBfoo2_2 # UnifiedReturnBlock
This triggers 581 times on 176.gcc for example.
llvm-svn: 17080
1. optional shift left
2. and x, immX
3. and y, immY
4. or z, x, y
==> rlwimi z, x, y, shift, mask begin, mask end
where immX == ~immY and immX is a run of set bits. This transformation
fires 32 times on voronoi, once on espresso, and probably several
dozen times on external benchmarks such as gcc.
To put this in terms of actual code generated for
struct B { unsigned a : 3; unsigned b : 2; };
void storeA (struct B *b, int v) { b->a = v;}
void storeB (struct B *b, int v) { b->b = v;}
Old:
_storeA:
rlwinm r2, r4, 0, 29, 31
lwz r4, 0(r3)
rlwinm r4, r4, 0, 0, 28
or r2, r4, r2
stw r2, 0(r3)
blr
_storeB:
rlwinm r2, r4, 3, 0, 28
rlwinm r2, r2, 0, 27, 28
lwz r4, 0(r3)
rlwinm r4, r4, 0, 29, 26
or r2, r2, r4
stw r2, 0(r3)
blr
New:
_storeA:
lwz r2, 0(r3)
rlwimi r2, r4, 0, 29, 31
stw r2, 0(r3)
blr
_storeB:
lwz r2, 0(r3)
rlwimi r2, r4, 3, 27, 28
stw r2, 0(r3)
blr
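For readers unfamiliar with rlwimi, here is a small emulation of what the instruction computes, using PowerPC bit numbering (bit 0 is the most significant). It is a sketch for checking the listings above, not code from the backend; the function names are invented.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left by n bits (n taken modulo 32).
inline uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}

// Mask with ones from bit mb through bit me, PowerPC numbering.
// When mb > me the run wraps around from bit 31 to bit 0.
inline uint32_t ppcMask(unsigned mb, unsigned me) {
  assert(mb < 32 && me < 32);
  uint32_t fromBegin = 0xFFFFFFFFu >> mb;
  uint32_t toEnd = 0xFFFFFFFFu << (31 - me);
  return mb <= me ? (fromBegin & toEnd) : (fromBegin | toEnd);
}

// rlwimi dest, src, sh, mb, me: rotate src left by sh, then insert the
// masked bits into dest, leaving the other bits of dest untouched.
inline uint32_t rlwimi(uint32_t dest, uint32_t src, unsigned sh,
                       unsigned mb, unsigned me) {
  uint32_t m = ppcMask(mb, me);
  return (rotl32(src, sh) & m) | (dest & ~m);
}
```

With these definitions, `rlwimi(word, v, 0, 29, 31)` matches the old storeA sequence `(v & 7) | (word & ~7)`.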
llvm-svn: 17078
flag rotate left word immediate then mask insert (rlwimi) as a two-address
instruction, and update the ISel usage of the instruction accordingly.
This will allow us to properly schedule rlwimi, and use it to efficiently
codegen bitfield operations.
llvm-svn: 17068
case:
int C[100];
int foo() {
return C[4];
}
We now codegen:
foo:
mov %EAX, DWORD PTR [C + 16]
ret
instead of:
foo:
mov %EAX, OFFSET C
mov %EAX, DWORD PTR [%EAX + 16]
ret
Other impressive features may be coming later.
This patch is contributed by Jeff Cohen!
llvm-svn: 17011
useful when you have a reference like:
int A[100];
void foo() { A[10] = 1; }
In this case, &A[10] is a single constant and should be treated as such.
Only MO_GlobalAddress and MO_ExternalSymbol are allowed to use this field, no
other operand type is.
This is another fine patch contributed by Jeff Cohen!!
llvm-svn: 17007
The problem occurred when trying to reload this instruction:
MOV32mr %reg2326, 8, %reg2297, 4, %reg2295
The value of reg2326 was available in EBX, so it was reused from there, instead
of reloading it into EDX.
The value of reg2297 was available in EDX, so it was reused from there, instead
of reloading it into EDI.
The value of reg2295 was not available, so we tried reloading it into EBX, its
assigned register. However, we checked and saw that we already reloaded
something into EBX, so we chose what reg2326 was assigned to (EDX) and reloaded
into that register instead.
Unfortunately EDX had already been used by reg2297, so reloading into EDX
clobbered the value used by the reg2326 operand, breaking the program.
The fix for this is to check that the newly picked register is ok. In this
case we now find that EDX is already used and try using EDI, which succeeds.
llvm-svn: 17006
This transformation fires a few dozen times across the testsuite.
For example, int test2(int X) { return X ^ 0x0FF00FF0; }
Old:
_test2:
lis r2, 4080
ori r2, r2, 4080
xor r3, r3, r2
blr
New:
_test2:
xoris r3, r3, 4080
xori r3, r3, 4080
blr
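The identity being used is just that an xor with a 32-bit constant can be split into an xor with its high halfword and an xor with its low halfword (4080 is 0x0FF0). A tiny sketch, with `xorSplit` as an invented name:

```cpp
#include <cstdint>

// X ^ C computed the way the new sequence does it.
inline uint32_t xorSplit(uint32_t x, uint32_t c) {
  uint32_t hi = c >> 16;    // immediate for xoris
  uint32_t lo = c & 0xFFFF; // immediate for xori
  x ^= hi << 16;            // xoris r3, r3, hi
  x ^= lo;                  // xori  r3, r3, lo
  return x;
}
```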
llvm-svn: 17004
addPassesToEmitMachineCode()
* Add support for registers and constants in getMachineOpValue()
This enables running "int main() { ret 0 }" via the PowerPC JIT.
llvm-svn: 16983
* Add implementation of getMachineOpValue() for generated code emitter
* Convert assert()s in unimplemented functions to abort()s so that non-debug
builds fail predictably
* Add file header comments
llvm-svn: 16981
and 64-bit code emitters that cannot share code unless we use virtual
functions
* Identify components being built by tablegen with more detail by assigning them
to PowerPC, PPC32, or PPC64 more specifically; also avoids seeing 'building
PowerPC XYZ' messages twice, where one is for PPC32 and one for PPC64
llvm-svn: 16980
to go in. This patch allows us to compute the trip count of loops controlled
by values loaded from constant arrays. The canonical example of this is
strlen when passed a constant argument:
for (int i = 0; "constantstring"[i]; ++i) ;
return i;
In this case, it will compute that the loop executes 14 times, which means
that the exit value of i is 14. Because of this, the loop gets DCE'd and
we are happy. This also applies to anything that does similar things, e.g.
loops like this:
const float Array[] = { 0.1, 2.1, 3.2, 23.21 };
for (int i = 0; Array[i] < 20; ++i)
and is actually fairly general.
The problem with this is that it almost never triggers. The reason is that
we run indvars and the loop optimizer only at compile time, which is before
things like strlen and strcpy have been inlined into the program from libc.
Because of this, it almost never is used (it triggers twice in specint2k).
I'm committing it because it DOES work, may be useful in the future, and
doesn't slow us down at all. If/when we start running the loop optimizer
at link-time (-O4?) this will be very nice indeed :)
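The idea can be sketched as brute-force evaluation: walk the constant array and count iterations until the loop condition fails. This is an illustration under that assumption, not the analysis code; `tripCount` is a hypothetical helper that gives up (returns -1) if it runs off the end of the array.

```cpp
#include <cstddef>

// Number of iterations of "for (i = 0; keepGoing(data[i]); ++i)" over a
// constant array, or -1 if the condition never fails within the array.
template <typename T, std::size_t N, typename Pred>
long tripCount(const T (&data)[N], Pred keepGoing) {
  for (std::size_t i = 0; i < N; ++i)
    if (!keepGoing(data[i]))
      return static_cast<long>(i);
  return -1; // unknown: the loop would read past the constant data
}
```

For the string example the result is 14, and for the float array with `Array[i] < 20` it is 3.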
llvm-svn: 16926
pointer recurrences into expressions from this:
%P_addr.0.i.0 = phi sbyte* [ getelementptr ([8 x sbyte]* %.str_1, int 0, int 0), %entry ], [ %inc.0.i, %no_exit.i ]
%inc.0.i = getelementptr sbyte* %P_addr.0.i.0, int 1 ; <sbyte*> [#uses=2]
into this:
%inc.0.i = getelementptr sbyte* getelementptr ([8 x sbyte]* %.str_1, int 0, int 0), int %inc.0.i.rec
Actually create something nice, like this:
%inc.0.i = getelementptr [8 x sbyte]* %.str_1, int 0, int %inc.0.i.rec
llvm-svn: 16924
well as a vector of constant*'s. It turns out that this is more efficient
and all of the clients want to do that, so we should cater to them.
llvm-svn: 16923
First, it allows SRA of globals that have embedded arrays, implementing
GlobalOpt/globalsra-partial.llx. This comes up infrequently, but does allow,
for example, deleting several stores to dead parts of globals in dhrystone.
Second, this implements GlobalOpt/malloc-promote-*.llx, which is the
following nifty transformation:
Basically if a global pointer is initialized with malloc, and we can tell
that the program won't notice, we transform this:
struct foo *FooPtr;
...
FooPtr = malloc(sizeof(struct foo));
...
FooPtr->A FooPtr->B
Into:
struct foo FooPtrBody;
...
FooPtrBody.A FooPtrBody.B
This comes up occasionally, for example, the 'disp' global in 183.equake (where
the xform speeds the CBE version of the program up from 56.16s to 52.40s (7%)
on apoc), and the 'desired_accept', 'fixLRBT', 'macroArray', & 'key_queue'
globals in 300.twolf (speeding it up from 22.29s to 21.55s (3.4%)).
The nice thing about this xform is that it exposes the resulting global to
global variable optimization and makes alias analysis easier in addition to
eliminating a few loads.
llvm-svn: 16916
first element of an array, return a GEP instead of a cast. This allows us
to transparently fold this:
int* getelementptr (int* cast ([100 x int]* %Gbody to int*), int 40)
into this:
int* getelementptr ([100 x int]* %Gbody, int 0, int 40)
llvm-svn: 16911
still optimize away all of the indirect calls and loads, etc from it.
This turns code like this:
if (G != 0)
G();
into
if (G != 0)
ActualCallee();
This triggers a couple of times in gcc and libstdc++.
llvm-svn: 16901
Deal with allocating stack space for outgoing args and copying them into the
correct stack slots (at least, we can copy <=32-bit int args).
We now correctly generate ADJCALLSTACK* instructions.
llvm-svn: 16881
stored to, but are stored at variable indexes. This occurs at least in
176.gcc, but probably others, and we should handle it for completeness.
llvm-svn: 16876
has a large number of users. Instead, just keep track of whether we're
making changes as we do so.
This patch has no functionality changes.
llvm-svn: 16874
we know that all uses of the global will trap if the pointer contained is
null. In this case, we forward substitute the stored value to any uses.
This has the effect of devirtualizing trivial globals in trivial cases. For
example, 164.gzip contains this:
gzip.h:extern int (*read_buf) OF((char *buf, unsigned size));
bits.c: read_buf = file_read;
deflate.c: lookahead = read_buf((char*)window,
deflate.c: n = read_buf((char*)window+strstart+lookahead, more);
Since read_buf has to point to file_read at every use, we just replace
the calls through read_buf with a direct call to file_read.
This occurs in several benchmarks, including 176.gcc and 164.gzip. Direct
calls are good and stuff.
llvm-svn: 16871
the -sse* options (to avoid misleading people).
Also, the stack alignment of the target doesn't depend on whether SSE is
eventually implemented, so remove a comment.
llvm-svn: 16860
which prevented setcc's from being folded into branches. It appears that
conditional branchinst's CC operand is actually operand(2), not operand(0)
as we might expect. :(
llvm-svn: 16859
* Do not let dangling dead constants prevent optimization
* Iterate global optimization while we're making progress.
These changes allow us to be more aggressive, handling cases like
GlobalOpt/iterate.llx without a problem (turning it into 'ret int 0').
llvm-svn: 16857
optimizations to trigger much more often. This allows the elimination of
several dozen more global variables in Programs/External. Note that we only
do this for non-constant globals: constant globals will already be optimized
out if the accesses to them permit it.
This implements Transforms/GlobalOpt/globalsra.llx
llvm-svn: 16842
of one or more 1 bits (may wrap from least significant bit to most
significant bit) as the rlwinm rather than andi., andis., or some longer
instruction sequence.
int andn4(int z) { return z & -4; }
int clearhi(int z) { return z & 0x0000FFFF; }
int clearlo(int z) { return z & 0xFFFF0000; }
int clearmid(int z) { return z & 0x00FFFF00; }
int clearwrap(int z) { return z & 0xFF0000FF; }
_andn4:
rlwinm r3, r3, 0, 0, 29
blr
_clearhi:
rlwinm r3, r3, 0, 16, 31
blr
_clearlo:
rlwinm r3, r3, 0, 0, 15
blr
_clearmid:
rlwinm r3, r3, 0, 8, 23
blr
_clearwrap:
rlwinm r3, r3, 0, 24, 7
blr
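One way to recognize such constants and derive the mask-begin/mask-end operands is sketched below. This is an illustration, not the selector's code: the helper names are invented, and the bit numbering follows the PowerPC convention where bit 0 is the most significant.

```cpp
#include <cstdint>

struct MaskRange {
  unsigned mb; // mask begin
  unsigned me; // mask end
};

// True if v is a single contiguous run of ones that does not wrap.
inline bool isShiftedMask(uint32_t v) {
  if (v == 0)
    return false;
  uint32_t filled = v | (v - 1); // fill the zeros below the run
  return (filled & (filled + 1)) == 0;
}

inline unsigned countLeadingZeros(uint32_t v) {
  unsigned n = 0;
  for (uint32_t bit = 0x80000000u; bit && !(v & bit); bit >>= 1)
    ++n;
  return n;
}

inline unsigned countTrailingZeros(uint32_t v) {
  unsigned n = 0;
  for (uint32_t bit = 1; bit && !(v & bit); bit <<= 1)
    ++n;
  return n;
}

// Accepts a run of ones that may wrap from bit 31 around to bit 0.
inline bool isRunOfOnes(uint32_t v, MaskRange &out) {
  if (v == 0)
    return false;
  if (isShiftedMask(v)) {
    out = {countLeadingZeros(v), 31 - countTrailingZeros(v)};
    return true;
  }
  uint32_t inv = ~v; // a wrapping run is the complement of a plain run
  if (isShiftedMask(inv)) {
    out = {32 - countTrailingZeros(inv), countLeadingZeros(inv) - 1};
    return true;
  }
  return false;
}
```

The five constants in the example above map to (0, 29), (16, 31), (0, 15), (8, 23) and (24, 7), matching the rlwinm operands in the listing.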
llvm-svn: 16832
1. Fix an illegal argument to getClassB when deciding whether or not to
sign extend a byte load.
2. Initial addition of isLoad and isStore flags to the instruction .td file
for eventual use in a scheduler.
3. Rewrite of how constants are handled in emitSimpleBinaryOperation so
that we can emit the PowerPC shifted immediate instructions far more
often. This allows us to emit the following code:
int foo(int x) { return x | 0x00F0000; }
_foo:
.LBB_foo_0: ; entry
; IMPLICIT_DEF
oris r3, r3, 15
blr
llvm-svn: 16826
loading a 32bit constant into a register whose low halfword is all zeroes.
We now omit the ori after the lis for the following C code:
int bar(int y) { return y * 0x00F0000; }
_bar:
.LBB_bar_0: ; entry
; IMPLICIT_DEF
lis r2, 15
mullw r3, r3, r2
blr
llvm-svn: 16825
exponential behavior (bork!). This patch processes stuff with an
explicit SCC finder, allowing the algorithm to be more clear,
efficient, and also (as a bonus) correct! This gets us back to taking
0.6s to disassemble my horrible .bc file that previously took something
> 30 mins.
llvm-svn: 16811
* Instead of handling dead functions specially, just nuke them.
* Be more aggressive about cleaning up after constification, in
particular, handle getelementptr instructions and constantexprs.
* Be a little bit more structured about how we process globals.
*** Delete globals that are only stored to, and never read. These are
clearly not useful, so they should go. This implements deadglobal.llx
This last one triggers quite a few times. In particular, 2208 in the
external tests, 1865 of which are in 252.eon. This shrinks eon from
1995094 to 1732341 bytes of bytecode.
llvm-svn: 16802
simplifications of the resultant program to avoid making later passes
do it all.
This allows us to constify globals that just have the same constant that
they are initialized stored into them.
Surprisingly this comes up ALL of the freaking time, dozens of times in
SPEC, 30 times in vortex alone.
For example, on 256.bzip2, it allows us to constify these two globals:
%smallMode = internal global ubyte 0 ; <ubyte*> [#uses=8]
%verbosity = internal global int 0 ; <int*> [#uses=49]
Which (with later optimizations) results in the bytecode file shrinking
from 82286 to 69686 bytes! Let's hear it for IPO :)
For the record, it's nuking lots of "if (verbosity > 2) { do lots of stuff }"
code.
llvm-svn: 16793
(PromoteAbstractToConcrete), and to use a set to avoid recomputation.
In particular, this set eliminates the potentially exponential cases
from this little recursive algorithm.
On a particularly nasty testcase, llvm-dis on the .bc file went from 34
minutes (which is when I killed it, it still hadn't finished) to 0.57s.
Remember kids, exponential algorithms are bad.
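The effect of the set is easy to see on a toy graph. The sketch below is not the type-resolution code; it assumes a generic recursive walk over a DAG where shared children are reachable along many paths, and the names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical DAG: g[i] lists the children of node i.
using Graph = std::vector<std::vector<int>>;

// Naive recursion: a shared child is reprocessed once per path to it.
inline std::size_t visitNaive(const Graph &g, int n) {
  std::size_t processed = 1;
  for (int c : g[n])
    processed += visitNaive(g, c);
  return processed;
}

// With a set of already-seen nodes, each node is processed only once.
inline std::size_t visitOnce(const Graph &g, int n, std::set<int> &seen) {
  if (!seen.insert(n).second)
    return 0; // already handled via another path
  std::size_t processed = 1;
  for (int c : g[n])
    processed += visitOnce(g, c, seen);
  return processed;
}
```

On a chain of 21 nodes where each node references the next one twice, the naive walk processes 2^21 - 1 nodes and the set-based walk processes 21.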
llvm-svn: 16772
t:
mov %EDX, DWORD PTR [%ESP + 4]
mov %ECX, 2
mov %EAX, %EDX
sar %EDX, 31
idiv %ECX
mov %EAX, %EDX
ret
Generate:
t:
mov %ECX, DWORD PTR [%ESP + 4]
*** mov %EAX, %ECX
cdq
and %ECX, 1
xor %ECX, %EDX
sub %ECX, %EDX
*** mov %EAX, %ECX
ret
Note that the two marked moves are redundant, and should be eliminated by the
register allocator, but aren't.
Compare this to GCC, which generates:
t:
mov %eax, DWORD PTR [%esp+4]
mov %edx, %eax
shr %edx, 31
lea %ecx, [%edx+%eax]
and %ecx, -2
sub %eax, %ecx
ret
or ICC 8.0, which generates:
t:
movl 4(%esp), %ecx #3.5
movl $-2147483647, %eax #3.25
imull %ecx #3.25
movl %ecx, %eax #3.25
sarl $31, %eax #3.25
addl %ecx, %edx #3.25
subl %edx, %eax #3.25
addl %eax, %eax #3.25
negl %eax #3.25
subl %eax, %ecx #3.25
movl %ecx, %eax #3.25
ret #3.25
We would be in great shape if not for the moves.
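Judging from the listings, the operation is a signed remainder by 2. The branch-free sequence can be checked against `x % 2` with a short sketch; `remBy2` is an invented name, and the sign mask is built with a comparison instead of relying on an arithmetic right shift.

```cpp
#include <cstdint>

// Signed x % 2 without a divide, following the and/xor/sub sequence.
inline int32_t remBy2(int32_t x) {
  int32_t sign = x < 0 ? -1 : 0; // cdq: all ones if x is negative
  int32_t r = x & 1;             // and %ECX, 1
  r ^= sign;                     // xor %ECX, %EDX
  r -= sign;                     // sub %ECX, %EDX
  return r;
}
```

For negative odd inputs the xor/sub pair turns 1 into -1, which matches the truncating semantics of idiv.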
llvm-svn: 16763
an instruction if it can be hoisted to a common dominator of the block.
This implements: test/Regression/Transforms/TailDup/MergeTest.ll
llvm-svn: 16758
previously temporary NULLCOMP implementation that merely copies the data
verbatim without compression. Also, don't warn if there's no compression
library as that is taken care of during configuration time.
llvm-svn: 16654
mapping of files. This first version uses mmap where it's available. The
class needs to implement an alternate mechanism based on malloc'd memory
and file reading/writing for platforms without virtual memory.
llvm-svn: 16649
old and broken AT&T syntax assemblers. The problem with this hack is that
*SOME* forms of the fdiv and fsub instructions have the 'r' bit inverted.
This was a real pain to figure out, but is trivially easy to support: thus
we are now bug compatible with gas and gcc.
llvm-svn: 16644
Intel and AT&T style assembly language. The ultimate goal of this is to
eliminate the GasBugWorkaroundEmitter class, but for now AT&T style emission
is not fully operational.
llvm-svn: 16639
hopefully lead to the death of the 'GasBugWorkaroundEmitter'. This also
includes changes to wrap the whole file to 80 columns! Woot! :)
Note that the AT&T style output has not been tested at all.
llvm-svn: 16638
it was a use, def, or both. This allows us to be less pessimistic in our
analysis of them. In practice, this doesn't make a big difference, but it
doesn't hurt either.
llvm-svn: 16632
and delete them if they turn out to be dead. This is a useful little hack
that even speeds up some programs. For example, it speeds up Ptrdist/ks
from 17.53s to 15.59s, and 188.ammp from 149s to 146s.
This also speeds up llc :)
llvm-svn: 16630
generated code over the simple spiller. The new local spiller generates
substantially better code than the simple one in some cases, by reusing
values that are loaded out of stack slots and kept available in registers.
This primarily helps programs that are spilling a lot, and there is still
stuff that can be done to improve it. This patch makes the local spiller
the default, as it's only a tiny bit slower than the simple spiller (it
increases the runtime of llc by < 1%).
Here are some numbers with speedups.
Program       #reuse   old(s)    new(s)   Speedup
Povray          3452    16.87 ->  15.93   (5.5%)
177.mesa        2176     2.77 ->   2.76   (0%)
179.art           35    28.43 ->  28.01   (1.5%)
183.equake        55    61.44 ->  61.41   (0%)
188.ammp         869   174    -> 149      (15%)
164.gzip          43    40.73 ->  40.71   (0%)
175.vpr          351    18.54 ->  17.34   (6.5%)
176.gcc         2471     5.01 ->   4.92   (1.8%)
181.mcf           42    79.30 ->  75.20   (5.2%)
186.crafty       484    29.73 ->  30.04   (-1%)
197.parser       251    10.47 ->  10.67   (-1%)
252.eon         1501     1.98 ->   1.75   (12%)
253.perlbm      1183    14.83 ->  14.42   (2.8%)
254.gap          825     7.46 ->   7.29   (2.3%)
255.vortex       285    10.51 ->  10.27   (2.3%)
256.bzip2         63    55.70 ->  55.20   (0.9%)
300.twolf        830    21.63 ->  22.00   (-1%)
PtrDist/ks        14    32.75 ->  17.53   (46.5%)
Olden/tsp         46     8.71 ->   8.24   (5.4%)
Free/distray      70     1.09 ->   0.99   (9.2%)
llvm-svn: 16629
two spillers produce perfectly identical code (at least on povray and eon),
but the simple spiller is substantially faster than the local spiller. Once
the local spiller is improved, we can switch back.
Switching cuts 5.2% off of the llc time for povray (about 1.3s).
llvm-svn: 16608
use a simple vector. This speeds up -spiller=simple from taking 22s to taking
.1s on povray (debug build). This change does not modify the generated code.
llvm-svn: 16607
won't work if not compiled in V9 mode, currently by GCC only, because Sun's
system compiler does not tell us if it's a V8 or V9 system.
llvm-svn: 16602
This method is linear time in the size of the basic block, which is very
bad for large basic blocks. On the Assembler/2004-09-29-VerifierIsReallySlow.llx
testcase, the verifier changes from taking 50s to 0.23s with this patch.
llvm-svn: 16593
* SubOne/AddOne functions always return ConstantInt, declare them as such
* Pull code for handling setcc X, cst, where cst is at the end of the range,
or cc is LE or GE up earlier in visitSetCondInst. This reduces #iterations
in some cases.
* Fold: (div X, C1) op C2 -> range check, implementing div.ll:test6 - test9.
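As an illustration of the last fold, here is the unsigned equality case only: `X / C1 == C2` holds exactly when X lies in `[C1*C2, C1*C2 + C1)`. This is a sketch under that restriction (unsigned, `==`), not the instcombine code, and `divEqualsAsRange` is a made-up name; the bounds are computed in 64 bits so they cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Range-check form of (x / c1) == c2 for unsigned 32-bit values.
inline bool divEqualsAsRange(uint32_t x, uint32_t c1, uint32_t c2) {
  assert(c1 != 0 && "division by zero is not foldable");
  uint64_t lo = static_cast<uint64_t>(c1) * c2; // smallest x with x/c1 == c2
  uint64_t hi = lo + c1;                        // one past the largest
  return x >= lo && x < hi;
}
```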
llvm-svn: 16588