I'm writing a function pass in LLVM which generates an IR file. The problem is that the assembled code does not behave as I expect. Since I'm pretty new to LLVM, I'd like to know whether I have misunderstood the LLVM IR semantics or whether this is incorrect behavior from llc.
The LLVM IR is:
define void @fff(i32*) #0 {
%2 = alloca i32*, align 8
%3 = alloca i32, align 4
%4 = load i8*, i8** @dirty
br label %5
; <label>:5: ; preds = %1
store i32* %0, i32** %2, align 8
%6 = load i32*, i32** %2, align 8
%7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str.4, i32 0, i32 0), i32* %6)
%8 = load i32*, i32** %2, align 8
%9 = load i32, i32* %8, align 4
%readDirty = load atomic i8, i8* %4 acquire, align 8
%10 = icmp eq i8 %readDirty, 1
br i1 %10, label %Restart, label %11, !prof !3
; <label>:11: ; preds = %5
store i32 %9, i32* %3, align 4
ret void
Restart: ; preds = %5
;EDIT: bug was here. Must include label %5 as a possible destination block
indirectbr i8* blockaddress(@fff, %5), []
}
This corresponds (roughly) to the following C code:
char *dirty=1;
void fff(int *head) ATTR{
restart:
printf("head = %p\n", head);
int r = *head;
if(*dirty)
goto restart; //But using indirect branch
}
Next I assemble, link and run using:
llc -filetype=obj simpleOut.ll -o out.o
gcc -o exe out.o
./exe
If I call the function with address 0x7ffeea51d7a8, it prints:
head = 0x7ffeea51d7a8
head = 0x2e889e825bf4005c
Segmentation fault: 11
The x86_64 assembly code is:
;head resides in rcx
100000d60: 55 pushq %rbp
100000d61: 48 89 e5 movq %rsp, %rbp
100000d64: 53 pushq %rbx
100000d65: 48 83 ec 18 subq $24, %rsp
100000d69: 48 89 f9 movq %rdi, %rcx
100000d6c: 48 8d 3d dd 02 00 00 leaq 733(%rip), %rdi
100000d73: ff 17 callq *(%rdi)
100000d75: 48 8b 18 movq (%rax), %rbx
100000d78: 48 8d 3d c0 01 00 00 leaq 448(%rip), %rdi
100000d7f: 48 89 4d f0 movq %rcx, -16(%rbp)
100000d83: 48 8b 75 f0 movq -16(%rbp), %rsi
100000d87: b0 00 movb $0, %al
100000d89: e8 62 01 00 00 callq 354 ;call to printf, clobbers rcx
100000d8e: 48 8b 45 f0 movq -16(%rbp), %rax
100000d92: 8b 00 movl (%rax), %eax
100000d94: 80 3b 01 cmpb $1, (%rbx)
100000d97: 74 0a je 10 <_fff+0x43>
100000d99: 89 45 ec movl %eax, -20(%rbp)
100000d9c: 48 83 c4 18 addq $24, %rsp
100000da0: 5b popq %rbx
100000da1: 5d popq %rbp
100000da2: c3 retq
100000da3: 48 8d 05 ce ff ff ff leaq -50(%rip), %rax
100000daa: ff e0 jmpq *%rax ;jumps to 100000d78
100000dac: 0f 1f 40 00 nopl (%rax)
The problem seems to be that the LLVM statement store i32* %0, i32** %2, align 8 translates to movq %rcx, -16(%rbp) even after the restart, by which point the register rcx has already been clobbered by the printf call.
If this seems like a bug I'll file a bug report with LLVM. Just wanted to check that I don't misunderstand the LLVM IR.
llc version is 5.0.0, installed via homebrew. gcc (used for linking) is clang-900.0.39.2.
Thanks
According to the documentation, the indirectbr instruction must be supplied with the full list of possible destination blocks; omitting a basic block that is actually jumped to produces undefined behavior.
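If the IR is being emitted from a pass with IRBuilder, the fix amounts to registering every block the address may refer to on the IndirectBrInst. A minimal sketch, assuming IRBuilder is used and with placeholder names (Loop standing in for block %5, Restart for the restart block):

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Emit the back edge as an indirectbr and explicitly list the block the
// address refers to; without addDestination the branch has no legal successor.
void emitRestartBranch(Function *F, BasicBlock *Loop, BasicBlock *Restart) {
  IRBuilder<> B(Restart);
  Value *Target = BlockAddress::get(F, Loop);          // blockaddress(@fff, %5)
  IndirectBrInst *IBr = B.CreateIndirectBr(Target, /*NumDests=*/1);
  IBr->addDestination(Loop);                           // list of possible destinations
}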
Can using gotos instead of loops result in a series of jump instructions more efficient than what the compiler would have generated if loops had been used instead?
For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case, could using goto actually outsmart the jump instructions the compiler generates when using just loops and no gotos?
It may be possible to gain a small speed advantage by using goto. However, the opposite may also be true. Compilers have become very good at detecting and unrolling loops, or at optimizing loops with SIMD instructions. You will most likely kill all those optimization options for the compiler, since they are not built to optimize goto statements.
You can also write functions instead of using gotos. This way you enable the compiler to inline the function and get rid of the jump.
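For instance (a made-up sketch, not code from the question): the classic "goto to break out of nested loops" can be replaced by a small helper with an early return, and the compiler is still free to inline that helper at the call site:

#include <cstddef>

// Replaces:  for (...) for (...) if (row[j] == needle) goto found;
static inline bool row_contains(const int *row, std::size_t n, int needle) {
    for (std::size_t j = 0; j < n; ++j)
        if (row[j] == needle) return true;
    return false;
}

bool matrix_contains(const int *m, std::size_t rows, std::size_t cols, int needle) {
    for (std::size_t i = 0; i < rows; ++i)
        if (row_contains(m + i * cols, cols, needle))
            return true;                 // the early return stands in for the goto
    return false;
}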
If you are considering goto for optimization purposes, I would say that it is a very bad idea. Express your algorithm in clean code and optimize later. And if you need more performance, think about your data and how you access it. That is where you can gain or lose performance.
Since you wanted code to prove the point, I constructed the following example. I used gcc 6.3.1 with -O3 on an Intel i7-3537U. If you try to reproduce the example, your results may differ depending on your compiler or hardware.
#include <iostream>
#include <array>
#include "ToolBox/Instrumentation/Profiler.hpp"
constexpr size_t kilo = 1024;
constexpr size_t mega = kilo * 1024;
constexpr size_t size = 512*mega;
using container = std::array<char, size>;
enum class Measurements {
jump,
loop
};
// Simple vector addition using for loop
void sum(container& result, const container& data1, const container& data2) {
profile(Measurements::loop);
for(unsigned int i = 0; i < size; ++i) {
result[i] = data1[i] + data2[i];
}
}
// Simple vector addition using jumps
void sum_jump(container& result, const container& data1, const container& data2) {
profile(Measurements::jump);
unsigned int i = 0;
label:
result[i] = data1[i] + data2[i];
i++;
if(i == size) goto label;
}
int main() {
// This segment is just for benchmarking purposes
// Just ignore this
ToolBox::Instrumentation::Profiler<Measurements, std::chrono::nanoseconds, 2> profiler(
std::cout,
{
{Measurements::jump, "jump"},
{Measurements::loop, "loop"}
}
);
// allocate memory to execute our sum functions on
container data1, data2, result;
// run the benchmark 100 times to account for caching of the data
for(unsigned i = 0; i < 100; i++) {
sum_jump(result, data1, data2);
sum(result, data1, data2);
}
}
The output of the program is the following:
Runtimes for 12Measurements
jump: 100x 2972 nanoseconds 29 nanoseconds/execution
loop: 100x 2820 nanoseconds 28 nanoseconds/execution
Ok, we see that there is essentially no time difference between the runtimes, because we are limited by memory bandwidth rather than by CPU instructions. But let's look at the assembler instructions that are generated:
Dump of assembler code for function sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
0x00000000004025c0 <+0>: push %r15
0x00000000004025c2 <+2>: push %r14
0x00000000004025c4 <+4>: push %r12
0x00000000004025c6 <+6>: push %rbx
0x00000000004025c7 <+7>: push %rax
0x00000000004025c8 <+8>: mov %rdx,%r15
0x00000000004025cb <+11>: mov %rsi,%r12
0x00000000004025ce <+14>: mov %rdi,%rbx
0x00000000004025d1 <+17>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x00000000004025d6 <+22>: mov %rax,%r14
0x00000000004025d9 <+25>: lea 0x20000000(%rbx),%rcx
0x00000000004025e0 <+32>: lea 0x20000000(%r12),%rax
0x00000000004025e8 <+40>: lea 0x20000000(%r15),%rsi
0x00000000004025ef <+47>: cmp %rax,%rbx
0x00000000004025f2 <+50>: sbb %al,%al
0x00000000004025f4 <+52>: cmp %rcx,%r12
0x00000000004025f7 <+55>: sbb %dl,%dl
0x00000000004025f9 <+57>: and %al,%dl
0x00000000004025fb <+59>: cmp %rsi,%rbx
0x00000000004025fe <+62>: sbb %al,%al
0x0000000000402600 <+64>: cmp %rcx,%r15
0x0000000000402603 <+67>: sbb %cl,%cl
0x0000000000402605 <+69>: test $0x1,%dl
0x0000000000402608 <+72>: jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
0x000000000040260e <+78>: and %cl,%al
0x0000000000402610 <+80>: and $0x1,%al
0x0000000000402612 <+82>: jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
0x0000000000402614 <+84>: xor %eax,%eax
0x0000000000402616 <+86>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000402620 <+96>: movdqu (%r12,%rax,1),%xmm0
0x0000000000402626 <+102>: movdqu 0x10(%r12,%rax,1),%xmm1
0x000000000040262d <+109>: movdqu (%r15,%rax,1),%xmm2
0x0000000000402633 <+115>: movdqu 0x10(%r15,%rax,1),%xmm3
0x000000000040263a <+122>: paddb %xmm0,%xmm2
0x000000000040263e <+126>: paddb %xmm1,%xmm3
0x0000000000402642 <+130>: movdqu %xmm2,(%rbx,%rax,1)
0x0000000000402647 <+135>: movdqu %xmm3,0x10(%rbx,%rax,1)
0x000000000040264d <+141>: movdqu 0x20(%r12,%rax,1),%xmm0
0x0000000000402654 <+148>: movdqu 0x30(%r12,%rax,1),%xmm1
0x000000000040265b <+155>: movdqu 0x20(%r15,%rax,1),%xmm2
0x0000000000402662 <+162>: movdqu 0x30(%r15,%rax,1),%xmm3
0x0000000000402669 <+169>: paddb %xmm0,%xmm2
0x000000000040266d <+173>: paddb %xmm1,%xmm3
0x0000000000402671 <+177>: movdqu %xmm2,0x20(%rbx,%rax,1)
0x0000000000402677 <+183>: movdqu %xmm3,0x30(%rbx,%rax,1)
0x000000000040267d <+189>: add $0x40,%rax
0x0000000000402681 <+193>: cmp $0x20000000,%rax
0x0000000000402687 <+199>: jne 0x402620 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+96>
0x0000000000402689 <+201>: jmp 0x4026d5 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+277>
0x000000000040268b <+203>: xor %eax,%eax
0x000000000040268d <+205>: nopl (%rax)
0x0000000000402690 <+208>: movzbl (%r15,%rax,1),%ecx
0x0000000000402695 <+213>: add (%r12,%rax,1),%cl
0x0000000000402699 <+217>: mov %cl,(%rbx,%rax,1)
0x000000000040269c <+220>: movzbl 0x1(%r15,%rax,1),%ecx
0x00000000004026a2 <+226>: add 0x1(%r12,%rax,1),%cl
0x00000000004026a7 <+231>: mov %cl,0x1(%rbx,%rax,1)
0x00000000004026ab <+235>: movzbl 0x2(%r15,%rax,1),%ecx
0x00000000004026b1 <+241>: add 0x2(%r12,%rax,1),%cl
0x00000000004026b6 <+246>: mov %cl,0x2(%rbx,%rax,1)
0x00000000004026ba <+250>: movzbl 0x3(%r15,%rax,1),%ecx
0x00000000004026c0 <+256>: add 0x3(%r12,%rax,1),%cl
0x00000000004026c5 <+261>: mov %cl,0x3(%rbx,%rax,1)
0x00000000004026c9 <+265>: add $0x4,%rax
0x00000000004026cd <+269>: cmp $0x20000000,%rax
0x00000000004026d3 <+275>: jne 0x402690 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+208>
0x00000000004026d5 <+277>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x00000000004026da <+282>: sub %r14,%rax
0x00000000004026dd <+285>: add %rax,0x202b74(%rip) # 0x605258 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE14totalTimeSpentE>
0x00000000004026e4 <+292>: incl 0x202b76(%rip) # 0x605260 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE10executionsE>
0x00000000004026ea <+298>: add $0x8,%rsp
0x00000000004026ee <+302>: pop %rbx
0x00000000004026ef <+303>: pop %r12
0x00000000004026f1 <+305>: pop %r14
0x00000000004026f3 <+307>: pop %r15
0x00000000004026f5 <+309>: retq
End of assembler dump.
As we can see, the compiler has vectorized the loop and uses simd instructions (e.g. paddb).
Now the version with the jumps:
Dump of assembler code for function sum_jump(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
0x0000000000402700 <+0>: push %r15
0x0000000000402702 <+2>: push %r14
0x0000000000402704 <+4>: push %r12
0x0000000000402706 <+6>: push %rbx
0x0000000000402707 <+7>: push %rax
0x0000000000402708 <+8>: mov %rdx,%rbx
0x000000000040270b <+11>: mov %rsi,%r14
0x000000000040270e <+14>: mov %rdi,%r15
0x0000000000402711 <+17>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x0000000000402716 <+22>: mov %rax,%r12
0x0000000000402719 <+25>: mov (%rbx),%al
0x000000000040271b <+27>: add (%r14),%al
0x000000000040271e <+30>: mov %al,(%r15)
0x0000000000402721 <+33>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x0000000000402726 <+38>: sub %r12,%rax
0x0000000000402729 <+41>: add %rax,0x202b38(%rip) # 0x605268 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE14totalTimeSpentE>
0x0000000000402730 <+48>: incl 0x202b3a(%rip) # 0x605270 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE10executionsE>
0x0000000000402736 <+54>: add $0x8,%rsp
0x000000000040273a <+58>: pop %rbx
0x000000000040273b <+59>: pop %r12
0x000000000040273d <+61>: pop %r14
0x000000000040273f <+63>: pop %r15
0x0000000000402741 <+65>: retq
End of assembler dump.
And here we did not trigger the optimization.
You can compile the program and check the assembly code yourself with:
gdb -batch -ex 'file a.out' -ex 'disassemble sum'
I also tried this approach for a matrix multiplication, but gcc was smart enough to detect the matrix multiplication with the goto/label syntax too.
Conclusion:
Even though we did not see a speed loss here, we did see that gcc could not apply an optimization that could have sped up the computation. In more CPU-bound tasks, that may cause a drop in performance.
Could it? Sure, theoretically, if you were using an especially dumb compiler or were compiling with optimizations disabled.
In practice, absolutely not. An optimizing compiler has very little difficulty optimizing loops and switch statements. You are far more likely to confuse the optimizer with unconditional jumps than if you play by the normal rules and use looping constructs that it is familiar with and programmed to optimize accordingly.
This just goes back to a general rule that optimizers do their best work with standard, idiomatic code because that's what they have been trained to recognize. Given their extensive use of pattern matching, if you deviate from normal patterns, you are less likely to get optimal code output.
For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case
Yeah, I'm confused by reading your description of the code, but a compiler would not be. Unlike humans, compilers have no trouble with nested loops. You aren't going to confuse it by writing valid C++ code, unless the compiler has a bug, in which case all bets are off. No matter how much nesting there is, if the logical result is a jump outside of the entire block, then a compiler will emit code that does that, just as if you had written a goto of your own. If you had tried to write a goto of your own, you would be more likely to get sub-optimal code because of scoping issues, which would require the compiler to emit code that saved local variables, adjusted the stack frame, etc. There are many standard optimizations that are normally performed on loops, yet become either impossible or simply not applied if you throw a goto inside of that loop.
Furthermore, it is not the case that jumps always result in faster code, even when jumping over other instructions. In many cases on modern, heavily-pipelined, out-of-order processors, mispredicted branches amount to a significant penalty, far more than if the intervening instructions had simply been executed and their results thrown away (ignored). Optimizing compilers that are targeting a particular platform know about this, and may decide to emit branchless code for performance reasons. If you throw in a jump, you force an unconditional branch and eliminate their ability to make these decisions strategically.
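As a small illustration (my own sketch, not taken from the question): a conditional written with ordinary control flow leaves the compiler free to pick between a branch and a conditional move, whereas a hand-written jump forces the branch:

// gcc and clang at -O2 commonly lower this ternary to a branchless cmov on
// x86-64; writing the same logic with an explicit goto pins it to a jump.
int clamp_min(int x, int lo) {
    return x < lo ? lo : x;
}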
You claim that you want "an actual example instead of theory", but this is setting up a logical fallacy. It makes the assumption that you are correct and that the use of goto can indeed lead to better optimized code than a compiler could generate with looping constructs, but if that is not the case, then there is no code that can prove it. I could show you hundreds of examples of code where a goto resulted in sub-optimal object code being generated, but you could just claim that there still existed some snippet of code where the reverse is true, and there is no way that I (or anyone else) could exhaustively demonstrate that you were wrong.
On the contrary, theoretical arguments are the only way to answer this question (at least if answering in the negative). Furthermore, I would argue that a theoretical understanding is all that is required to inform one's writing of code. You should write code assuming that your optimizer is not broken, and only after determining that it actually is broken should you go back and try to figure out how to revise the code to get it to generate the output you expected. If, in some bizarre circumstance, you find that that involves a goto, then you can use it. Until that point, assume that the compiler knows how to optimize loops and write the code in the normal, readable way, assured that in the vast majority of circumstances (if not actually 100%) that the output will be better than what you could have gotten by trying to outsmart the compiler from the outset.
Goto/while/do are just high-level language constructs. Once they are lowered to the intermediate representation (IR), they disappear and everything is implemented in terms of branches.
Take for example this code
double calcsum( double* val, unsigned int count )
{
double sum = 0;
for ( unsigned int j=0; j<count; ++count ) {
sum += val[j];
}
return sum;
}
Compile it such that it generates IR:
clang++ -S -emit-llvm -O3 test.cpp # will create test.ll
Look at the generated IR language
$ cat test.ll
define double @_Z7calcsumPdj(double* nocapture readonly, i32)
local_unnamed_addr #0 {
%3 = icmp eq i32 %1, 0
br i1 %3, label %26, label %4
; <label>:4: ; preds = %2
%5 = load double, double* %0, align 8, !tbaa !1
%6 = sub i32 0, %1
%7 = and i32 %6, 7
%8 = icmp ugt i32 %1, -8
br i1 %8, label %12, label %9
; <label>:9: ; preds = %4
%10 = sub i32 %6, %7
br label %28
; <label>:11: ; preds = %28
br label %12
; <label>:12: ; preds = %11, %4
%13 = phi double [ undef, %4 ], [ %38, %11 ]
%14 = phi double [ 0.000000e+00, %4 ], [ %38, %11 ]
%15 = icmp eq i32 %7, 0
br i1 %15, label %24, label %16
; <label>:16: ; preds = %12
br label %17
; <label>:17: ; preds = %17, %16
%18 = phi double [ %14, %16 ], [ %20, %17 ]
%19 = phi i32 [ %7, %16 ], [ %21, %17 ]
%20 = fadd double %18, %5
%21 = add i32 %19, -1
%22 = icmp eq i32 %21, 0
br i1 %22, label %23, label %17, !llvm.loop !5
; <label>:23: ; preds = %17
br label %24
; <label>:24: ; preds = %12, %23
%25 = phi double [ %13, %12 ], [ %20, %23 ]
br label %26
; <label>:26: ; preds = %24, %2
%27 = phi double [ 0.000000e+00, %2 ], [ %25, %24 ]
ret double %27
; <label>:28: ; preds = %28, %9
%29 = phi double [ 0.000000e+00, %9 ], [ %38, %28 ]
%30 = phi i32 [ %10, %9 ], [ %39, %28 ]
%31 = fadd double %29, %5
%32 = fadd double %31, %5
%33 = fadd double %32, %5
%34 = fadd double %33, %5
%35 = fadd double %34, %5
%36 = fadd double %35, %5
%37 = fadd double %36, %5
%38 = fadd double %37, %5
%39 = add i32 %30, -8
%40 = icmp eq i32 %39, 0
br i1 %40, label %11, label %28
}
You can see that there are no while/do constructs anymore. Everything is branch/goto. Obviously this still gets compiled further into assembly language.
But besides this, "fast" today depends on whether the compiler can match a good optimization pass to your code. By using goto you would potentially forfeit some loop-optimization passes such as lcssa, licm, loop deletion, loop reduce, simplify, unroll, unswitch:
http://llvm.org/docs/Passes.html#lcssa-loop-closed-ssa-form-pass
I think if you are trying to optimize your code this way, you are doing something wrong.
1. Compiler optimizations can be tuned per platform; you cannot realistically do that by hand.
2. There are a lot of compilers; how can you be sure your hand optimization will not be a degradation with a different one?
3. There are so many OSes, processors and other factors that such an optimization is unlikely to hold across all of them.
4. Programming languages are made for humans, not machines. If you want that level of control, why not use machine instructions directly? You would have better control and could optimize things (taking point 3 into account).
5. ...
Anyway, if you are looking for micro-optimizations on the current platform, it is better to take full control and use asm.
I'm writing a compiler for LLVM for a language whose semantics explicitly define that a division by zero should always raise a floating point exception. The problem is, after running -mem2reg and -constprop on my raw IR, my code gets converted to:
define i32 @main() {
entry:
%t3 = sdiv i32 2, 0
ret i32 7
}
which then gets turned by llc -O0 into:
.text
.globl _c0_main
.align 16, 0x90
.type _c0_main,@function
main:
.cfi_startproc
# BB#0: # %entry
movl $7, %eax
ret
.Ltmp0:
.size main, .Ltmp0-main
.cfi_endproc
Is there a way to force llc not to remove effectful operations?
The sdiv instruction has divide by zero semantics that are undefined. If your front end language has some defined semantics for this you'll need to use instructions other than sdiv.
Perhaps you'll have to detect a divide by zero and branch at runtime to a sequence of instructions that gives the semantics you want.
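As a rough sketch of that approach (IRBuilder-based, all names illustrative; RaiseFPE stands for whatever runtime helper your language calls, and it is assumed not to return):

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Guarded signed division: branch to a trap block when the divisor is zero,
// otherwise emit the sdiv as usual.
Value *emitCheckedSDiv(IRBuilder<> &B, Function *F, Function *RaiseFPE,
                       Value *Num, Value *Den) {
  LLVMContext &Ctx = B.getContext();
  BasicBlock *TrapBB = BasicBlock::Create(Ctx, "div.by.zero", F);
  BasicBlock *OkBB   = BasicBlock::Create(Ctx, "div.ok", F);

  Value *IsZero = B.CreateICmpEQ(Den, ConstantInt::get(Den->getType(), 0));
  B.CreateCondBr(IsZero, TrapBB, OkBB);

  B.SetInsertPoint(TrapBB);
  B.CreateCall(RaiseFPE);      // language-defined "raise floating point exception"
  B.CreateUnreachable();       // helper assumed not to return

  B.SetInsertPoint(OkBB);
  return B.CreateSDiv(Num, Den);
}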
The language reference states that [d]ivision by zero leads to undefined behavior as stated by Colin.
Thus, you have to check whether the divisor is zero and explicitly generate a floating point exception. In C, this could look as follows:
extern void raiseFloatingException();
int someFunc() {
int a = 2;
int b = 0;
if (!b) {
raiseFloatingException();
}
int result = a / b;
return result;
}
If you compile this to bit code and optimize it with -mem2reg you get the pattern that you could generate with your back end:
define i32 @someFunc() #0 {
%1 = icmp ne i32 0, 0
br i1 %1, label %3, label %2
; <label>:2 ; preds = %0
call void (...)* @raiseFloatingException()
br label %3
; <label>:3 ; preds = %2, %0
%4 = sdiv i32 2, 0
ret i32 %4
}
Note that you will have to provide the raiseFloatingException function yourself and link it with your code.
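For completeness, one possible implementation of that helper (just a sketch; it assumes that raising SIGFPE is an acceptable reading of "floating point exception"):

#include <csignal>

// Runtime helper referenced from the generated IR. Raising SIGFPE mimics what
// a real integer division by zero does on most platforms.
extern "C" void raiseFloatingException() {
    std::raise(SIGFPE);
}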
I'm using MCJIT in a CPU emulator project. The IR is mainly generated with IRBuilder and it works, but the performance is worse than our old JIT. I compared the generated code; the main problem is register usage. In LLVM, I promoted 5 local variables to registers by putting allocas at the beginning of the IR function, and after running createPromoteMemoryToRegisterPass all 5 variables are promoted to virtual registers with PHI nodes in the optimized IR. But in the final x64 binary code, only 2 variables end up in registers and the other 3 spill, even though 2 registers (R11, R12) are still unused.
I've tried using llc to compile the optimized IR with the greedy/pbqp register allocators; the results are the same, only 2 get promoted.
I've spent a lot of time on this issue, any suggestions will be appreciated.
Code snippet:
IR before optimization:
define void @emulate(%struct.fpga_cpu* %pcpu, i64 %pc_arg, i64 %r0_arg, i64 %r1_arg, i64 %r2_arg, i64 %r3_arg) {
entry:
%0 = alloca i64
%1 = alloca i64
%2 = alloca i64
%3 = alloca i64
%4 = alloca i64
%5 = alloca i64
store i64 %r0_arg, i64* %2
store i64 %r1_arg, i64* %3
store i64 %r2_arg, i64* %4
store i64 %r3_arg, i64* %5
store i64 %pc_arg, i64* %0
....
}
IR after optimization:
define void @emulate(%struct.fpga_cpu* %pcpu, i64 %pc_arg, i64 %r0_arg, i64 %r1_arg, i64 %r2_arg, i64 %r3_arg) {
....
%.01662 = phi i64 [ %r2_arg, %entry ], [ %.21664, %cmp ]
%.01648 = phi i64 [ %r1_arg, %entry ], [ %.21650, %cmp ]
%.01640 = phi i64 [ %r0_arg, %entry ], [ %.21642, %cmp ]
%.01636 = phi i64 [ %r3_arg, %entry ], [ %.21638, %cmp ]
%.0 = phi i64 [ %pc_arg, %entry ], [ %.2, %cmp ]
....
}
The X64 code:
emulate:
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $72, %rsp
movq %r9, 56(%rsp) # 8-byte Spill
movq %r8, 64(%rsp) # 8-byte Spill
movq %rcx, %r15
movq %rdx, %r13
movq %rdi, %r14
....
r2_arg and r3_arg spill, but R11 and R12 are never used.
....
retq
Usually I would let the compiler do its magic of optimizing complicated logical expressions; however, in this case the compiler I have to use is not very good at this (basically all it can do is replace things like /64 with bit shifts and %512 with a bitwise AND).
Is there any tool available that can analyze expressions and produce optimized versions of them (i.e. the same way good optimizing compilers do)?
e.g. I would like to optimize the following:
int w = 2 - z/2;
int y0 = y + (((v % 512) / 64) / 4) * 8 + ((v / 512) / mb)*16;
int x0 = x + (((v % 512) / 64) % 4) * 8 * (w - 1) + ((v / 512) % mb)*8 * w;
int i = x0 * (w ^ 3) * 2 + y0 * mb * 16 * 2 + (2*z - 3) * (z/2);
Here's a test:
typedef int MyInt; // or unsigned int
MyInt get(MyInt x, MyInt y, MyInt z, MyInt v, MyInt mb)
{
MyInt w = 2 - z/2;
MyInt y0 = y + (((v % 512) / 64) / 4) * 8 + ((v / 512) / mb)*16;
MyInt x0 = x + (((v % 512) / 64) % 4) * 8 * (w - 1) + ((v / 512) % mb)*8 * w;
MyInt i = x0 * (w ^ 3) * 2 + y0 * mb * 16 * 2 + (2*z - 3) * (z/2);
return i;
}
I compiled with GCC 4.7.0 with -O3.
With int:
.LFB0:
movl %ecx, %eax
movq %r12, -24(%rsp)
.LCFI0:
movl %edx, %r12d
sarl $31, %eax
shrl $31, %r12d
movq %r13, -16(%rsp)
shrl $23, %eax
addl %edx, %r12d
movq %rbx, -40(%rsp)
leal (%rcx,%rax), %r9d
movl %r12d, %r11d
movq %r14, -8(%rsp)
sarl %r11d
movq %rbp, -32(%rsp)
.LCFI1:
movl %edx, %ebp
andl $511, %r9d
negl %r11d
subl %eax, %r9d
leal 511(%rcx), %eax
testl %ecx, %ecx
leal 2(%r11), %r13d
leal 63(%r9), %ebx
cmovns %ecx, %eax
sarl $9, %eax
movl %r13d, %r14d
xorl $3, %r14d
movl %eax, %edx
testl %r9d, %r9d
cmovns %r9d, %ebx
sarl $31, %edx
addl $1, %r11d
idivl %r8d
movl %ebx, %r10d
sarl $31, %ebx
shrl $30, %ebx
sarl $6, %r10d
addl %ebx, %r10d
andl $3, %r10d
subl %ebx, %r10d
movq -40(%rsp), %rbx
sall $3, %r10d
sall $3, %edx
imull %r11d, %r10d
imull %r13d, %edx
movq -16(%rsp), %r13
addl %edi, %r10d
addl %edx, %r10d
leal 255(%r9), %edx
imull %r10d, %r14d
testl %r9d, %r9d
cmovs %edx, %r9d
sall $4, %eax
sarl %r12d
sarl $8, %r9d
leal (%rsi,%r9,8), %ecx
addl %eax, %ecx
leal -3(%rbp,%rbp), %eax
movq -32(%rsp), %rbp
imull %r8d, %ecx
imull %r12d, %eax
movq -24(%rsp), %r12
sall $4, %ecx
addl %r14d, %ecx
movq -8(%rsp), %r14
leal (%rax,%rcx,2), %eax
ret
With unsigned int:
.LFB0:
movl %ecx, %eax
movq %rbp, -16(%rsp)
movl %edx, %r11d
.LCFI0:
movl %edx, %ebp
shrl $9, %eax
xorl %edx, %edx
divl %r8d
movq %r12, -8(%rsp)
.LCFI1:
movl %ecx, %r12d
shrl %r11d
andl $511, %r12d
movq %rbx, -24(%rsp)
.LCFI2:
movl $2, %r10d
movl %r12d, %r9d
movl $1, %ebx
subl %r11d, %r10d
shrl $6, %r9d
subl %r11d, %ebx
shrl $8, %r12d
andl $3, %r9d
sall $4, %r8d
imull %ebx, %r9d
leal (%r12,%rax,2), %eax
movq -24(%rsp), %rbx
imull %r10d, %edx
xorl $3, %r10d
movq -8(%rsp), %r12
leal (%rsi,%rax,8), %eax
addl %edx, %r9d
leal (%rdi,%r9,8), %edi
imull %eax, %r8d
leal -3(%rbp,%rbp), %eax
movq -16(%rsp), %rbp
imull %r10d, %edi
imull %r11d, %eax
addl %edi, %r8d
leal (%rax,%r8,2), %eax
ret
"Optimizing" further by folding constants manually has (predictably) no further effect.
When I want optimizations, I tend to check what Clang generates as LLVM IR. It's more readable (I find) than pure assembly.
int foo(int v, int mb, int x, int y, int z) {
int w = 2 - z/2;
// When you have specific constraints, tell the optimizer about it !
if (w < 0 || w > 2) { return 0; }
int y0 = y + (((v % 512) / 64) / 4) * 8 + ((v / 512) / mb)*16;
int x0 = x + (((v % 512) / 64) % 4) * 8 * (w - 1) + ((v / 512) % mb)*8 * w;
int i = x0 * (w ^ 3) * 2 + y0 * mb * 16 * 2 + (2*z - 3) * (z/2);
return i;
}
Is transformed into:
define i32 @foo(i32 %v, i32 %mb, i32 %x, i32 %y, i32 %z) nounwind uwtable readnone {
%1 = sdiv i32 %z, 2
%2 = sub nsw i32 2, %1
%3 = icmp slt i32 %2, 0
%4 = icmp slt i32 %z, -1
%or.cond = or i1 %3, %4
br i1 %or.cond, label %31, label %5
; <label>:5 ; preds = %0
%6 = srem i32 %v, 512
%7 = sdiv i32 %6, 64
%8 = sdiv i32 %6, 256
%9 = shl i32 %8, 3
%10 = sdiv i32 %v, 512
%11 = sdiv i32 %10, %mb
%12 = shl i32 %11, 4
%13 = add i32 %9, %y
%14 = add i32 %13, %12
%15 = srem i32 %7, 4
%16 = add nsw i32 %2, -1
%17 = mul i32 %16, %15
%18 = srem i32 %10, %mb
%19 = mul i32 %2, %18
%tmp = add i32 %19, %17
%tmp2 = shl i32 %tmp, 3
%20 = add nsw i32 %tmp2, %x
%21 = shl i32 %2, 1
%22 = xor i32 %21, 6
%23 = mul i32 %22, %20
%24 = shl i32 %mb, 5
%25 = mul i32 %24, %14
%26 = shl i32 %z, 1
%27 = add nsw i32 %26, -3
%28 = mul nsw i32 %1, %27
%29 = add i32 %25, %28
%30 = add i32 %29, %23
br label %31
; <label>:31 ; preds = %5, %0
%.0 = phi i32 [ %30, %5 ], [ 0, %0 ]
ret i32 %.0
}
I do not know whether it is optimal, but it certainly is relatively readable.
It would be great if you could indicate all your constraints on the input (all five of them if necessary) because the optimizer might be able to use them.
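For example (my own sketch, assuming GCC or Clang; the bounds below are invented, so substitute your real ones), such constraints can be stated directly in the code so the optimizer may exploit them:

typedef int MyInt;

MyInt get(MyInt x, MyInt y, MyInt z, MyInt v, MyInt mb)
{
    // Promise to the optimizer that these inputs never occur
    // (GCC/Clang builtin; the ranges are purely illustrative).
    if (mb <= 0 || v < 0 || z < 0 || z > 4)
        __builtin_unreachable();

    MyInt w = 2 - z/2;
    MyInt y0 = y + (((v % 512) / 64) / 4) * 8 + ((v / 512) / mb)*16;
    MyInt x0 = x + (((v % 512) / 64) % 4) * 8 * (w - 1) + ((v / 512) % mb)*8 * w;
    MyInt i = x0 * (w ^ 3) * 2 + y0 * mb * 16 * 2 + (2*z - 3) * (z/2);
    return i;
}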