LLVM generates inefficient IR - C++

I'm playing with LLVM and tried to compile some simple C++ code with it:
#include <stdio.h>
#include <stdlib.h>
int main()
{
int test = rand();
if (test % 2)
test += 522;
else
test *= 333;
printf("test %d\n", test);
}
In particular, I wanted to test how LLVM treats code branches.
The result I got is very strange: it produces the correct output when executed, but looks inefficient:
; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #0 {
%1 = tail call i32 @rand() #3
%2 = and i32 %1, 1
%3 = icmp eq i32 %2, 0
%4 = add nsw i32 %1, 522
%5 = mul nsw i32 %1, 333
%6 = select i1 %3, i32 %5, i32 %4
%7 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([9 x i8], [9 x i8]* @.str, i64 0, i64 0), i32 %6)
ret i32 0
}
It looks like it executes both paths even though only one is needed.
My question is: shouldn't LLVM generate labels (real branches) in this case, and why doesn't it?
Thank you
P.S. I'm using http://ellcc.org/demo/index.cgi for this test

Branches can be expensive, so generating code without branches, at the cost of one unnecessary add or mul instruction, will usually work out to be faster in practice.
If you make the branches of your if longer, you'll see that it'll eventually become a proper branch instead of a select.
The compiler tends to have a good understanding of which option is faster in which case, so I'd trust it unless you have specific benchmarks that show the version with select to be slower than a version that branches.
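For example, here is a sketch of my own (not from the original post) of one way to force a real branch: give one arm a side effect that cannot be executed speculatively, such as another rand() call.
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int test = rand();
    if (test % 2) {
        test += 522;
    } else {
        test *= 333;
        test += rand(); /* this call has a side effect, so the compiler
                           cannot execute both arms and select a result */
    }
    printf("test %d\n", test);
}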

Related

Clang sizeof("literal") optimization

Experimenting with C++, I tried to understand the performance difference between sizeof and strlen for a string literal.
Here is my small benchmark code:
#include <iostream>
#include <cstring>
#define LOOP_COUNT 1000000000
unsigned long long rdtscl(void)
{
unsigned int lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main()
{
unsigned long long before = rdtscl();
size_t ret;
for (int i = 0; i < LOOP_COUNT; i++)
ret = strlen("abcd");
unsigned long long after = rdtscl();
std::cout << "Strlen " << (after - before) << " ret=" << ret << std::endl;
before = rdtscl();
for (int i = 0; i < LOOP_COUNT; i++)
ret = sizeof("abcd");
after = rdtscl();
std::cout << "Sizeof " << (after - before) << " ret=" << ret << std::endl;
}
Compiling with clang++, I get the following result:
clang++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 36 ret=4
Sizeof 62092396 ret=5
With g++:
g++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 30 ret=4
Sizeof 30 ret=5
I strongly suspect that g++ optimizes away the loop with sizeof and clang++ doesn't.
Is this result a known issue?
EDIT:
The assembly generated by clang++ for the loop with sizeof:
rdtsc
mov %edx,%r14d
shl $0x20,%r14
mov $0x3b9aca01,%ecx
xchg %ax,%ax
add $0xffffffed,%ecx // 0x400ad0
jne 0x400ad0 <main+192>
mov %eax,%eax
or %rax,%r14
rdtsc
And the one by g++:
rdtsc
mov %edx,%esi
mov %eax,%ecx
rdtsc
I don't understand why clang++ emits the {add, jne} loop; it seems useless. Is it a bug?
For information:
g++ (GCC) 5.1.0
clang version 3.6.2 (tags/RELEASE_362/final)
EDIT2:
It is likely to be a bug in clang.
I opened a bug report.
I'd call that a bug in clang.
It's actually optimising the sizeof itself, just not the loop.
To make the code much clearer, I changed std::cout to printf, and then you get the following LLVM IR code for main:
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
entry:
%0 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i = extractvalue { i32, i32 } %0, 1
%conv2.i = zext i32 %asmresult1.i to i64
%shl.i = shl nuw i64 %conv2.i, 32
%asmresult.i = extractvalue { i32, i32 } %0, 0
%conv.i = zext i32 %asmresult.i to i64
%or.i = or i64 %shl.i, %conv.i
%1 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.25 = extractvalue { i32, i32 } %1, 0
%asmresult1.i.26 = extractvalue { i32, i32 } %1, 1
%conv.i.27 = zext i32 %asmresult.i.25 to i64
%conv2.i.28 = zext i32 %asmresult1.i.26 to i64
%shl.i.29 = shl nuw i64 %conv2.i.28, 32
%or.i.30 = or i64 %shl.i.29, %conv.i.27
%sub = sub i64 %or.i.30, %or.i
%call2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub, i64 4)
%2 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i.32 = extractvalue { i32, i32 } %2, 1
%conv2.i.34 = zext i32 %asmresult1.i.32 to i64
%shl.i.35 = shl nuw i64 %conv2.i.34, 32
br label %for.cond.5
for.cond.5: ; preds = %for.cond.5, %entry
%i4.0 = phi i32 [ 0, %entry ], [ %inc10.18, %for.cond.5 ]
%inc10.18 = add nsw i32 %i4.0, 19
%exitcond.18 = icmp eq i32 %inc10.18, 1000000001
br i1 %exitcond.18, label %for.cond.cleanup.7, label %for.cond.5
for.cond.cleanup.7: ; preds = %for.cond.5
%asmresult.i.31 = extractvalue { i32, i32 } %2, 0
%conv.i.33 = zext i32 %asmresult.i.31 to i64
%or.i.36 = or i64 %shl.i.35, %conv.i.33
%3 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.37 = extractvalue { i32, i32 } %3, 0
%asmresult1.i.38 = extractvalue { i32, i32 } %3, 1
%conv.i.39 = zext i32 %asmresult.i.37 to i64
%conv2.i.40 = zext i32 %asmresult1.i.38 to i64
%shl.i.41 = shl nuw i64 %conv2.i.40, 32
%or.i.42 = or i64 %shl.i.41, %conv.i.39
%sub13 = sub i64 %or.i.42, %or.i.36
%call14 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub13, i64 5)
ret i32 0
}
As you can see, the call to printf is using the constant 5 from sizeof, and the for.cond.5: starts the empty loop:
a "phi" node (which selects the "new" value of i based on where we came from - before loop -> 0, inside loop -> %inc10.18)
an increment
a conditional branch that jumps back if %inc10.18 is not 1000000001.
I don't know enough about clang and LLVM to explain why that empty loop isn't optimised away. But it's certainly not the sizeof that is taking time, as there is no sizeof inside the loop.
It's worth noting that sizeof is ALWAYS a constant at compile-time, it NEVER "takes time" beyond loading a constant value into a register.
The difference is that sizeof() is not a function call: the value it "returns" is known at compile time.
strlen, on the other hand, is a function call (executed at run time, apparently) and is not optimized here: it scans the string for '\0', and it doesn't have a single clue whether it was given a dynamically allocated string or a string literal.
So sizeof is expected to always be faster for string literals.
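As a side note, this is also why the two loops print different ret values above (4 vs 5); a quick sketch of my own illustrating the difference:
#include <cstdio>
#include <cstring>

int main()
{
    // sizeof yields the size of the array object "abcd", including
    // the terminating '\0', as a compile-time constant.
    std::printf("sizeof = %zu\n", sizeof("abcd"));      // prints 5
    // strlen counts characters up to (not including) the '\0';
    // it is a run-time call unless the compiler folds it.
    std::printf("strlen = %zu\n", std::strlen("abcd")); // prints 4
}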
I am not an expert, but your results might be explained by scheduling algorithms or overflow in your time variables.

How much space for a LLVM trampoline

I'm trying to figure out how to use the trampoline intrinsics in LLVM. The documentation makes mention of some amount of storage that's needed to store the trampoline in, which is platform dependent. My question is, how do I figure out how much is needed?
I found this example, which picks 32 bytes for no apparent reason. How does one choose a good value?
declare void @llvm.init.trampoline(i8*, i8*, i8*);
declare i8* @llvm.adjust.trampoline(i8*);
define i32 @foo(i32* nest %ptr, i32 %val)
{
%x = load i32* %ptr
%sum = add i32 %x, %val
ret i32 %sum
}
define i32 @main(i32, i8**)
{
%closure = alloca i32
store i32 13, i32* %closure
%closure_ptr = bitcast i32* %closure to i8*
%tramp_buf = alloca [32 x i8], align 4
%tramp_ptr = getelementptr [32 x i8]* %tramp_buf, i32 0, i32 0
call void @llvm.init.trampoline(
i8* %tramp_ptr,
i8* bitcast (i32 (i32*, i32)* @foo to i8*),
i8* %closure_ptr)
%ptr = call i8* @llvm.adjust.trampoline(i8* %tramp_ptr)
%fp = bitcast i8* %ptr to i32(i32)*
%val2 = call i32 %fp (i32 13)
; %val = call i32 @foo(i32* %closure, i32 42);
ret i32 %val2
}
Yes, trampolines are used to generate some code on the fly. It's unclear why you need these intrinsics at all; they are used to implement GCC's nested functions extension (in particular, when the address of a nested function is captured and the function accesses stuff inside the enclosing function).
The best way to figure out the necessary size and alignment of the trampoline buffer is to grep the GCC sources for TRAMPOLINE_SIZE and TRAMPOLINE_ALIGNMENT.
As far as I can see, at the time of this writing, a buffer of 72 bytes with an alignment of 16 bytes is enough for all the platforms GCC / LLVM supports.
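For illustration, a small C++ sketch of my own of what such a conservatively sized buffer would look like on the host side (the struct name is hypothetical):
#include <cstdint>

// Hypothetical buffer using the conservative figures quoted above:
// 72 bytes with 16-byte alignment (double-check against your GCC's
// TRAMPOLINE_SIZE / TRAMPOLINE_ALIGNMENT before relying on them).
struct TrampolineBuffer {
    alignas(16) std::uint8_t bytes[72];
};

static_assert(alignof(TrampolineBuffer) == 16,
              "trampoline buffer must be 16-byte aligned");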

How can you print an instruction in LLVM

From an LLVM pass, I need to print an LLVM instruction (of type llvm::Instruction) on the screen, just as it appears in the LLVM bitcode file. Actually, my compilation is crashing and never reaches the point where the bitcode file is generated, so for debugging I want to print some instructions to see what is going wrong.
Assuming I is your instruction:
I.print(errs());
By simply using the print method.
For a simple Hello World program, using C++'s range-based for loops, you can do something like this:
for(auto& B: F){
for(auto& I: B){
errs() << I << "\n";
}
}
This gives the output:
%3 = alloca i32, align 4
%4 = alloca i8**, align 8
store i32 %0, i32* %3, align 4
store i8** %1, i8*** %4, align 8
%5 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([15 x i8], [15 x i8]* @.str, i64 0, i64 0))
ret i32 0
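If you also need the surrounding boilerplate, here is a minimal sketch of my own of a legacy FunctionPass wrapping that loop (the registration name "print-insts" is made up; adapt it to your pass setup):
#include "llvm/IR/Function.h"
#include "llvm/Pass.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Minimal legacy pass that prints every instruction it visits.
struct PrintInsts : public FunctionPass {
    static char ID;
    PrintInsts() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
        for (auto &B : F)
            for (auto &I : B)
                errs() << I << "\n";
        return false; // we only read the IR, never modify it
    }
};
}

char PrintInsts::ID = 0;
static RegisterPass<PrintInsts> X("print-insts", "Print all instructions");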

Does ampersand of reference data type return the address of the reference itself or the value it is referencing?

I wish I could have tested it myself, but unfortunately it's impossible to do so in the current situation I'm in. Could anyone care to shed some light on this?
It returns the address of the value it is referencing.
Unlike a pointer, a reference isn't actually an object. It doesn't have an address itself; it's easier to think of it as an alias for an object, which the compiler then resolves.
Anyway, you can always test code online with IdeOne.
#include <stdio.h>
void func(int &z)
{
printf("&z: %p", &z);
}
int main(int argc, char **argv)
{
int x = 0;
int &y = x;
printf("&x: %p\n", &x);
printf("&y: %p\n", &y);
func(x);
return 0;
}
Executed online, with results.
Address of the value.
(There is no such thing as the address of a reference.)
You can always check on codepad: http://codepad.org/I1aBfWAQ
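Or, as a self-checking variant of the same point (my own sketch):
#include <cassert>

int main()
{
    int x = 0;
    int &y = x;
    assert(&y == &x); // &y yields x's address; the reference itself
                      // is not an object whose address you can take
}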
A further experiment: there are online compilers that allow inspection of intermediate results!
Taking Mahmoud's example on the Try out LLVM and Clang page, we get the following LLVM IR (truncated).
define i32 @main() nounwind uwtable {
%x = alloca i32, align 4
store i32 0, i32* %x, align 4, !tbaa !0
%1 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([8 x i8]* @.str, i64 0, i64 0), i32* %x)
%2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([8 x i8]* @.str1, i64 0, i64 0), i32* %x)
%3 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([7 x i8]* @.str2, i64 0, i64 0), i32* %x) nounwind
ret i32 0
}
... the syntax is not too important; what matters is that a single variable has been declared: %x (whose name is suspiciously similar to that of the C variable).
This means that the compiler has elided the y and z variables; no storage is ever allocated for them.
I will spare you the assembly listing ;)

LLVM Code generation causing seg fault?

I am interested in language creation and compiler construction, and have been working through the example here: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/. The author was using LLVM 2.6, and after making a couple of changes for LLVM 2.7, I got all the code generation code to compile. When feeding the compiler the test code,
int do_math( int a ) {
int x = a * 5 + 3
}
do_math( 10 )
the program works correctly until it tries to run the generated code, at which point it segfaults. I am in the process of building LLDB on my system, but in the meantime, does anyone see an obvious cause for the segfault in this LLVM assembly?
; ModuleID = 'main'
define internal void @main() {
entry:
%0 = call i64 @do_math(i64 10) ; <i64> [#uses=0]
ret void
}
define internal i64 @do_math(i64) {
entry:
%a = alloca i64 ; <i64*> [#uses=1]
%x = alloca i64 ; <i64*> [#uses=1]
%1 = add i64 5, 3 ; <i64> [#uses=1]
%2 = load i64* %a ; <i64> [#uses=1]
%3 = mul i64 %2, %1 ; <i64> [#uses=1]
store i64 %3, i64* %x
ret void
}
The output is just:
Segmentation fault
My arch is OS X x86_64.
Thanks.
I got the same problem. I stripped down Loren's compiler and everything was working fine except execution.
The segmentation fault was caused by the fact that:
ExecutionEngine *ee = EngineBuilder(module).create();
returns NULL. To see the actual error, you need to get the error string:
std::string error;
ExecutionEngine *ee = EngineBuilder(module).setErrorStr(&error).create();
In your case you should probably see:
"Unable to find target for this triple (no targets are registered)"
To fix that you need to call
InitializeNativeTarget();
But if you get:
JIT has not been linked in.
You should include:
llvm/ExecutionEngine/MCJIT.h
which will link in the JIT engine.
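Putting the pieces together, here is a minimal sketch of the engine setup (my own, not from the toy compiler; note that in newer LLVM versions EngineBuilder takes a std::unique_ptr<llvm::Module> instead of a raw pointer):
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h" // linking this in provides the JIT
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"

#include <string>

// `module` is assumed to be the llvm::Module your compiler already built.
llvm::ExecutionEngine *createEngine(llvm::Module *module)
{
    llvm::InitializeNativeTarget();           // register the native target
    llvm::InitializeNativeTargetAsmPrinter(); // MCJIT also needs the asm printer

    std::string error;
    llvm::ExecutionEngine *ee =
        llvm::EngineBuilder(module).setErrorStr(&error).create();
    if (!ee)
        llvm::errs() << "Failed to create ExecutionEngine: " << error << "\n";
    return ee;
}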
The LLVM ASM you posted isn't a correct translation of the C code you presented. You're allocating %a as a stack variable, and then loading uninitialized data from it and using it. What you want to be doing is naming your argument %a and using that value. Try using this code instead:
define internal i64 @do_math(i64 %a) {
entry:
%x = alloca i64 ; <i64*> [#uses=1]
%1 = add i64 5, 3 ; <i64> [#uses=1]
%2 = mul i64 %a, %1 ; <i64> [#uses=2]
store i64 %2, i64* %x
ret i64 %2
}
Also, your main() prototype might not match what your C runtime library expects. And, beyond that, you do realize that you're not returning the result from do_math(), right?