I am interested in language creation and compiler construction, and have been working through the example here: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/. The author was using LLVM 2.6, and after making a couple of changes for LLVM 2.7, I got all of the code generation code to compile. When I feed the compiler the test code,
int do_math( int a ) {
int x = a * 5 + 3
}
do_math( 10 )
the program works correctly until it tries to run the code, at which point it segfaults. I am in the process of building LLDB on my system, but in the meantime, does anyone see an obvious cause for a segfault in this LLVM assembly?
; ModuleID = 'main'
define internal void @main() {
entry:
%0 = call i64 @do_math(i64 10) ; <i64> [#uses=0]
ret void
}
define internal i64 @do_math(i64) {
entry:
%a = alloca i64 ; <i64*> [#uses=1]
%x = alloca i64 ; <i64*> [#uses=1]
%1 = add i64 5, 3 ; <i64> [#uses=1]
%2 = load i64* %a ; <i64> [#uses=1]
%3 = mul i64 %2, %1 ; <i64> [#uses=1]
store i64 %3, i64* %x
ret void
}
The output is just:
Segmentation fault
My arch is OS X x86_64.
Thanks.
I had the same problem. I stripped down Loren's compiler and everything worked fine except execution.
The segmentation fault was caused by the fact that
ExecutionEngine *ee = EngineBuilder(module).create();
returns NULL. To see the actual error, you need to get the error string:
std::string error;
ExecutionEngine *ee = EngineBuilder(module).setErrorStr(&error).create();
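In your case you should probably see:
Unable to find target for this triple (no targets are registered)
To fix that, you need to call
InitializeNativeTarget();
before creating the engine. But if you get:
JIT has not been linked in.
you should include:
llvm/ExecutionEngine/MCJIT.h
which links in the JIT engine (on older releases such as the LLVM 2.7 used in the question, the corresponding header is llvm/ExecutionEngine/JIT.h).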
The LLVM ASM you posted isn't a correct translation of the C code you presented. You're allocating %a as a stack variable, and then loading uninitialized data from it and using it. What you want to be doing is naming your argument %a and using that value. Try using this code instead:
define internal i64 @do_math(i64 %a) {
entry:
%x = alloca i64 ; <i64*> [#uses=1]
%1 = add i64 5, 3 ; <i64> [#uses=1]
%2 = mul i64 %a, %1 ; <i64> [#uses=1]
store i64 %2, i64* %x
ret void
}
Also, your main() prototype might not match what your C runtime library expects. And, beyond that, you do realize that you're not returning the result from do_math(), right?
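For reference, here is a minimal C++ sketch of what the test program is presumably meant to compute, next to what the posted IR computes. It assumes the toy language follows C operator precedence and that the value of x is meant to be returned; both are assumptions, and the function names do_math_expected and do_math_as_emitted are hypothetical. Note that the IR adds 5 and 3 first and then multiplies, which is a different expression from a * 5 + 3:

```cpp
#include <cstdint>

// What the source expression `a * 5 + 3` means under C precedence rules.
std::int64_t do_math_expected(std::int64_t a) {
    std::int64_t x = a * 5 + 3;
    return x;  // the posted IR computes x but then executes `ret void`
}

// What the posted IR computes: %1 = add i64 5, 3, then mul of %a and %1.
std::int64_t do_math_as_emitted(std::int64_t a) {
    std::int64_t t = 5 + 3;
    return a * t;
}
```

With the input 10 from the test program, the first function yields 53 and the second yields 80, so the code generator's handling of operator precedence is worth checking as well.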
So my C code is:
#include <stdio.h>
void main(){
int a, b,c, d;
b = 18, c = 112;
b = a - d;
d = a - d;
}
and part of its IR is:
%5 = load i32, i32* %1, align 4
%6 = load i32, i32* %4, align 4
%7 = sub nsw i32 %5, %6
store i32 %7, i32* %2, align 4
%8 = load i32, i32* %1, align 4
%9 = load i32, i32* %4, align 4
%10 = sub nsw i32 %8, %9
store i32 %10, i32* %4, align 4
I have implemented the LVN (local value numbering) algorithm to detect the redundant expression, which is d = a - d. Now, for the optimization, I need to manipulate the instructions so that this becomes d = b. I am not sure how to do that with LLVM or how to manipulate the IR.
I am new to LLVM, so this might be a silly question, but I am really confused. Since LLVM works on IR, I understand that when it sees "d = a - d" it will first load a and d, but the binary operation and store instructions in the IR need to be changed so that %4 gets the value from %2. Can anyone help me check whether I understand this correctly, and explain how I can manipulate the IR to optimize the code?
First of all, let's replace your example program with one that does not invoke undefined behaviour (due to accessing uninitialized variables), so that the UB does not confuse the issue:
void f(int a, int b, int c, int d){
b = a - d;
d = a - d;
// Code that uses b and d
}
(I've also removed the two assignments as they didn't have any effect and will disappear after mem2reg anyway.)
Now to actually answer your question: Most optimizations run after the mem2reg pass, which converts memory accesses to registers where possible. This is important because, unlike memory locations, LLVM registers can only be assigned from a single point in the source, so mem2reg turns the code into SSA form, which is required for many optimizations to work.
If we apply mem2reg to the example code, we get:
define void @f(i32, i32, i32, i32) #0 {
%5 = sub nsw i32 %0, %3
%6 = sub nsw i32 %0, %3
; Code that uses b and d
}
So now we'd apply your analysis to find out that %6 is equivalent to %5. With that information we can remove the definition of %6 and replace all occurrences of %6 with %5 (note that this would be more complicated if %5 and %6 were in different basic blocks where one didn't dominate the other). To do that you can find all uses of %6 using the uses() method, which tells you which instructions use %6 and as which operand. Then you can just set that operand to refer to %5 instead.
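The replacement step can be sketched on a toy SSA representation in plain C++. This is not the LLVM API: the Inst struct, the valueNumber function, and the integer operand encoding are hypothetical and exist only for illustration. In an actual LLVM pass, Value::replaceAllUsesWith followed by Instruction::eraseFromParent performs the equivalent rewrite.

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Toy SSA instruction: result id, opcode, and two operand ids.
// Ids not defined by any instruction stand for function arguments.
struct Inst {
    int result;
    std::string op;
    int lhs;
    int rhs;
    bool dead;  // set once the definition has been found redundant
};

// Local value numbering over a single basic block: when an instruction
// recomputes an expression that is already available, later uses are
// redirected to the first result and the duplicate is marked dead.
void valueNumber(std::vector<Inst>& block) {
    std::map<std::tuple<std::string, int, int>, int> available;
    std::map<int, int> replacement;  // redundant id -> canonical id

    for (Inst& inst : block) {
        // Rewrite operands that refer to a redundant definition.
        auto l = replacement.find(inst.lhs);
        if (l != replacement.end()) inst.lhs = l->second;
        auto r = replacement.find(inst.rhs);
        if (r != replacement.end()) inst.rhs = r->second;

        auto key = std::make_tuple(inst.op, inst.lhs, inst.rhs);
        auto hit = available.find(key);
        if (hit != available.end()) {
            replacement[inst.result] = hit->second;
            inst.dead = true;
        } else {
            available[key] = inst.result;
        }
    }
}
```

For a block holding %5 = sub %0, %3, then %6 = sub %0, %3, then an instruction that uses both, the sketch marks %6 as dead and rewrites the later instruction to use %5 for both operands. It handles a single basic block only, which sidesteps the dominance issue mentioned above.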
I am playing with LLVM and tried to compile some simple C++ code with it:
#include <stdio.h>
#include <stdlib.h>
int main()
{
int test = rand();
if (test % 2)
test += 522;
else
test *= 333;
printf("test %d\n", test);
}
I did this specifically to test how LLVM treats branches in code.
The result is strange: it produces the correct output when executed, but it looks inefficient:
; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #0 {
%1 = tail call i32 @rand() #3
%2 = and i32 %1, 1
%3 = icmp eq i32 %2, 0
%4 = add nsw i32 %1, 522
%5 = mul nsw i32 %1, 333
%6 = select i1 %3, i32 %5, i32 %4
%7 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([9 x i8], [9 x i8]* @.str, i64 0, i64 0), i32 %6)
ret i32 0
}
It looks like it executes both paths even though only one is needed.
My question is: shouldn't LLVM generate labels (that is, actual branches) in this case, and if not, why not?
Thank you
P.S. I'm using http://ellcc.org/demo/index.cgi for this test
Branches can be expensive, so generating code without branches, at the cost of one unnecessary add or mul instruction, will usually work out to be faster in practice.
If you make the branches of your if longer, you'll see that it'll eventually become a proper branch instead of a select.
The compiler tends to have a good understanding of which option is faster in which case, so I'd trust it unless you have specific benchmarks that show the version with select to be slower than a version that branches.
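To make the transformation concrete, here is a small C++ sketch of the two shapes the compiler chooses between. The function names with_branch and with_select are hypothetical, and whether a given compiler emits a select (or a conditional move) for either form depends on the target and the optimization level.

```cpp
// Branching form, as written in the question.
int with_branch(int test) {
    if (test % 2)
        test += 522;
    else
        test *= 333;
    return test;
}

// Branch-free form, mirroring the IR: compute both results, then pick one.
int with_select(int test) {
    int added = test + 522;       // corresponds to %4 = add nsw i32 %1, 522
    int multiplied = test * 333;  // corresponds to %5 = mul nsw i32 %1, 333
    return (test & 1) == 0 ? multiplied : added;  // corresponds to the select
}
```

The two functions agree for every input where neither arithmetic expression overflows. Writing the second form by hand evaluates both expressions unconditionally, so unlike the optimizer's version it must not be called with values for which either one overflows.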
Assume a simple partial evaluation scenario:
#include <vector>
/* may be known at runtime */
int someConstant();
/* can be partially evaluated */
double foo(std::vector<double> args) {
return args[someConstant()] * someConstant();
}
Let's say that someConstant() is known and does not change at runtime (e.g. given by the user once) and can be replaced by the corresponding int literal. If foo is part of the hot path, I expect a significant performance improvement:
/* partially evaluated, someConstant() == 2 */
double foo(std::vector<double> args) {
return args[2] * 2;
}
My current take on that problem would be to generate LLVM IR at runtime, because I know the structure of the partially evaluated code (so I would not need a general purpose partial evaluator).
So I want to write a function foo_ir that generates IR code that does the same thing as foo, but not calling someConstant(), because it is known at runtime.
Simple enough, isn't it? Yet, when I look at the generated IR for the code above:
; Function Attrs: uwtable
define double @_Z3fooSt6vectorIdSaIdEE(%"class.std::vector"* %args) #0 {
%1 = call i32 @_Z12someConstantv()
%2 = sext i32 %1 to i64
%3 = call double* @_ZNSt6vectorIdSaIdEEixEm(%"class.std::vector"* %args, i64 %2)
%4 = load double* %3
%5 = call i32 @_Z12someConstantv()
%6 = sitofp i32 %5 to double
%7 = fmul double %4, %6
ret double %7
}
; Function Attrs: nounwind uwtable
define linkonce_odr double* @_ZNSt6vectorIdSaIdEEixEm(%"class.std::vector"* %this, i64 %__n) #1 align 2 {
%1 = alloca %"class.std::vector"*, align 8
%2 = alloca i64, align 8
store %"class.std::vector"* %this, %"class.std::vector"** %1, align 8
store i64 %__n, i64* %2, align 8
%3 = load %"class.std::vector"** %1
%4 = bitcast %"class.std::vector"* %3 to %"struct.std::_Vector_base"*
%5 = getelementptr inbounds %"struct.std::_Vector_base"* %4, i32 0, i32 0
%6 = getelementptr inbounds %"struct.std::_Vector_base<double, std::allocator<double> >::_Vector_impl"* %5, i32 0, i32 0
%7 = load double** %6, align 8
%8 = load i64* %2, align 8
%9 = getelementptr inbounds double* %7, i64 %8
ret double* %9
}
I see that operator[] was pulled in from the STL definition (function @_ZNSt6vectorIdSaIdEEixEm), which is fair enough. The problem is that it could just as well be some member function, or even a direct data access. I simply cannot assume the data layout to be the same everywhere, so at development time I do not know the concrete std::vector layout of the host machine.
Is there some way to use C++ metaprogramming to get the required information at compile time? i.e. is there some way to ask llvm to provide IR for std::vector's [] method?
As a bonus: I would prefer to not enforce the compilation of the library with clang, instead, LLVM shall be a runtime-dependency, so just invoking clang at compile time (even if I do not know how to do this) is a second-best solution.
Answering my own question:
While I still have no solution for the general case (e.g. std::map), there exists a simple solution for std::vector:
According to the C++ standard, the following holds for the member function data()
Returns a direct pointer to the memory array used internally by the
vector to store its owned elements.
Because elements in the vector are guaranteed to be stored in
contiguous storage locations in the same order as represented by the
vector, the pointer retrieved can be offset to access any element in
the array.
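So while the layout of the std::vector object itself remains implementation-specific, the layout of its element storage is fixed by the standard: the pointer returned by data() can be obtained once and then indexed directly.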
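A minimal sketch of how this guarantee might be used, assuming the partially evaluated constant is 2 as in the example from the question. The names foo_partial and foo_via_data are hypothetical. Only the raw element pointer and the size are handed to the specialized code, so nothing depends on how the host's std::vector object is laid out:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Specialized version of foo for someConstant() == 2, operating on the raw
// element pointer instead of the std::vector object.
double foo_partial(const double* elements, std::size_t size) {
    assert(size > 2 && "index 2 must be in range");
    return elements[2] * 2;
}

// Thin wrapper: data() exposes the contiguous storage (C++11 and later).
double foo_via_data(const std::vector<double>& args) {
    return foo_partial(args.data(), args.size());
}
```

In a runtime code generation setting, the generated function would presumably take the same double pointer parameter as foo_partial, with the wrapper compiled ahead of time by whatever compiler builds the library.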
I'm trying to figure out how to use the trampoline intrinsics in LLVM. The documentation makes mention of some amount of storage that's needed to store the trampoline in, which is platform dependent. My question is, how do I figure out how much is needed?
I found this example, that picks 32 bytes for apparently no reason. How does one choose a good value?
declare void @llvm.init.trampoline(i8*, i8*, i8*);
declare i8* @llvm.adjust.trampoline(i8*);
define i32 @foo(i32* nest %ptr, i32 %val)
{
%x = load i32* %ptr
%sum = add i32 %x, %val
ret i32 %sum
}
define i32 @main(i32, i8**)
{
%closure = alloca i32
store i32 13, i32* %closure
%closure_ptr = bitcast i32* %closure to i8*
%tramp_buf = alloca [32 x i8], align 4
%tramp_ptr = getelementptr [32 x i8]* %tramp_buf, i32 0, i32 0
call void @llvm.init.trampoline(
i8* %tramp_ptr,
i8* bitcast (i32 (i32*, i32)* @foo to i8*),
i8* %closure_ptr)
%ptr = call i8* @llvm.adjust.trampoline(i8* %tramp_ptr)
%fp = bitcast i8* %ptr to i32(i32)*
%val2 = call i32 %fp (i32 13)
; %val = call i32 @foo(i32* %closure, i32 42);
ret i32 %val2
}
Yes, trampolines are used to generate some code on the fly. It's unclear why you need these intrinsics at all, because they are used to implement GCC's nested functions extension (in particular, when the address of the nested function is captured and the function accesses variables of the enclosing function).
The best way to figure out the necessary size and alignment of trampoline buffer is to grep gcc sources for "TRAMPOLINE_SIZE" and "TRAMPOLINE_ALIGNMENT".
As far as I can see, at the time of this writing, a buffer of 72 bytes with an alignment of 16 bytes will be enough for all the platforms GCC and LLVM support.
From an LLVM pass, I need to print an LLVM instruction (of type llvm::Instruction) on the screen, just as it appears in the LLVM bitcode file. My compilation is crashing and does not reach the point where the bitcode file is generated, so for debugging I want to print some instructions to find out what is going wrong.
Assuming I is your instruction:
I.print(errs());
That is, simply use the print method.
For a simple Hello World program, using C++ range-based for loops, you can do something like this:
for(auto& B: F){
for(auto& I: B){
errs() << I << "\n";
}
}
This gives the output:
%3 = alloca i32, align 4
%4 = alloca i8**, align 8
store i32 %0, i32* %3, align 4
store i8** %1, i8*** %4, align 8
%5 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([15 x i8], [15 x i8]* @.str, i64 0, i64 0))
ret i32 0