Optimization generate super long function definition - llvm

In my code, I generate the following function:
define i32 #gl.qi([500 x i32] %x, i32 %i) {
entry:
%x. = alloca [500 x i32]
%i. = alloca i32
%0 = alloca [500 x i32]
store [500 x i32] %x, [500 x i32]* %x.
store i32 %i, i32* %i.
%x.1 = load [500 x i32], [500 x i32]* %x.
%i.2 = load i32, i32* %i.
store [500 x i32] %x.1, [500 x i32]* %0
%1 = icmp slt i32 %i.2, 500
br i1 %1, label %in-bound, label %out-of-bound
out-of-bound: ; preds = %entry
call void #gen.panic(i8* getelementptr inbounds ([22 x i8], [22 x i8]* #pool.str.2, i32 0, i32 0))
unreachable
in-bound: ; preds = %entry
%2 = getelementptr inbounds [500 x i32], [500 x i32]* %0, i32 0, i32 %i.2
%idx = load i32, i32* %2
ret i32 %idx
}
the high level functionality is to use %i to index %x, and if %i is out of bound, a panic function is called instead.
consider the store line:
store [500 x i32] %x, [500 x i32]* %x.
once I pass this function to opt -O1 -S --verify --verify-each, it generates code like this:
define i32 #gl.qi([500 x i32] %x, i32 %i) local_unnamed_addr {
entry:
%0 = alloca [500 x i32], align 4
%x.fca.0.extract = extractvalue [500 x i32] %x, 0
%x.fca.1.extract = extractvalue [500 x i32] %x, 1
%x.fca.2.extract = extractvalue [500 x i32] %x, 2
%x.fca.3.extract = extractvalue [500 x i32] %x, 3
%x.fca.4.extract = extractvalue [500 x i32] %x, 4
%x.fca.5.extract = extractvalue [500 x i32] %x, 5
until 500. I put the number to 50000 and it won't stop.
This is puzzling. I am not sure why must a store command be expanded to a sequence of etractvalues then stores? Is there a way to turn off this particular optimization without turning off the whole optimization?
Or am I looking at the wrong way to do this simple task?

To store %x in %x you need to first load the values from %x (function call arguments) before storing them back in %x (local variable). That's what the extractvalue and subsequent store is for. Same thing for %i. So it's not an optimization. It's just the way a variable a gets assigned to another variable b

Related

Executing LLVM code results with Segmentation fault

I have the following code:
#.str_specifier = constant [4 x i8] c"%s\0A\00"
#.int_specifier = constant [4 x i8] c"%d\0A\00"
#.string_var1 = constant [2 x i8] c"f\00"
#.string_var2 = constant [6 x i8] c"Error\00"
; >>> Start Program
declare i32 #printf(i8*, ...)
declare void #exit(i32)
define void #print(i8*) {
call i32 (i8*, ...) #printf(i8* getelementptr ([4 x i8], [4 x i8]* #.str_specifier, i32 0, i32 0), i8* %0)
ret void
}
define void #printi(i32) {
call i32 (i8*, ...) #printf(i8* getelementptr ([4 x i8], [4 x i8]* #.int_specifier, i32 0, i32 0), i32 %0)
ret void
}
declare i8* #malloc(i32)
declare void #free(i8*)
declare void #llvm.memcpy.p0i8.p0i8.i32(i8*, i8*, i32, i1)
define void #main()
{ ; >>> Adding function scope
%funcArgs1 = alloca [50 x i32]
; >>> Adding function arguments allocation
; >>> Function body of main
call void #print(i8* getelementptr ([2 x i8], [2 x i8]* #.string_var1, i32 0, i32 0))
%register1 = call i8* #malloc(i32 48)
%register2 = bitcast i8* %register1 to i32*
%register3 = getelementptr inbounds [50 x i32], [50 x i32]* %funcArgs1, i32 0, i32 0
%register4 = ptrtoint i32* %register2 to i32
store i32 %register4, i32* %register3
%register5 = getelementptr inbounds i32, i32* %register2, i32 0
%register6 = add i32 0, 12
store i32 %register6, i32* %register5
%register7 = getelementptr inbounds i32, i32* %register2, i32 1
%register8 = add i32 0, 2
store i32 %register8, i32* %register7
%register9 = getelementptr inbounds i32, i32* %register2, i32 2
store i32 0, i32* %register9
%register10 = getelementptr inbounds i32, i32* %register2, i32 3
store i32 0, i32* %register10
%register11 = getelementptr inbounds i32, i32* %register2, i32 4
store i32 0, i32* %register11
%register12 = getelementptr inbounds i32, i32* %register2, i32 5
store i32 0, i32* %register12
%register13 = getelementptr inbounds i32, i32* %register2, i32 6
store i32 0, i32* %register13
%register14 = getelementptr inbounds i32, i32* %register2, i32 7
store i32 0, i32* %register14
%register15 = getelementptr inbounds i32, i32* %register2, i32 8
store i32 0, i32* %register15
%register16 = getelementptr inbounds i32, i32* %register2, i32 9
store i32 0, i32* %register16
%register17 = getelementptr inbounds i32, i32* %register2, i32 10
store i32 0, i32* %register17
%register18 = getelementptr inbounds i32, i32* %register2, i32 11
store i32 0, i32* %register18
%register19 = load i32, i32* %register3 ; Get variable x
%register20 = add i32 0, 2
%register21 = inttoptr i32 %register20 to i32*
%register22 = getelementptr inbounds i32, i32* %register21, i32 1
%register23 = load i32, i32* %register22
%register24 = getelementptr inbounds i32, i32* %register21, i32 0
%register25 = load i32, i32* %register24
%register26 = add i32 %register23, %register25
%register27 = sub i32 %register26, 4
%register28 = icmp sgt i32 %register20, %register27
br i1 %register28, label %label1, label %label_cont1
label_cont1:
br label %label2
label1:
call void #print(i8* getelementptr ([6 x i8], [6 x i8]* #.string_var2, i32 0, i32 0))
call void #exit(i32 1)
%register200 = add i32 0, 2
br label %label2
label2:
ret void
} ; >>> Closing function scope
For some reason when I run it, it fails with Segmentation fault (core dumped) without printing an understandable error. The strange thing is if I comment the commands in label1 and keep it:
;call void #print(i8* getelementptr ([6 x i8], [6 x i8]* #.string_var2, i32 0, i32 0))
;call void #exit(i32 1)
;%register200 = add i32 0, 2
br label %label2
It does not result with Segmentation fault. If I comment out at least one of those commands (for example print or the sum), it will fail. Why does it happen?
EDIT: I think I'm getting the same result here. (Here with comments)
I understand that "Segmentation fault" means that I tried to access memory that
I do not have access to. but why can't I even create an new register with some value?
EDIT2: It looks like br i1 %register28, label %label1, label %label_cont1 is the real reason.
Edit3: The actual full code I'm trying to figure can be found here. The problem is that changing it to alloca i32 will result with Error (instead of printing 1). It also contains the C code I'm trying to copy to LLVM.
The segfault originates from this line
%register21 = inttoptr i32 %register20 to i32*
After the cast, register21 supposedly points to some memory location. But what memory location ?? It's value is a non existent address that wasn't gotten through a an alloca instr or malloc call.
Therefore all the other registers that try to dereference this pointer get disappointed.
I've altered the inttptr line

LLVM IR wired behavior on memory alignment of struct?

I get wired behavior on memory alignment of struct(LLVM 10), it doesn't match my learning of memory alignment.
For below c++ code:
struct CC {
char c1 = 'a';
double d1 = 2.0;
int i1 = 12;
bool b1 = true;
int i2 = 13;
bool b2 = true;
} cc1;
int main() {
CC cc2;
}
And it will generate IR like:
%struct.CC = type <{ i8, [7 x i8], double, i32, i8, [3 x i8], i32, i8, [3 x i8] }>
#cc1 = global { i8, double, i32, i8, i32, i8 } { i8 97, double 2.000000e+00, i32 12, i8 1, i32 13, i8 1 }, align 8
define linkonce_odr void #_ZN2CCC2Ev(%struct.CC*) unnamed_addr #1 align 2 {
%2 = alloca %struct.CC*, align 8
store %struct.CC* %0, %struct.CC** %2, align 8
%3 = load %struct.CC*, %struct.CC** %2, align 8
%4 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 0
store i8 97, i8* %4, align 8
%5 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 2
store double 2.000000e+00, double* %5, align 8
%6 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 3
store i32 12, i32* %6, align 8
%7 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 4
store i8 1, i8* %7, align 4
%8 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 6
store i32 13, i32* %8, align 8
%9 = getelementptr inbounds %struct.CC, %struct.CC* %3, i32 0, i32 7
store i8 1, i8* %9, align 4
ret void
}
My Questions:
Must %struct.CC have to add extra data type([7xi8], [3xi8]) for alignment? Is there another way to align struct type?
Why #cc1 doesn't use %struct.CC?
Why #cc1 doesn't add extra data type for alignment?
Why i1 aligns 8 not 4 while using store? What if aligns 4?
Why i2 aligns 8 not 4 while using store?
Too many questions, very very grateful if anyone can answer some of them.
The answer to almost all questions is the same: Platform C/C++ ABI. LLVM (or rather clang frontend) does the necessary thing to do the struct layout as prescribed by the ABI. In order to do so it could add necessary padding as struct members should have proper alignments.

LLVM IR: <N x i1> vector instructions ordering causing different results

I have been using LLVM as a backend for my compiler (I'm not using the LLVM libraries but my own to generate the necessary IR for numerous reasons). At the moment, I am implementing vector operations. Comparisons of vectors emit vectors of booleans <N x i1> and these are causing me problems.
To access vector elements, I have been using extractelement and insertelement however, I am getting some weird behaviour when I execute these instructions in a different orders. The code examples below have the same instructions and should be logically the same. Version 1 outputs BAA while Version 2 outputs BAB. Version 2 is the logically correct version but I cannot figure out why version 1 outputs the wrong version but has the exact same instructions, just in a different order.
EDIT: A workaround to solve this problem is to pass the IR to opt -mem2reg and then compile the bitcode. This is okay but not useful for debugging and thus does not solve my problem.
Version 1
; Version 1 - Generated by my naïve SSA generator
; Outputs: BAA (incorrect)
declare i32 #putchar(i32)
define void #main() {
entry:
%0 = alloca <8 x i1>, align 8 ; v
store <8 x i1> zeroinitializer, <8 x i1>* %0
%1 = alloca <8 x i1>, align 8
store <8 x i1> zeroinitializer, <8 x i1>* %1
%2 = load <8 x i1>, <8 x i1>* %1, align 8
%3 = insertelement <8 x i1> %2, i1 true, i64 0
%4 = insertelement <8 x i1> %3, i1 false, i64 1
%5 = insertelement <8 x i1> %4, i1 true, i64 2
%6 = insertelement <8 x i1> %5, i1 false, i64 3
%7 = insertelement <8 x i1> %6, i1 true, i64 4
%8 = insertelement <8 x i1> %7, i1 false, i64 5
%9 = insertelement <8 x i1> %8, i1 true, i64 6
%10 = insertelement <8 x i1> %9, i1 false, i64 7
store <8 x i1> %10, <8 x i1>* %0
%11 = load <8 x i1>, <8 x i1>* %0, align 8
%12 = extractelement <8 x i1> %11, i64 0
%13 = zext i1 %12 to i32
%14 = add i32 %13, 65 ; + 'A'
%15 = call i32 #putchar(i32 %14)
%16 = load <8 x i1>, <8 x i1>* %0, align 8
%17 = extractelement <8 x i1> %16, i64 1
%18 = zext i1 %17 to i32
%19 = add i32 %18, 65 ; + 'A'
%20 = call i32 #putchar(i32 %19)
%21 = load <8 x i1>, <8 x i1>* %0, align 8
%22 = extractelement <8 x i1> %21, i64 2
%23 = zext i1 %22 to i32
%24 = add i32 %23, 65 ; + 'A'
%25 = call i32 #putchar(i32 %24)
%26 = call i32 #putchar(i32 10) ; \n
ret void
}
Version 2
; Version 2 - Manually modified version of Version 1
; Outputs: BAB (correct)
declare i32 #putchar(i32)
define void #main() {
entry:
%0 = alloca <8 x i1>, align 8 ; v
store <8 x i1> zeroinitializer, <8 x i1>* %0
%1 = alloca <8 x i1>, align 8
store <8 x i1> zeroinitializer, <8 x i1>* %1
%2 = load <8 x i1>, <8 x i1>* %1, align 8
%3 = insertelement <8 x i1> %2, i1 true, i64 0
%4 = insertelement <8 x i1> %3, i1 false, i64 1
%5 = insertelement <8 x i1> %4, i1 true, i64 2
%6 = insertelement <8 x i1> %5, i1 false, i64 3
%7 = insertelement <8 x i1> %6, i1 true, i64 4
%8 = insertelement <8 x i1> %7, i1 false, i64 5
%9 = insertelement <8 x i1> %8, i1 true, i64 6
%10 = insertelement <8 x i1> %9, i1 false, i64 7
store <8 x i1> %10, <8 x i1>* %0
%11 = load <8 x i1>, <8 x i1>* %0, align 8
%12 = load <8 x i1>, <8 x i1>* %0, align 8
%13 = load <8 x i1>, <8 x i1>* %0, align 8
%14 = extractelement <8 x i1> %11, i64 0
%15 = extractelement <8 x i1> %12, i64 1
%16 = extractelement <8 x i1> %13, i64 2
%17 = zext i1 %14 to i32
%18 = zext i1 %15 to i32
%19 = zext i1 %16 to i32
%20 = add i32 %17, 65 ; + 'A'
%21 = add i32 %18, 65 ; + 'A'
%22 = add i32 %19, 65 ; + 'A'
%23 = call i32 #putchar(i32 %20)
%24 = call i32 #putchar(i32 %21)
%25 = call i32 #putchar(i32 %22)
%26 = call i32 #putchar(i32 10) ; \n
ret void
}

Conversion of vector of bool's to integer in llvm ir

I am writing a llvm-ir code which involves vector operations. I did a integer vector comparison with 'icmp' instruction which resulted in a vector of bools say <8 x i1>, my problem is I want to convert this 8 bits to its corresponding integer value with out traversing the vector(extracting elements from vector), I tried 'bitcast <8 x i1> to i8' which seems converting first bit of the vector to i8, correct me if am wrong. Can someone suggest me a way to do this.
define i8 #main() #0 {
entry:
%A = alloca [8 x i32], align 16
%B = alloca [8 x i32], align 16
%arrayidx = getelementptr inbounds [8 x i32], [8 x i32]* %A, i64 0, i64 0
store i32 90, i32* %arrayidx, align 4
%arrayidx1 = getelementptr inbounds [8 x i32], [8 x i32]* %A, i64 0, i64 1
store i32 91, i32* %arrayidx1, align 4
%arrayidx2 = getelementptr inbounds [8 x i32], [8 x i32]* %A, i64 0, i64 2
store i32 92, i32* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds [8 x i32], [8 x i32]* %A, i64 0, i64 3
store i32 93, i32* %arrayidx3, align 4
%arrayidx4 = getelementptr inbounds [8 x i32], [8 x i32]* %B, i64 0, i64 0
store i32 90, i32* %arrayidx4, align 4
%arrayidx5 = getelementptr inbounds [8 x i32], [8 x i32]* %B, i64 0, i64 1
store i32 1, i32* %arrayidx5, align 4
%arrayidx6 = getelementptr inbounds [8 x i32], [8 x i32]* %B, i64 0, i64 2
store i32 92, i32* %arrayidx6, align 8
%arrayidx7 = getelementptr inbounds [8 x i32], [8 x i32]* %B, i64 0, i64 3
store i32 93, i32* %arrayidx7, align 4
br label %vector.body
vector.body:
%0 = bitcast [8 x i32]* %A to <8 x i32>*
%1 = bitcast [8 x i32]* %B to <8 x i32>*
%2 = load <8 x i32>, <8 x i32>* %0
%3 = load <8 x i32>, <8 x i32>* %1
%4 = icmp eq <8 x i32> %2, %3
%5 = bitcast <8 x i1> %4 to i8
ret i8 %5;
}
am using 'lli' for running this code with out any flags. Output is expected to be 11 but am getting 1 or 0
Thank you so much in advance.
As far as I inderstand, you can't do that without calling a platform specific intrinsic. I noticed that by being unable to write target independant code in c++.
For example, the code below:
typedef int v8i __attribute__((vector_size(32)));
int main() {
v8i a = { 1, 2, 3, 4, 5, 6, 7, 8};
v8i b = { 0, 2, 3, 4, 5, 6, 7, 0};
v8i cmp = (a == b);
char res = *(char*)&cmp;
printf("%d\n", res);
return 0;
}
produces llvm-IR which is quite close from what you wrote (with the appropriate bitcast).
Unfortunately it didn't work as expected.
That's because <8 x i1> doesn't exist on the processor. For example, in x86 AVX2, _mm256_cmpeq_epi32 yields a __m256i.
Bitcasting that to a char will just take the first 8 bits of that register.
I wrote instead intel AVX2 specific code, and found the appropriate instruction : intel intrinsic guide
So this code does what you need:
#include <cstdio>
#include <cstdlib>
#include <immintrin.h>
int main() {
__m256i a = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
__m256i b = _mm256_set_epi32(0, 7, 6, 5, 4, 3, 2, 0);
__m256i eq = _mm256_cmpeq_epi32(a, b);
int res = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
printf("res = %d\n", res);
for(int i = 0; i < 8; ++i) {
printf("%d %d -> %d\n", _mm256_extract_epi32(a, i), _mm256_extract_epi32(b, i), !!((res << i) & 0x80));
}
return 0;
}
In terms of ll code, it turns out you need a few additional bitcast (to float), and a call to the intrinsic
#llvm.x86.avx.movmsk.ps.256
rewriting by hand the llvm-IR code leads to :
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
#formatString = private constant [4 x i8] c"%d\0A\00"
define i32 #main() #0 {
%a = alloca <8 x i32>, align 32
%b = alloca <8 x i32>, align 32
store <8 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>, <8 x i32>* %a, align 32
store <8 x i32> <i32 0, i32 2, i32 3, i32 0, i32 5, i32 0, i32 7, i32 0>, <8 x i32>* %b, align 32
%1 = load <8 x i32>, <8 x i32>* %a, align 32
%2 = load <8 x i32>, <8 x i32>* %b, align 32
%3 = icmp eq <8 x i32> %1, %2
%4 = sext <8 x i1> %3 to <8 x i32>
%5 = bitcast <8 x i32> %4 to <8 x float>
%res = call i32 #llvm.x86.avx.movmsk.ps.256(<8 x float> %5)
%6 = call i32 (i8*, ...) #printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* #formatString, i32 0, i32 0), i32 %res)
ret i32 0
}
declare i32 #llvm.x86.avx.movmsk.ps.256(<8 x float>) #1
declare i32 #printf(i8*, ...) #2
attributes #0 = { norecurse uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="haswell" "target-features"="+aes,+avx,+avx2,+bmi,+bmi2,+cmov,+cx16,+f16c,+fma,+fsgsbase,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+xsave,+xsaveopt,-adx,-avx512bw,-avx512cd,-avx512dq,-avx512er,-avx512f,-avx512pf,-avx512vl,-fma4,-hle,-pku,-prfchw,-rdseed,-rtm,-sha,-sse4a,-tbm,-xop,-xsavec,-xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind readnone }
attributes #2 = { "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="haswell" "target-features"="+aes,+avx,+avx2,+bmi,+bmi2,+cmov,+cx16,+f16c,+fma,+fsgsbase,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+xsave,+xsaveopt,-adx,-avx512bw,-avx512cd,-avx512dq,-avx512er,-avx512f,-avx512pf,-avx512vl,-fma4,-hle,-pku,-prfchw,-rdseed,-rtm,-sha,-sse4a,-tbm,-xop,-xsavec,-xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }
The generated assembly (by llc) looks quite optimal:
vmovaps .LCPI0_0(%rip), %ymm0 # ymm0 = [1,2,3,4,5,6,7,8]
vmovaps %ymm0, 32(%rsp)
vmovdqa .LCPI0_1(%rip), %ymm0 # ymm0 = [0,2,3,0,5,0,7,0]
vmovdqa %ymm0, (%rsp)
vpcmpeqd 32(%rsp), %ymm0, %ymm0
vmovmskps %ymm0, %esi
I found this way working.
define i8 #main() #0 {
entry:
%0 = icmp eq <8 x i32> <i32 90,i32 91,i32 92,i32 93, i32 94,i32 95,i32 96,i32 97>, <i32 90,i32 91,i32 92,i32 93, i32 94,i32 95,i32 96,i32 97>
%1 = bitcast <8 x i1> %0 to <1 x i8>
%2 = extractelement <1 x i8> %1, i32 0
ret i8 %2
}
This is similar code as I posted in the question, I checked the result with "echo $?" am getting the result as expected.

Create local string using LLVM

I'm trying to create a local variable using LLVM to store strings, but my code is currently throwing a syntax error.
lli: test2.ll:8:23: error: constant expression type mismatch
%1 = load [6 x i8]* c"hello\00"
My IR code that allocates and store the string:
#.string = private constant [4 x i8] c"%s\0A\00"
define void #main() {
entry:
%a = alloca [255 x i8]
%0 = bitcast [255 x i8]* %a to i8*
%1 = load [6 x i8]* c"hello\00"
%2 = bitcast [6 x i8]* %1 to i8*
%3 = tail call i8* #strncpy(i8* %0, i8* %2, i64 255) nounwind
%4 = getelementptr inbounds [6 x i8]* %a, i32 0, i32 0
%5 = call i32 (i8*, ...)* #printf(i8* getelementptr inbounds ([4 x i8]* #.string, i32 0, i32 0), i8* %4)
ret void
}
declare i32 #printf(i8*, ...)
declare i8* #strncpy(i8*, i8* nocapture, i64) nounwind
Using llc I could see that the way llvm implements is allocating and assigning to a global variable, but I want it to be local (inside a basic block). The code below works, but I don't want to create this var "#.str"...
#str = global [1024 x i8] zeroinitializer, align 16
#.str = private unnamed_addr constant [6 x i8] c"hello\00", align 1
#.string = private constant [4 x i8] c"%s\0A\00"
define i32 #main() nounwind uwtable {
%1 = tail call i8* #strncpy(i8* getelementptr inbounds ([1024 x i8]* #str, i64 0, i64 0), i8* getelementptr inbounds ([6 x i8]* #.str, i64 0, i64 0), i64 1024) nounwind
%2 = call i32 (i8*, ...)* #printf(i8* getelementptr inbounds ([4 x i8]* #.string, i32 0, i32 0), i8* %1)
ret i32 0
}
declare i8* #strncpy(i8*, i8* nocapture, i64) nounwind
declare i32 #printf(i8*, ...) #2
Thanks
I figured out by myself after messing more with my previous code.
Below is the code, so people who had the same problem as I had can check
#.string = private constant [4 x i8] c"%s\0A\00"
define void #main() {
entry:
%a = alloca [6 x i8]
store [6 x i8] [i8 104,i8 101,i8 108,i8 108, i8 111, i8 0], [6 x i8]* %a
%0 = bitcast [6 x i8]* %a to i8*
%1 = call i32 (i8*, ...)* #printf(i8* getelementptr inbounds ([4 x i8]* #.string, i32 0, i32 0), i8* %0)
ret void
}
declare i32 #printf(i8*, ...)
Basically, I had to store each of the characters individually in the array and then bitcast to i8* so I could use the printf function. I couldn't use the c" ... " method which is the one shown in LLVM webpage http://llvm.org/docs/LangRef.html#id669 . It seems it is a special case in the language specification of the IR and they required to be in the global scope.
UPDATE: I was working on the same code again and I found out that the best way was to store a constant instead of each of the i8 symbols. So the line 6, will be replaced by:
store [6 x i8] c"hello\00", [6 x i8]* %0
It is easier to generate code using llvm and it's more readable!