The LLVM documentation for 'shl' says that
<result> = shl i32 1, 32
is an undefined value because it's shifting by greater than or equal to the number of bits in an i32. However, it's not clear to me what happens with
<result> = shl <2 x i32> < i32 1, i32 1>, < i32 1, i32 32>
Is only the second element of the result undefined (result=<2 x i32> < i32 2, i32 undef>), or is the result as a whole undefined (result=<2 x i32> undef)?
I am new to LLVM programming, and I am trying to write C++ code that generates LLVM IR for simple C code like this:
int a[10];
a[0] = 1;
I want to generate something like this to store 1 into a[0]
%3 = getelementptr inbounds [10 x i32], [10 x i32]* %2, i64 0, i64 0
store i32 1, i32* %3, align 16
I tried CreateGEP:
auto arrayPtr = builder.CreateInBoundsGEP(var, num);
where var and num are both of type llvm::Value*,
but I only get
%1 = getelementptr inbounds [10 x i32], [10 x i32]* %0, i32 0
store i32 1, [10 x i32]* %1
I searched Google for a long time and looked through the LLVM manual, but I still don't know which C++ API to use or how to use it.
Really appreciate it if you can help!
Note that the 2nd argument to IRBuilder::CreateInBoundsGEP (1st overload) is actually an ArrayRef<Value *>, which means it accepts an array of Value * values (including a C-style array, std::vector<Value *>, std::array<Value *, LEN>, and others).
To generate a GEP instruction with multiple (child) addresses, pass an array of Value * to the second argument:
Value *i32zero = ConstantInt::get(context, APInt(32, 0));
Value *indices[2] = {i32zero, i32zero};
builder.CreateInBoundsGEP(var, ArrayRef<Value *>(indices, 2));
Which will yield
%1 = getelementptr inbounds [10 x i32], [10 x i32]* %0, i32 0, i32 0
As you can verify, %1 is of type i32*, pointing to the first element of the array pointed to by %0.
LLVM documentation on GEP instruction: https://llvm.org/docs/GetElementPtr.html
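For completeness, here is a minimal sketch of the whole sequence for a[0] = 1, assuming %2 is the alloca of the array (register names are illustrative):

```llvm
; allocate int a[10] on the stack
%2 = alloca [10 x i32]
; index 0 into the pointer, then index 0 into the array: yields an i32*
%3 = getelementptr inbounds [10 x i32], [10 x i32]* %2, i64 0, i64 0
; a[0] = 1
store i32 1, i32* %3
```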
I am trying to write an LLVM pass that counts instructions of vector type.
For instructions like:
%24 = or <2 x i64> %21, %23
%25 = bitcast <16 x i8> %12 to <8 x i16>
%26 = shl <8 x i16> %25, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%27 = bitcast <8 x i16> %26 to <2 x i64>
I wrote this code:
for (auto &F : M) {
    for (auto &B : F) {
        for (auto &I : B) {
            if (auto *VI = dyn_cast<InsertElementInst>(&I)) {
                Value *op = VI->getOperand(0);
                if (op->getType()->isVectorTy()) {
                    ++vcount;
                }
            }
        }
    }
}
But for some reason if (auto* VI = dyn_cast<InsertElementInst>(&I)) is never satisfied.
Any idea why?
Thanks in advance.
InsertElementInst is one specific instruction (one that inserts an element into a vector) - and there is none in your list of instructions.
You probably don't want to dyn_cast to a specific instruction type at all; just use the Instruction in I as it is.
[Personally, I would use one of the function or module pass classes as a base, so you only need to implement the inner loops of your code - but that's more of a "how you're supposed to do things" point, not something you HAVE to do to make it work.]
In LLVM, an instruction is the same as its result. So, for example, given
%25 = bitcast <16 x i8> %12 to <8 x i16>
when you cast the Instruction I to Value, you get %25:
Value* psVal = cast<Value>(&I);
and then you can check whether it is of vector type with getType()->isVectorTy().
Also, I suggest you look at the inheritance diagram of llvm::Value for more clarification:
http://llvm.org/docs/doxygen/html/classllvm_1_1Value.html
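Putting that together, the counting loop could look roughly like this (a sketch, assuming M and vcount as in the question; every Instruction is already a Value, so no cast is strictly needed):

```cpp
for (auto &F : M) {
    for (auto &B : F) {
        for (auto &I : B) {
            // The type of an instruction is the type of its result value.
            if (I.getType()->isVectorTy())
                ++vcount;
        }
    }
}
```

Note that this counts instructions whose result is a vector; if you also want instructions that merely consume vector operands (e.g. extractelement, which produces a scalar), check the operand types as well.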
Experimenting with C++, I tried to understand the performance difference between sizeof and strlen for string literals.
Here is my small benchmark code:
#include <iostream>
#include <cstring>
#define LOOP_COUNT 1000000000
unsigned long long rdtscl(void)
{
unsigned int lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main()
{
unsigned long long before = rdtscl();
size_t ret;
for (int i = 0; i < LOOP_COUNT; i++)
ret = strlen("abcd");
unsigned long long after = rdtscl();
std::cout << "Strlen " << (after - before) << " ret=" << ret << std::endl;
before = rdtscl();
for (int i = 0; i < LOOP_COUNT; i++)
ret = sizeof("abcd");
after = rdtscl();
std::cout << "Sizeof " << (after - before) << " ret=" << ret << std::endl;
}
Compiling with clang++, I get the following result:
clang++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 36 ret=4
Sizeof 62092396 ret=5
With g++:
g++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 30 ret=4
Sizeof 30 ret=5
I strongly suspect that g++ does optimize away the loop with sizeof and clang++ doesn't.
Is this result a known issue?
EDIT:
The assembly generated by clang++ for the loop with sizeof:
rdtsc
mov %edx,%r14d
shl $0x20,%r14
mov $0x3b9aca01,%ecx
xchg %ax,%ax
add $0xffffffed,%ecx // 0x400ad0
jne 0x400ad0 <main+192>
mov %eax,%eax
or %rax,%r14
rdtsc
And the one by g++:
rdtsc
mov %edx,%esi
mov %eax,%ecx
rdtsc
I don't understand why clang++ does the {add, jne} loop; it seems useless. Is it a bug?
For information:
g++ (GCC) 5.1.0
clang version 3.6.2 (tags/RELEASE_362/final)
EDIT2:
It is likely to be a bug in clang.
I opened a bug report.
I'd call that a bug in clang.
It's actually optimising the sizeof itself, just not the loop.
To make the code much clearer, I changed the std::cout calls to printf, and then you get the following LLVM IR code for main:
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
entry:
%0 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i = extractvalue { i32, i32 } %0, 1
%conv2.i = zext i32 %asmresult1.i to i64
%shl.i = shl nuw i64 %conv2.i, 32
%asmresult.i = extractvalue { i32, i32 } %0, 0
%conv.i = zext i32 %asmresult.i to i64
%or.i = or i64 %shl.i, %conv.i
%1 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.25 = extractvalue { i32, i32 } %1, 0
%asmresult1.i.26 = extractvalue { i32, i32 } %1, 1
%conv.i.27 = zext i32 %asmresult.i.25 to i64
%conv2.i.28 = zext i32 %asmresult1.i.26 to i64
%shl.i.29 = shl nuw i64 %conv2.i.28, 32
%or.i.30 = or i64 %shl.i.29, %conv.i.27
%sub = sub i64 %or.i.30, %or.i
%call2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub, i64 4)
%2 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i.32 = extractvalue { i32, i32 } %2, 1
%conv2.i.34 = zext i32 %asmresult1.i.32 to i64
%shl.i.35 = shl nuw i64 %conv2.i.34, 32
br label %for.cond.5
for.cond.5: ; preds = %for.cond.5, %entry
%i4.0 = phi i32 [ 0, %entry ], [ %inc10.18, %for.cond.5 ]
%inc10.18 = add nsw i32 %i4.0, 19
%exitcond.18 = icmp eq i32 %inc10.18, 1000000001
br i1 %exitcond.18, label %for.cond.cleanup.7, label %for.cond.5
for.cond.cleanup.7: ; preds = %for.cond.5
%asmresult.i.31 = extractvalue { i32, i32 } %2, 0
%conv.i.33 = zext i32 %asmresult.i.31 to i64
%or.i.36 = or i64 %shl.i.35, %conv.i.33
%3 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.37 = extractvalue { i32, i32 } %3, 0
%asmresult1.i.38 = extractvalue { i32, i32 } %3, 1
%conv.i.39 = zext i32 %asmresult.i.37 to i64
%conv2.i.40 = zext i32 %asmresult1.i.38 to i64
%shl.i.41 = shl nuw i64 %conv2.i.40, 32
%or.i.42 = or i64 %shl.i.41, %conv.i.39
%sub13 = sub i64 %or.i.42, %or.i.36
%call14 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub13, i64 5)
ret i32 0
}
As you can see, the call to printf is using the constant 5 from sizeof, and for.cond.5: starts the empty loop:
a "phi" node (which selects the "new" value of i based on where we came from - before the loop -> 0, inside the loop -> %inc10.18)
an increment
a conditional branch that jumps back if %inc10.18 is not 1000000001.
I don't know enough about clang and LLVM to explain why that empty loop isn't optimised away. But it's certainly not the sizeof that is taking time, as there is no sizeof inside the loop.
It's worth noting that sizeof is ALWAYS a compile-time constant; it NEVER "takes time" beyond loading a constant value into a register.
The difference is that sizeof() is not a function call. The value "returned" by sizeof() is known at compile time.
At the same time, strlen is a function call (executed at run time, apparently) and is not optimized away at all. It seeks '\0' in the string, and it doesn't even have a single clue whether it is a dynamically allocated string or a string literal.
So sizeof is expected to always be faster for string literals.
I am not an expert, but your results might be explained by scheduling algorithms or overflow in your time variables.
I am writing LLVM code using C++. I have a place in my code where the scenario below happens:
1. %117 = phi <2 x double>* [ %105, %aligned ], [ %159, %116 ]
7. %123 = getelementptr <2 x double>* %117, i32 0
8. %127 = getelementptr <2 x double>* %123, i32 0
9. %128 = load <2 x double>* %127
10. %129 = getelementptr <2 x double>* %123, i32 1
11. %130 = load <2 x double>* %129
12. %131 = shufflevector <2 x double> %128, <2 x double> %130, <2 x i32> <i32 1, i32 3>
In lines 7 and 8 I compute the same address (pointing to the same data type) twice, each time with index 0 but with a different base pointer. Is it safe to do this, or will it lead to undefined results?
The instruction
%x = getelementptr %anytype* %y, i32 0
Is completely meaningless; it's as if you've written (the illegal):
%x = %y
So yes, both %123 and %127 will point to the same memory. It's safe, but redundant: you can just use %117 directly wherever %123 or %127 are used. The only problematic thing in your snippet is that the value numbering is not sequential, but I assume that's just from pasting just parts of the code here.
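After simplification (e.g. by instcombine), the snippet would reduce to something like this (a sketch; register names kept from the question):

```llvm
; the zero-index GEPs %123 and %127 fold away; use %117 directly
%128 = load <2 x double>* %117
%129 = getelementptr <2 x double>* %117, i32 1
%130 = load <2 x double>* %129
%131 = shufflevector <2 x double> %128, <2 x double> %130, <2 x i32> <i32 1, i32 3>
```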
I'm writing a compiler that's generating LLVM IR instructions. I'm working extensively with vectors.
I would like to be able to sum all the elements in a vector. Right now I'm just extracting each element individually and adding them up manually, but it strikes me that this is precisely the sort of thing that the hardware should be able to help with (as it sounds like a pretty common operation). But there doesn't seem to be an intrinsic to do it.
What's the best way to do this? I'm using LLVM 3.2.
First of all, even without using intrinsics, you can generate log(n) vector additions (with n being vector length) instead of n scalar additions, here's an example with vector size 8:
define i32 @sum(<8 x i32> %a) {
%v1 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%v2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%sum1 = add <4 x i32> %v1, %v2
%v3 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%v4 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
%sum2 = add <2 x i32> %v3, %v4
%v5 = extractelement <2 x i32> %sum2, i32 0
%v6 = extractelement <2 x i32> %sum2, i32 1
%sum3 = add i32 %v5, %v6
ret i32 %sum3
}
If your target has support for these vector additions then it seems highly likely the above will be lowered to use those instructions, giving you performance.
Regarding intrinsics, there are no target-independent intrinsics to handle this. If you're compiling for x86, though, you do have access to the horizontal-add intrinsics (e.g. llvm.x86.int_x86_ssse3_phadd_sw_128 to add two <4 x i32> vectors together). You'll still have to do something similar to the above, only the add instructions would be replaced.
For more information about this you can search for "horizontal sum" or "horizontal vector sum"; for instance, here are some relevant stackoverflow questions for a horizontal sum on x86:
horizontal sum of 8 packed 32bit floats
Fastest way to do horizontal vector sum with AVX instructions
Fastest way to do horizontal float vector sum on x86