LLVM IR: efficiently summing a vector

I'm writing a compiler that's generating LLVM IR instructions. I'm working extensively with vectors.
I would like to be able to sum all the elements in a vector. Right now I'm just extracting each element individually and adding them up manually, but it strikes me that this is precisely the sort of thing that the hardware should be able to help with (as it sounds like a pretty common operation). But there doesn't seem to be an intrinsic to do it.
What's the best way to do this? I'm using LLVM 3.2.

First of all, even without intrinsics you can generate log(n) vector additions (where n is the vector length) instead of n scalar additions. Here's an example with vector size 8:
define i32 @sum(<8 x i32> %a) {
%v1 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%v2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%sum1 = add <4 x i32> %v1, %v2
%v3 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%v4 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
%sum2 = add <2 x i32> %v3, %v4
%v5 = extractelement <2 x i32> %sum2, i32 0
%v6 = extractelement <2 x i32> %sum2, i32 1
%sum3 = add i32 %v5, %v6
ret i32 %sum3
}
If your target supports vector additions of these widths, the above will most likely be lowered to those instructions, so the 8-element reduction takes three additions instead of seven.
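To make the halving pattern easier to follow, here is a minimal scalar model of it in C++. It is only a sketch: tree_sum is a hypothetical helper name, the inner loop stands in for a single vector add, and the lane count is assumed to be a power of two with values small enough not to overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the shuffle-and-add reduction: each round adds the
// upper half of the live lanes onto the lower half, so n lanes need
// log2(n) rounds. The IR above does its last round with extractelement
// and a scalar add; here that round is simply the width == 1 iteration.
std::int32_t tree_sum(std::vector<std::int32_t> lanes) {
    const std::size_t n = lanes.size();
    assert(n > 0 && (n & (n - 1)) == 0 && "lane count must be a power of two");
    for (std::size_t width = n / 2; width >= 1; width /= 2) {
        // One "vector add" of `width` lanes.
        for (std::size_t i = 0; i < width; ++i) {
            lanes[i] += lanes[i + width];
        }
    }
    return lanes[0];
}
```

For 8 lanes the outer loop runs with width 4, 2 and 1, which matches %sum1, %sum2 and %sum3 in the IR.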
Regarding intrinsics: LLVM 3.2 has no target-independent intrinsic for this (later releases added the llvm.vector.reduce.add family). If you're compiling to x86, though, you do have access to the horizontal-add intrinsics (e.g. llvm.x86.ssse3.phadd.d.128, which adds adjacent pairs of elements from two <4 x i32> vectors). You'll still have to do something similar to the above, only with the add instructions replaced.
For more information about this you can search for "horizontal sum" or "horizontal vector sum"; for instance, here are some relevant stackoverflow questions for a horizontal sum on x86:
horizontal sum of 8 packed 32bit floats
Fastest way to do horizontal vector sum with AVX instructions
Fastest way to do horizontal float vector sum on x86

Related

How to use CreateInBoundsGEP in cpp api of llvm to access the element of an array?

I am new to llvm programming, and I am trying to write cpp to generate llvm ir for a simple C code like this:
int a[10];
a[0] = 1;
I want to generate something like this to store 1 into a[0]
%3 = getelementptr inbounds [10 x i32], [10 x i32]* %2, i64 0, i64 0
store i32 1, i32* %3, align 16
And I tried CreateInBoundsGEP: auto arrayPtr = builder.CreateInBoundsGEP(var, num); where var and num are both of type llvm::Value*,
but I only get
%1 = getelementptr inbounds [10 x i32], [10 x i32]* %0, i32 0
store i32 1, [10 x i32]* %1
I searched Google for a long time and looked through the LLVM manual, but I still don't know which C++ API to use or how to use it.
Really appreciate it if you can help!
Note that the 2nd argument to IRBuilder::CreateInBoundsGEP (1st overload) is actually ArrayRef<Value *>, which means it accepts an array of Value * values (including C-style array, std::vector<Value *> and std::array<Value *, LEN> and others).
To generate a GEP instruction with multiple indices, pass an array of Value * as the second argument:
Value *i32zero = ConstantInt::get(context, APInt(32, 0));
Value *indices[2] = {i32zero, i32zero};
builder.CreateInBoundsGEP(var, ArrayRef<Value *>(indices, 2));
Which will yield
%1 = getelementptr inbounds [10 x i32], [10 x i32]* %0, i32 0, i32 0
You can verify that %1 now has type i32* and points to the first item in the array pointed to by %0.
LLVM documentation on GEP instruction: https://llvm.org/docs/GetElementPtr.html
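The role of the two indices can be seen in a plain C++ analogy. This is not LLVM API code, and the names Array10, first_index_only, both_indices and store_one are made up for illustration: the first index steps over whole arrays and keeps the pointer-to-array type, while the second index steps into the array and yields a pointer to an element.

```cpp
#include <cassert>
#include <type_traits>

// With p of type int (*)[10], playing the role of [10 x i32]* %0:
//   p + i          ~ getelementptr ..., i32 i         (still points at an array)
//   &(*(p + i))[j] ~ getelementptr ..., i32 i, i32 j  (points at an int)
using Array10 = int[10];

Array10 *first_index_only(Array10 *p, int i) { return p + i; }
int *both_indices(Array10 *p, int i, int j) { return &(*(p + i))[j]; }

// The result types differ exactly as they do in the two GEPs above.
static_assert(std::is_same<decltype(first_index_only(nullptr, 0)), Array10 *>::value,
              "one index keeps the pointer-to-array type");
static_assert(std::is_same<decltype(both_indices(nullptr, 0, 0)), int *>::value,
              "two indices give a pointer to the element");

// Mirrors `a[0] = 1` from the question.
void store_one(Array10 *p) {
    int *elem = both_indices(p, 0, 0);
    assert(elem != nullptr);
    *elem = 1;
}
```

This is why the single-index GEP in the question produced a store through [10 x i32]* rather than i32*.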

LLVM IR: initialize and cast [20 x i8]

I am trying to initialize and then cast a number of LLVM IR variables in the following way:
store i64 %content, i64* %5
%tt2 = load i64, i64* %5
%ttt2 = trunc i64 %tt2 to i32
While this seems trivial and works fine, I am stuck trying to do the same thing for a [20 x i8] typed variable. Something like:
store [20 x i8] %content, [20 x i8]* %5
%tt2 = load [20 x i8], [20 x i8]* %5
%ttt2 = trunc [20 x i8] %tt2 to i32
Currently I got the following error msg for the third line:
invalid cast opcode for cast from [20 x i8] to i32
Could anyone shed some light on this issue? Thanks!
You can trunc from one int to another, but not from an array to an int. That's just how trunc is defined — if the input isn't an int, then trunc would need to do something markedly different from "drop the higher-order bits and preserve the lower-order bits".
I think the most common approach is to cast the pointer and then load/store from a pointer that already matches the type you want to load/store.
(Note that %ttt2 etc. aren't LLVM variables, they're LLVM values. They don't vary, ever.)
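In IR terms, "cast the pointer and then load" would be something like a bitcast of [20 x i8]* %5 to i32* followed by a load of i32 (assuming an LLVM version with typed pointers, as in the question). The C++ sketch below models the same idea; load_low_u32 is a hypothetical name, and std::memcpy is used because it is the well-defined way to reinterpret bytes in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads the first 4 bytes of a byte buffer as one 32-bit integer.
// Which array elements end up in the low-order bits of the result
// depends on the target's endianness.
std::uint32_t load_low_u32(const std::uint8_t *bytes, std::size_t size) {
    assert(size >= sizeof(std::uint32_t));
    std::uint32_t value = 0;
    std::memcpy(&value, bytes, sizeof(value));
    return value;
}
```

Note that this picks the first four bytes in memory, which is not necessarily what "truncating" a 160-bit quantity would mean; the choice of bytes has to be made explicitly.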

LLVM: How to create char array reference

I am trying to implement an opt pass that will compress string constants in ROM and then, at the cost of CPU + RAM, re-materialize the values at runtime. Before implementing compression, I just want to place all strings in a table, and do a lookup.
Example:
printf("Hello");
Would become the equivalent of
char placeholder[6];
int strID = 0;
tableLookup(placeholder, 0 /*ID*/); // Fill array
printf(placeholder);
The LLVM IR I was able to generate looks like the following:
%fakeString = alloca [10 x i8], align 1
call void @llvm.dbg.declare(metadata [10 x i8]* %fakeString, metadata !60, metadata !25), !dbg !64
%arraydecay = getelementptr inbounds [10 x i8], [10 x i8]* %fakeString, i32 0, i32 0, !dbg !65
call void @tableLookup(i8* %arraydecay, i32 0) #2, !dbg !66
How would I be able to create this programmatically? The two main pieces I am missing:
1. How to create the array reference (after creating the alloca instruction)
2. How to get result from tableLookup and replace the operand in the old printf()
Any help would be much appreciated!
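For reference, the runtime side that the rewritten call relies on could look roughly like the sketch below. The table contents and the name kStringTable are assumptions made for illustration; only the tableLookup(char *, int) shape comes from the example above, and the caller is assumed to supply a buffer large enough for the string plus its terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical string table. In the real pass this would be emitted as
// a global and could later hold compressed data instead.
static const char *const kStringTable[] = {"Hello", "World"};

// Copies the string with the given ID, including its terminator, into dest.
void tableLookup(char *dest, int id) {
    const std::size_t count = sizeof(kStringTable) / sizeof(kStringTable[0]);
    assert(id >= 0 && static_cast<std::size_t>(id) < count);
    std::strcpy(dest, kStringTable[id]);
}
```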

LLVM pass to count vector type instructions

I am trying to write an LLVM pass that counts instructions of vector type.
For instructions like:
%24 = or <2 x i64> %21, %23
%25 = bitcast <16 x i8> %12 to <8 x i16>
%26 = shl <8 x i16> %25, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%27 = bitcast <8 x i16> %26 to <2 x i64>
I wrote this code:
for (auto &F : M) {
  for (auto &B : F) {
    for (auto &I : B) {
      if (auto *VI = dyn_cast<InsertElementInst>(&I)) {
        Value *op = VI->getOperand(0);
        if (op->getType()->isVectorTy()) {
          ++vcount;
        }
      }
    }
  }
}
But for some reason if (auto* VI = dyn_cast<InsertElementInst>(&I)) is never satisfied.
Any idea why?
Thanks in advance.
InsertElementInst is one specific instruction (the one that inserts an element into a vector), and there is none in your list of instructions.
You probably don't want to dyn_cast at all; just use the Instruction in I as it is.
[I personally would use one of the function or module pass classes as a base, so you only need to implement the inner loops of your code, but that's more of a "it's how you're supposed to do things", not something you HAVE to do to make it work].
In LLVM, an instruction is the same object as its result. For example, given
%25 = bitcast <16 x i8> %12 to <8 x i16>
when you cast the Instruction I to a Value you get %25:
Value* psVal = cast<Value>(&I);
and then you can check whether it is of vector type with psVal->getType()->isVectorTy().
I also suggest looking at the inheritance diagram of llvm::Value for more clarification:
http://llvm.org/docs/doxygen/html/classllvm_1_1Value.html
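The difference between the two approaches can be shown with a toy model that does not need LLVM at all. The Toy* classes below are invented stand-ins, not LLVM types, and dynamic_cast plays the role of dyn_cast; in real code the second check would be I.getType()->isVectorTy() as described above.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins for llvm::Instruction and a few subclasses.
struct ToyInst {
    bool resultIsVector;
    explicit ToyInst(bool v) : resultIsVector(v) {}
    virtual ~ToyInst() = default;
};
struct ToyInsertElement : ToyInst { using ToyInst::ToyInst; };
struct ToyBinaryOp : ToyInst { using ToyInst::ToyInst; };
struct ToyCast : ToyInst { using ToyInst::ToyInst; };

using Block = std::vector<std::unique_ptr<ToyInst>>;

// Like the question's loop: only one instruction kind can ever match.
int countViaInsertElementCast(const Block &block) {
    int n = 0;
    for (const auto &inst : block) {
        assert(inst);
        if (auto *ie = dynamic_cast<const ToyInsertElement *>(inst.get()))
            if (ie->resultIsVector) ++n;
    }
    return n;
}

// Like the answers' suggestion: test every instruction's own result type.
int countViaResultType(const Block &block) {
    int n = 0;
    for (const auto &inst : block) {
        assert(inst);
        if (inst->resultIsVector) ++n;
    }
    return n;
}
```

With a block shaped like the four instructions in the question (or, bitcast, shl, bitcast, all producing vectors), the first function counts nothing and the second counts all four.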

What does an overlong bitshift on an LLVM vector yield?

The LLVM documentation for 'shl' says that
<result> = shl i32 1, 32
is an undefined value because it's shifting by greater than or equal to the number of bits in an i32. However, it's not clear to me what happens with
<result> = shl <2 x i32> < i32 1, i32 1>, < i32 1, i32 32>
Is only the second element of the result undefined (result=<2 x i32> < i32 2, i32 undef>), or is the result as a whole undefined (result=<2 x i32> undef)?
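One way to picture the first of those two readings is the per-lane model below. It is a sketch under an explicit assumption: that vector shl is applied element by element, so an oversized shift amount affects only its own lane. The LangRef describes vector shifts as shifting each element by the corresponding amount, which points that way, but the model is not a statement about what any optimizer does with such a value. Lane and shl_lanes are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One i32 lane whose value may be undefined.
struct Lane {
    bool defined;
    std::uint32_t value;  // meaningful only when defined is true
};

// ASSUMPTION: element-wise semantics. A lane with shift amount >= 32
// is marked undefined; every other lane is shifted normally.
std::vector<Lane> shl_lanes(const std::vector<std::uint32_t> &value,
                            const std::vector<std::uint32_t> &amount) {
    assert(value.size() == amount.size());
    std::vector<Lane> result(value.size(), Lane{false, 0});
    for (std::size_t i = 0; i < value.size(); ++i) {
        if (amount[i] < 32) {
            result[i] = Lane{true, value[i] << amount[i]};
        }
    }
    return result;
}
```

Under that assumption, the example from the question gives a first lane of 2 and an undefined second lane.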