LLVM AMDGPU alloca address spaces - llvm

I've been trying to get .NET CIL to run on multiple platforms (I'm particularly interested in GPUs) by means of LLVM. I've used Mono.Compiler for translating CIL to LLVM.
I'm having trouble getting AMDGCN to work. For a simple add function, I'm getting the following translated IR:
; ModuleID = 'bitout.bc'
source_filename = "llvmmodule_1"
define i32 @llvmmodule_1_AddMethod(i32, i32) {
entry:
%A0 = alloca i32
store i32 %0, i32* %A0
%A1 = alloca i32
store i32 %1, i32* %A1
%T0 = load i32, i32* %A0
%T1 = load i32, i32* %A1
%T2 = add i32 %T0, %T1
ret i32 %T2
}
I've tried emitting it directly through libLLVM's TargetMachineEmitTo{File, MemoryBuffer} as well as indirectly, via llc.
Emitting directly results in a SIGSEGV:
Thread 1 "mono" received signal SIGSEGV, Segmentation fault.
0x00007fffeddadb1a in llvm::AMDGPUInstPrinter::getRegisterName(unsigned int) () from /usr/lib/libLLVM.so
This seems to happen due to a (negative) buffer overflow in the above function (as far as I could tell from gdb).
llc fails on both amdgcn and r600 with:
Allocation instruction pointer not in the stack address space!
%A0 = alloca i32
Allocation instruction pointer not in the stack address space!
%A1 = alloca i32
Otherwise, llc compiles fine for all other platforms (except wasm64).
After some digging, I've been wondering whether this could come from not specifying the address space in alloca (the LLVM guide for AMDGPU doesn't really explain this), so I took a copy of the translated IR and changed the address space. It turned out that llc compiles it if the allocations are in the Private address space, which I guess serves as the stack space. I find it odd, though, that neither the Global nor the Region address space works. Shouldn't I be able to allocate space in Global memory? What am I missing here?
On the same note, I can't find a way to create an alloca instruction that takes an address space (BuildAlloca doesn't take an address space as an argument and I couldn't find any documentation or examples that mention alternatives).
If it matters, I'm using the default libLLVM on ArchLinux (at this time, llvm 8.0.1).

For AMDGPU Clang uses the following code to get the alloca:
LangAS getASTAllocaAddressSpace() const override {
return getLangASFromTargetAS(
getABIInfo().getDataLayout().getAllocaAddrSpace());
}
Ignore getLangASFromTargetAS here; it is just a way to make the result work with the LangAS enum.
The takeaway is that you need to get the address space from the DataLayout instead of setting it to zero – but only for AMDGPU.
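To make the DataLayout point concrete: the layout string is a '-'-separated list of specs, and the `A<n>` spec names the address space used for alloca (0 when absent). The sketch below is a standalone parser for that one spec, not LLVM code; `allocaAddrSpaceFromLayout` is a hypothetical helper name and the layout strings in the usage note are shortened examples. With the C++ API you would not parse anything yourself: `DataLayout::getAllocaAddrSpace()` returns the value, and `IRBuilder::CreateAlloca` has an overload that takes the address space.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the alloca address space named by an LLVM
// data layout string. Specs are separated by '-'; "A<n>" selects the
// address space for alloca, and 0 is the default when no such spec exists.
unsigned allocaAddrSpaceFromLayout(const std::string &layout) {
    std::size_t pos = 0;
    while (pos <= layout.size()) {
        std::size_t end = layout.find('-', pos);
        if (end == std::string::npos) end = layout.size();
        const std::string spec = layout.substr(pos, end - pos);
        // Upper-case 'A' only: lower-case 'a' is the aggregate alignment spec.
        if (spec.size() > 1 && spec[0] == 'A') {
            bool allDigits = true;
            for (std::size_t i = 1; i < spec.size(); ++i) {
                if (!std::isdigit(static_cast<unsigned char>(spec[i]))) allDigits = false;
            }
            if (allDigits) return static_cast<unsigned>(std::stoul(spec.substr(1)));
        }
        pos = end + 1;
    }
    return 0;
}
```

For a layout such as "e-p:64:64-p5:32:32-A5" this returns 5, while a layout without an `A` spec yields 0, which is why hard-coding address space 0 happens to work on most other targets.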

Related

Insert constant value into code segment with a relocation

I'm in a process of making a native compiled language using LLVM as backend.
For a couple of special features I need to be able to do two things via LLVM API:
Provide custom relocations into both data and code segments to LLVM
Ability to insert a constant value (specifically arrays, but it doesn't really matter) into code segment, in specific places in between functions and create a relocation of it for LLVM-defined objects (assume functions, but it doesn't matter).
It looks like if I insert non-zero-initialized global variables, they go into the segment in the order of their declaration in the LLVM IR module. I would like to do the same in the code segment, but it is read-only at runtime, so let them be constants as in the rdata segment.
For example:
@myConst1 = const [2 x i32 (i32)*] {MyProc1, MyProc2} // how do I put this into code segment before first instruction of MyProc1?
define i32 @MyProc1() !dbg !524 {
ret i32 5
}
@myConst2 = const [16 x i8] zeroinitializer // ideally would like to do this, and create relocations manually into this array for two pointers to both MyProc1 and MyProc2
define i32 @MyProc2(i32 %0) !dbg !524 {
ret i32 %0
}
Is this even possible to do with LLVM and its API?
If yes, I need help to understand how, as after reading a ton of documentation I'm unable to figure out how.
Thank you.

How LLVM mem2reg pass works

mem2reg is an important optimization pass in llvm. I want to understand how this optimization works but didn't find good articles, books, tutorials and similar.
I found these two links:
https://blog.katastros.com/a?ID=01300-3d6589c1-1993-4fb1-8975-939f10c20503
https://www.zzzconsulting.se/2018/07/16/llvm-exercise.html
Both links explain that one can use Cytron's classical SSA algorithm to implement this pass, but reading the original paper I didn't see how alloca instructions are converted to registers.
As alloca is an instruction specific to llvm IR, I wonder if the algorithm that converts alloca instructions to registers is an ad-hoc algorithm that only works for llvm. Or if there is a general theory framework that I just don't know the name yet that explains how to promote memory variables to registers variables.
On the official documentation, it says:
mem2reg: This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.
By this description, it seems that one just needs to check whether all users of the variable are load and store instructions and, if so, it can be promoted to a register.
So if you can link me to articles, books, algorithms and so on I would appreciate.
"Efficiently Computing Static Single Assignment Form and the Control Dependence Graph" by Cytron, Ferrante et al.
The alloca instruction is named after the alloca() function in C, the rarely used counterpart to malloc() that allocates memory on the stack instead of the heap, so that the memory disappears when the function returns without needing to be explicitly freed.
In the paper, figure 2 shows an example of "straight-line code":
V ← 4
← V + 5
V ← 6
← V + 7
That text isn't valid LLVM IR. If we wanted to rewrite the same example in pre-mem2reg LLVM IR, it would look like this:
; V ← 4
store i32 4, i32* %V.addr
; ← V + 5
%tmp1 = load i32, i32* %V.addr
%tmp2 = add i32 %tmp1, 5
store i32 %tmp2, i32* %V.addr
; V ← 6
store i32 6, i32* %V.addr
; ← V + 7
%tmp3 = load i32, i32* %V.addr
%tmp4 = add i32 %tmp3, 7
store i32 %tmp4, i32* %V.addr
It's easy enough to see in this example how you could always replace %tmp1 with i32 4 using store-to-load forwarding, but you can't always remove the final store. Knowing nothing about %V.addr means that we must assume it may be used elsewhere.
If you know that %V.addr is an alloca, you can simplify a lot of things. You can see the alloca instruction, so you know you can see all its uses, and you never have to worry that a store to %ptr may alias with %V.addr. You see the allocation, so you know what its alignment is. You know that accesses through the pointer cannot fault anywhere in the function, since there is no free() equivalent for an alloca that anyone could call. If you see a store that isn't loaded before the end of the function, you can eliminate it as a dead store, since the alloca's lifetime ends at the function return. Finally, you may delete the alloca itself once you've removed all its uses.
Thus if you start with an alloca whose only users are loads and stores, the problem has little in common with most memory optimization problems: there's no possibility of pointer aliasing nor concern about faulting memory accesses. What you do need to do is place the ϕ functions (phi instructions in LLVM) in the right places given the control flow graph, and that's what the paper describes how to do.
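For the straight-line example, promotion reduces to forwarding the last stored value to each later load and then dropping the loads and stores. The sketch below shows only that single-block case on a made-up mini-IR (`Op`, `Inst` and `promoteSlot` are hypothetical names, not LLVM's classes); it does not place ϕ functions, which is the part the paper's dominance-frontier algorithm handles once control flow is involved.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical mini-IR: every Store/Load refers to the one promotable slot.
enum class Op { Store, Load, Add };

struct Inst {
    Op op;
    std::string dest;  // result name (Load, Add); unused for Store
    std::string lhs;   // Store: value being stored; Add: first operand
    std::string rhs;   // Add: second operand
};

// Promote the slot within one basic block: loads are replaced by the value
// most recently stored, and the loads and stores themselves disappear.
std::vector<Inst> promoteSlot(const std::vector<Inst> &block) {
    std::string current = "undef";                 // slot contents so far
    std::map<std::string, std::string> forwarded;  // load result -> value
    std::vector<Inst> out;

    auto resolve = [&](const std::string &name) -> std::string {
        auto it = forwarded.find(name);
        return it == forwarded.end() ? name : it->second;
    };

    for (const Inst &inst : block) {
        switch (inst.op) {
        case Op::Store:
            current = resolve(inst.lhs);
            break;
        case Op::Load:
            forwarded[inst.dest] = current;
            break;
        case Op::Add:
            out.push_back({Op::Add, inst.dest, resolve(inst.lhs), resolve(inst.rhs)});
            break;
        }
    }
    return out;
}
```

Fed the eight memory and arithmetic instructions of the example, this leaves two adds whose operands are the constants 4, 5 and 6, 7. The trailing stores vanish because the slot is dead once the function returns.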

How to use llvm elementwise atomic intrinsics?

LLVM has elementwise atomic intrinsics (see here). However, when I try to use them with IR like the following:
call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 4, i32 1)
I get this error at link time:
undefined reference to `__llvm_memcpy_element_unordered_atomic_1'
Is there a special library I need to link against, or something similar?
Also, this happens with code generated using the IRBuilder::CreateElementUnorderedAtomicMemCpy method.

How can I find the size of a type?

I'm holding a Type* in my hand. How do I find out its size (the size objects of this type will occupy in memory) in bits / bytes? I see all kinds of methods allowing me to get "primitive" or "scalar" size, but that won't help me with aggregate types...
If you only need the size because you are inserting it into the IR (e.g., so you can send it to a call to malloc()), you can use the getelementptr instruction to do the dirty work (with a little casting), as described here (with updating for modern LLVM):
Though LLVM does not contain a special purpose sizeof/offsetof instruction, the
getelementptr instruction can be used to evaluate these values. The basic idea
is to use getelementptr from the null pointer to compute the value as desired.
Because getelementptr produces the value as a pointer, the result is casted to
an integer before use.
For example, to get the size of some type, %T, we would use something like
this:
%Size = getelementptr %T* null, i32 1
%SizeI = ptrtoint %T* %Size to i32
This code is effectively pretending that there is an array of T elements,
starting at the null pointer. This gets a pointer to the 2nd T element
(element #1) in the array and treats it as an integer. This computes the
size of one T element.
The good thing about doing this is that it is useful in exactly the cases where you do not care what the value is; where you just need to pass the correct value from the IR to something. That's by far the most common case for my need for sizeof()-alike operations in the IR generation.
The page also goes on to describe how to do an offsetof() equivalent:
To get the offset of some field in a structure, a similar trick is used. For
example, to get the address of the 2nd element (element #1) of { i8, i32* }
(which depends on the target alignment requirement for pointers), something
like this should be used:
%Offset = getelementptr {i8,i32*}* null, i32 0, i32 1
%OffsetI = ptrtoint i32** %Offset to i32
This works the same way as the sizeof trick: we pretend there is an instance of
the type at the null pointer and get the address of the field we are interested
in. This address is the offset of the field.
Note that in both of these cases, the expression will be evaluated to a
constant at code generation time, so there is no runtime overhead to using this
technique.
The IR optimizer also converts the values to constants.
The size depends on the target (for several reasons, alignment being one of them).
In LLVM versions 3.2 and above, you need to use DataLayout, in particular its getTypeAllocSize method. This returns the size in bytes; there's also a bit version named getTypeAllocSizeInBits. A DataLayout instance can be obtained by creating it from the current module: DataLayout* TD = new DataLayout(M).
With LLVM up to and including version 3.1, use TargetData instead of DataLayout. It exposes the same getTypeAllocSize methods, though.
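To see why the size depends on the target, here is a rough sketch of the usual layout rule for a non-packed struct: each field is placed at the next multiple of its alignment, and the total is rounded up to the largest field alignment. This is an approximation for illustration, not LLVM's implementation; `Field`, `Layout` and `layoutStruct` are hypothetical names, and the sizes and alignments passed in are assumptions about a target that a real DataLayout would supply.

```cpp
#include <cstdint>
#include <vector>

// Size and alignment of one field, in bytes (alignment must be non-zero).
struct Field {
    std::uint64_t size;
    std::uint64_t align;
};

struct Layout {
    std::vector<std::uint64_t> offsets;  // byte offset of each field
    std::uint64_t allocSize;             // size including tail padding
    std::uint64_t align;                 // alignment of the whole struct
};

// Round value up to the next multiple of align.
static std::uint64_t alignTo(std::uint64_t value, std::uint64_t align) {
    return (value + align - 1) / align * align;
}

// Lay out the fields in declaration order, as for a non-packed struct.
Layout layoutStruct(const std::vector<Field> &fields) {
    Layout layout{{}, 0, 1};
    std::uint64_t offset = 0;
    for (const Field &f : fields) {
        offset = alignTo(offset, f.align);
        layout.offsets.push_back(offset);
        offset += f.size;
        if (f.align > layout.align) layout.align = f.align;
    }
    layout.allocSize = alignTo(offset, layout.align);
    return layout;
}
```

For { i8, i32* } with 8-byte, 8-aligned pointers, `layoutStruct({{1, 1}, {8, 8}})` puts the pointer at offset 8 and gives an allocation size of 16. With 4-byte pointers the same struct has its pointer at offset 4 and a size of 8, so neither number can be hard-coded in a front end.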

How to see lowered c++

I'm trying to improve my understanding of how C++ actually works. Is there a way to see how the compiler lowers my code into something simpler? For example, I'd like to see how all the copy constructors are called, how overloaded function calls have been resolved, all the template expansion and instantiation complete, etc. Right now I'm learning about how C++ compilers interpret my code through experimentation, but it'd be nice just to see a lowered form of my code, even if it is very ugly. I'm looking for something analogous to g++ -E, which shows the result of the preprocessor, but for C++.
Edit: I should have added that I'm not looking for a disassembler. There's a huge gulf between C++ source code and assembled code. Inside this gulf are complicated things like template meta-programming and all sorts of implicit calls to operator methods (assignments! casts! constructors! ...) as well as heavily overloaded functions with very complicated resolution rules, etc. I'm looking for tools to help me understand how my code is interpreted by the C++ compiler. Right now, the only thing I can do is try little experiments and piecemeal put together an understanding of what the compiler is doing. I'd like to see more detail on what's going on. It would help greatly, for example, in debugging template metaprogramming problems.
At the moment, I think that your best bet is Clang (you can try some simple code on the Try Out LLVM page).
When compiling C, C++ or Obj-C with Clang/LLVM, you may ask the compiler to emit the Intermediate Representation (LLVM IR) instead of going the full way to assembly/binary form.
The LLVM IR is a fully specified language used internally by the compiler:
Clang lowers the C++ code to LLVM IR
LLVM optimizes the IR
An LLVM backend (for example x86) produces the assembly from the IR
The IR is the last step before machine-specific code, so you don't have to learn specific assembly directives and you still get a very low-level representation of what's really going on under the hood.
You can get the IR both before and after optimizations, the latter being more representative of real code, but further away from what you originally wrote.
Example with a C program:
#include <stdio.h>
#include <stdlib.h>
static int factorial(int X) {
if (X == 0) return 1;
return X*factorial(X-1);
}
int main(int argc, char **argv) {
printf("%d\n", factorial(atoi(argv[1])));
}
Corresponding IR:
; ModuleID = '/tmp/webcompile/_10956_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00"
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
; <label>:0
%1 = getelementptr inbounds i8** %argv, i64 1
%2 = load i8** %1, align 8, !tbaa !0
%3 = tail call i64 @strtol(i8* nocapture %2, i8** null, i32 10) nounwind
%4 = trunc i64 %3 to i32
%5 = icmp eq i32 %4, 0
br i1 %5, label %factorial.exit, label %tailrecurse.i
tailrecurse.i: ; preds = %tailrecurse.i, %0
%indvar.i = phi i32 [ %indvar.next.i, %tailrecurse.i ], [ 0, %0 ]
%accumulator.tr1.i = phi i32 [ %6, %tailrecurse.i ], [ 1, %0 ]
%X.tr2.i = sub i32 %4, %indvar.i
%6 = mul nsw i32 %X.tr2.i, %accumulator.tr1.i
%indvar.next.i = add i32 %indvar.i, 1
%exitcond = icmp eq i32 %indvar.next.i, %4
br i1 %exitcond, label %factorial.exit, label %tailrecurse.i
factorial.exit: ; preds = %tailrecurse.i, %0
%accumulator.tr.lcssa.i = phi i32 [ 1, %0 ], [ %6, %tailrecurse.i ]
%7 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %accumulator.tr.lcssa.i) nounwind
ret i32 0
}
declare i32 @printf(i8* nocapture, ...) nounwind
declare i64 @strtol(i8*, i8** nocapture, i32) nounwind
!0 = metadata !{metadata !"any pointer", metadata !1}
!1 = metadata !{metadata !"omnipotent char", metadata !2}
!2 = metadata !{metadata !"Simple C/C++ TBAA", null}
I personally find it relatively readable (it tries to preserve the variable names, somewhat, the function names are still there) once you get past the original discovery of the language.
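Reading the optimized IR back into source form can help while you are getting used to it. The sketch below is a hand translation of the tailrecurse.i loop, with local names borrowed from the IR values; `factorialLoop` is a name made up for this illustration. It shows that the optimizer inlined factorial into main and turned the recursion into a loop with an accumulator.

```cpp
// Hand-written loop equivalent of the optimized IR (assumes x >= 0).
int factorialLoop(int x) {
    if (x == 0) return 1;        // %5 = icmp eq ..., branch to factorial.exit
    int accumulator = 1;         // %accumulator.tr1.i starts at 1
    int indvar = 0;              // %indvar.i starts at 0
    do {
        // %X.tr2.i = x - indvar;  %6 = %X.tr2.i * accumulator
        accumulator = (x - indvar) * accumulator;
        ++indvar;                // %indvar.next.i
    } while (indvar != x);       // %exitcond
    return accumulator;          // %accumulator.tr.lcssa.i
}
```

The two phi instructions at the top of tailrecurse.i correspond to the two loop-carried variables here: each phi picks the initial value when control arrives from the entry block and the updated value when it arrives from the loop body.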
The first C++ compiler was cfront, which was, as the name implies, a front-end for C; in theory, cfront's output is what you'd like to see. But cfront hasn't been available for many years; it was a commercial product, and the source is not available.
Modern C++ compilers don't use a C intermediary; if there's an intermediary at all, it's an internal compiler representation, not something you'd enjoy looking at! The -S option to g++ will spit out *.s files: assembly code, which includes just enough symbols that you could, in theory, follow it.
The very first (circa 1989) C++ compilers compiled C++ into C. But that's not been true for a very long time, very long time meaning I know of no widely available compiler that did things that way for the last 15 years. The best you are going to do is look at the assembly language output, which requires some amount of knowledge and analysis to understand.
The assembly level output of a C++ compiler is generally not called 'lowered'. It's called 'compiled'. I can understand how you might come by that terminology. Assembly is a lower-level language. But that's not the terminology everybody else uses and it will confuse people if you use it.
Most popular C++ compilers have an option somewhere that allows you to see the assembly level output. The Open Source g++ compiler has the -S option that does this. It will create a file that ends in .s. You can look through this file to see the resulting assembly language.
For the assembly language to correspond more directly to the C++ code, I would recommend compiling with the -O0 option to turn off optimization. Optimization can produce assembly code that bears little or no obvious resemblance to the original C++ code, though viewing that code can help you understand what the optimizer is doing.
Another problem is that the symbols (the names for functions and classes and things) in the assembly output will be what is called 'mangled'. This is because most assembly languages do not allow :: as part of the symbol name, and because C++ can also have the same names for different kinds of symbols. The compiler transforms the names of things in your C++ code into different names that will be valid in the assembly code.
For g++ this mangling can be undone with the c++filt program.
c++filt <myprogram.s >myprogram_demangled.s
This will help make the assembly file more readable.
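The same demangling is available programmatically through `abi::__cxa_demangle` from `<cxxabi.h>`, which GCC and Clang provide as part of the Itanium C++ ABI support library. A minimal sketch, assuming one of those toolchains; `demangle` is a hypothetical wrapper name, and falling back to the input on failure is a choice made here:

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name, or the input
// unchanged if the name cannot be demangled.
std::string demangle(const std::string &mangled) {
    int status = 0;
    // With a null output buffer, the result is allocated with malloc.
    char *buffer = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || buffer == nullptr) return mangled;
    std::string result(buffer);
    std::free(buffer);
    return result;
}
```

For example, `demangle("_ZN3Foo3barEi")` gives "Foo::bar(int)": the mangled name encodes the enclosing scope and the parameter types, which is how overloads with the same source name end up as distinct assembly symbols.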
As a first step, you can preprocess it (this is the first thing the compiler actually does before compiling)
with cpp or g++ -E
The second step is to parse and translate it
with g++ -S
This link about the compilation process might interest you
You can run g++ (or any gcc front-end) with one or more of the -fdump-tree- flags (complete list), which will dump intermediate representations of the code from different compiler passes in an output format that looks similar to C. However, this output is usually quite hard to read, with lots of compiler-generated temporary variables and other artifacts of compilation. It's mainly intended for debugging the compiler itself, but for simple examples you might be able to infer what gcc is doing with your C++ code by studying the intermediate representation.
The Comeau C++ compiler generates C code. But you'll have to pay for it.
Instead of experimenting, you can use a debugger to step through your code. That way you can easily see which constructors and overloaded functions are actually called.
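A lighter-weight variant of the same idea is to instrument a small class so that each implicit call leaves a trace. The sketch below counts copy constructions, move constructions and copy assignments; `Tracer`, `Counts`, `byValue`, `byRef` and `runExperiment` are all names invented for this illustration.

```cpp
#include <utility>

// Tallies of the special member functions the compiler chose to call.
struct Counts {
    int copyCtor = 0;
    int moveCtor = 0;
    int copyAssign = 0;
};

struct Tracer {
    static Counts counts;
    Tracer() = default;
    Tracer(const Tracer &) { ++counts.copyCtor; }
    Tracer(Tracer &&) noexcept { ++counts.moveCtor; }
    Tracer &operator=(const Tracer &) {
        ++counts.copyAssign;
        return *this;
    }
};

Counts Tracer::counts;

void byValue(Tracer) {}
void byRef(const Tracer &) {}

// Runs a few calls and reports which special members were invoked.
Counts runExperiment() {
    Tracer::counts = Counts{};
    Tracer a;
    Tracer b;
    byValue(a);             // lvalue passed by value: copy constructor
    byRef(a);               // bound to a reference: nothing is called
    b = a;                  // copy assignment
    byValue(std::move(a));  // rvalue passed by value: move constructor
    return Tracer::counts;
}
```

After `runExperiment()` each of the three counters is 1. Replacing the counters with print statements, or setting breakpoints inside the special members, shows the order in which the calls happen as well.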