Simplify expressions in LLVM SSA - llvm

There is a pass that breaks a constant GEP expression out of an instruction's operand into its own instruction, so that such nested GEP expressions become explicit and are thus easier to work with in subsequent passes.
Now I have a similar problem. This SSA Phi instruction (link):
while.cond: ; preds = %while.body, %entry
%n.0 = phi %struct.Node* [ bitcast ({ %struct.Node*, i32, [4 x i8] }* @n1 to %struct.Node*), %entry ], [ %13, %while.body ]
...
contains a bitcast constant expression (link) as an "inlined" operand. Is there a pass that breaks up the SSA of a given module into its most basic instructions, essentially "un-inlining" such nested expressions to make them explicit SSA instructions?

I don't know of any such pass.
However, it looks to me that modifying SAFECode's BreakConstantGEPs pass to do that should be very easy: just change the condition that decides whether an instruction is initially inserted into the worklist from the operand loop checking hasConstantGEP to isa&lt;PHINode&gt;.
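As a sketch of what such a rewrite would produce (value name %n1.cast is made up for illustration), the phi's constant-expression operand becomes an explicit instruction placed in the corresponding predecessor block, since the new instruction must dominate its use in the phi:

```llvm
; before: the bitcast is a constant expression folded into the phi operand
while.cond:
  %n.0 = phi %struct.Node* [ bitcast ({ %struct.Node*, i32, [4 x i8] }* @n1 to %struct.Node*), %entry ], [ %13, %while.body ]

; after: the bitcast is an explicit instruction in the incoming block
entry:
  ...
  %n1.cast = bitcast { %struct.Node*, i32, [4 x i8] }* @n1 to %struct.Node*
  br label %while.cond

while.cond:
  %n.0 = phi %struct.Node* [ %n1.cast, %entry ], [ %13, %while.body ]
```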

Related

Write or Read Instructions in LLVM

I just wanted to make sure I understand getOperand() right. It seems like getOperand() returns operands in reverse order:
so if I have:
%1 = mul nsw i32 7, 2 ; the C source code is: a = 7; b = a*2
ret i32 %1 ; the C source code is: return a;
Correct me if I'm wrong:
In the first instruction, getOperand(0) gives me 'i32' (what is being read) and getOperand(1) 'nsw' (what is being written to).
In the second instruction, the only operand is i32 which is being read.
So I guess my question is, if the instruction is writing to something, is it the last operand?
The mul instruction is multiplication, so no, its operands do not correspond to those C expressions. getOperand(0) and getOperand(1) return the two multiplicands in source order (the constants 7 and 2 here); 'i32' is the operands' type and 'nsw' is an overflow flag, so neither is an operand, and the result %1 is the instruction itself rather than an operand. You see this instruction instead of allocas and stores because Clang figured out your code is a constant expression and propagated it. And AFAIK, there is nothing you can do to stop it: Clang performs this constant propagation even at -O0.

Passing structs by-value in LLVM IR

I'm generating LLVM IR for JIT purposes, and I notice that LLVM's calling conventions don't seem to match the C calling conventions when aggregate values are involved. For instance, when I declare a function as taking a {i32, i32} (that is, a struct {int a, b;} in C terms) parameter, it appears to pass each of the struct elements in its own x86-64 GPR to the function, even though the x86-64 ABI specifies (sec. 3.2.3) that such a struct should be packed in a single 64-bit GPR.
This is in spite of LLVM's documentation claiming to match the C calling convention by default:
“ccc” - The C calling convention
This calling convention (the default if no other calling convention is specified) matches the target C calling conventions. This calling convention supports varargs function calls and tolerates some mismatch in the declared prototype and implemented declaration of the function (as does normal C).
My question, then, is: Am I doing something wrong to cause LLVM to not match the C calling convention, or is this known behavior? (At the very least, the documentation seems to be wrong, no?)
I can find only very few references to the issue on the web, such as this bug report from 2007, which claims to be fixed. It also claims that "First, LLVM has no way to deal with aggregates as singular Value*'s"; I don't know whether that was true in 2007, but it doesn't seem to be true now, given the extractvalue/insertvalue instructions. I also found this SO question whose second (non-accepted) answer seems to implicitly accept that argument coercion has to be done manually.
I'm currently building code for doing argument coercion in my IR generator, but it is complicating my design considerably (not to mention making it architecture-specific), so if I'm simply doing something wrong, I'd rather know about that. :)
LLVM's support for C-language compatible calling convention is extremely limited I'm afraid. Several folks have wished for more direct calling convention support in LLVM (or a related library), but so far this has not emerged. That logic is currently encoded in the C-language frontend (Clang for example).
What LLVM provides is a mapping from specific LLVM IR types to specific C ABI lowerings for a specific CPU backend. You can see which IR types to use for a given C function by using Clang to emit LLVM IR:
https://c.compiler-explorer.com/z/8jWExWPYq
struct S { int x, y; };
void f(struct S s);
void test(int x, int y) {
struct S s = {x, y};
f(s);
}
Turns into:
define dso_local void @test(i32 noundef %0, i32 noundef %1) #0 {
%3 = alloca i32, align 4
%4 = alloca i32, align 4
%5 = alloca %struct.S, align 4
store i32 %0, ptr %3, align 4
store i32 %1, ptr %4, align 4
%6 = getelementptr inbounds %struct.S, ptr %5, i32 0, i32 0
%7 = load i32, ptr %3, align 4
store i32 %7, ptr %6, align 4
%8 = getelementptr inbounds %struct.S, ptr %5, i32 0, i32 1
%9 = load i32, ptr %4, align 4
store i32 %9, ptr %8, align 4
%10 = load i64, ptr %5, align 4
call void @f(i64 %10)
ret void
}
declare void @f(i64) #1
There is sadly some non-trivial logic to map specific C types into the LLVM IR that will match the ABI when lowered for a platform. Outside of extremely simple types (basic C integer types, pointers, float, double, maybe a few others), these aren't even portable between the different architecture ABIs/calling-conventions.
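For the {i32, i32} example above, the coercion Clang performs for the x86-64 SysV ABI can be sketched in plain C++ terms: the eight bytes of the struct are reinterpreted as a single 64-bit integer that travels in one GPR. The function names here are illustrative, not part of any LLVM API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct S { int x, y; };

// Caller side: pack the 8-byte struct into one 64-bit integer, mirroring
// the i64 coercion in the IR above (memcpy is the well-defined way to
// type-pun in C++).
std::uint64_t coerce(S s) {
    std::uint64_t bits;
    static_assert(sizeof s == sizeof bits, "S must occupy 8 bytes");
    std::memcpy(&bits, &s, sizeof bits);
    return bits;
}

// Callee side: recover the struct from the register image.
S uncoerce(std::uint64_t bits) {
    S s;
    std::memcpy(&s, &bits, sizeof s);
    return s;
}
```

This only illustrates the bit-level packing; the hard part, as described above, is knowing which coercion each target ABI requires.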
FWIW, the situation is even worse for C++ which has much more complexity here I'm afraid.
So your choices are to:
Use a very small set of types in a limited range of signatures that you build custom logic to lower correctly into LLVM IR, checking that it matches what Clang (or another C frontend) produces in every case.
Directly use Clang or another C frontend to emit the LLVM IR.
Take on the major project of extracting this ABI/calling-convention logic from Clang into a re-usable library. There has in the past been appetite for this in the LLVM/Clang communities, but it is a very large and complex undertaking from my understanding. There are some partial efforts (specifically for C and JITs) that you may be able to find and re-use, but I don't have a good memory of where all those are.

Compile-time constant into LLVM intrinsic

I have a compile-time constant and I need to pass it to an intrinsic through its arguments e.g.
@1 = private constant [4 x i8] c"dev\00", align 1
; intrinsic
define linkonce i32 @myIntrinsic( i32 %p0 ) alwaysinline {
%r0 = call i32 asm sideeffect " instr $0(add_constant_here);", "=r"(i32 %p0)
ret i32 %r0
}
Unfortunately, I know that inline asm only deals with string literals; is there any other way I can accomplish this?
Simple and easy: I wrote my own inline string.
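The idea behind "my own inline string": since the asm template must be a literal at the IR level, splice the constant into the template string while generating the IR, before the InlineAsm value is created. A minimal sketch, with a made-up helper name and the question's placeholder template:

```cpp
#include <cassert>
#include <string>

// Build the inline-asm template at IR-generation time, substituting the
// compile-time constant into the string before it is handed to LLVM.
std::string make_asm_template(const std::string &constant) {
    return " instr $0(" + constant + ");";
}
```

The resulting string would then be used where the literal template appeared before, e.g. make_asm_template("dev") in place of " instr $0(add_constant_here);".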

What is the performance hit for the compiler if it initializes variables?

Sutter says this:
"In the low-level efficiency tradition of C and C++ alike, the
compiler is often not required to initialize variables unless you do
it explicitly (e.g., local variables, forgotten members omitted from
constructor initializer lists)"
I have always wondered why the compiler doesn't initialize primitives like int32 and float to 0. What is the performance hit if the compiler initializes it? It should be better than incorrect code.
This argument is incomplete, actually. Uninitialized variables may exist for two reasons: efficiency and the lack of a suitable default.
1) Efficiency
It is, mostly, a left-over of the old days, when C compilers were simply C to assembly translators and performed no optimization whatsoever.
These days we have smart compilers and Dead Store Elimination which in most cases will eliminate redundant stores. Demo:
int foo(int a) {
int r = 0;
r = a + 3;
return r;
}
Is transformed into:
define i32 @foo(i32 %a) nounwind uwtable readnone {
%1 = add nsw i32 %a, 3
ret i32 %1
}
Still, there are cases where even the smartest compiler cannot eliminate the redundant store, and this may have an impact. For a large array that is later initialized piecemeal, the compiler may not realize that all values will end up being initialized and thus may not remove the redundant writes:
int foo(int a) {
int* r = new int[10]();
for (unsigned i = 0; i <= a; ++i) {
r[i] = i;
}
return r[a % 2];
}
Note in the following the call to memset (which I forced by suffixing the new call with (), requesting value initialization). It was not eliminated even though the zeros are unneeded.
define i32 @_Z3fooi(i32 %a) uwtable {
%1 = tail call noalias i8* @_Znam(i64 40)
%2 = bitcast i8* %1 to i32*
tail call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 40, i32 4, i1 false)
br label %3
; <label>:3 ; preds = %3, %0
%i.01 = phi i32 [ 0, %0 ], [ %6, %3 ]
%4 = zext i32 %i.01 to i64
%5 = getelementptr inbounds i32* %2, i64 %4
store i32 %i.01, i32* %5, align 4, !tbaa !0
%6 = add i32 %i.01, 1
%7 = icmp ugt i32 %6, %a
br i1 %7, label %8, label %3
; <label>:8 ; preds = %3
%9 = srem i32 %a, 2
%10 = sext i32 %9 to i64
%11 = getelementptr inbounds i32* %2, i64 %10
%12 = load i32* %11, align 4, !tbaa !0
ret i32 %12
}
2) Default ?
The other issue is the lack of a suitable value. While a float could perfectly well be initialized to NaN, what of integers? There is no integer value that represents the absence of a value, none at all! 0 is one candidate (among others), but one could argue it's one of the worst candidates: it's a very likely number, and thus likely has a specific meaning for the use case at hand; are you sure you are comfortable with this meaning being the default?
Food for thought
Finally, there is one neat advantage of uninitialized variables: they are detectable. The compiler may issue warnings (if it's smart enough), and Valgrind will raise errors. This makes logical issues detectable, and only what is detected can be corrected.
Of course a sentinel value, such as NaN, would be as useful. Unfortunately... there is none for integers.
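The detectability point can be seen with a snippet like the following, which gcc and clang flag with -Wuninitialized / -Wmaybe-uninitialized. The function is only a demonstration; calling it with flag == 0 would read an indeterminate value:

```cpp
// 'v' is deliberately left uninitialised: the compiler can warn that the
// return statement may read it before any assignment has happened.
int risky(int flag) {
    int v;          // no initialiser
    if (flag)
        v = 1;      // only one path assigns
    return v;       // warning: 'v' may be used uninitialized here
}
```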
There are two ways in which initialisation might impact performance.
First, initialising a variable takes time. Granted, for a single variable it's probably negligible, but as others have suggested, it can add up with large numbers of variables, arrays, etc.
Second, who's to say that zero is a reasonable default? For every variable for which zero is a useful default, there's probably another one for which it isn't. In that case, if you do initialise to zero you then incur further overhead re-initialising the variable to whatever value you actually want. You essentially pay the initialisation overhead twice, rather than once if the default initialisation does not occur. Note that this is true no matter what you choose as a default value, zero or otherwise.
Given the overhead exists, it's typically more efficient to not initialise and to let the compiler catch any references to uninitialised variables.
Basically, a variable references a place in memory which can be modified to hold data. For an uninitialized variable, all that the program needs to know is where this place is, and the compiler usually figures this out ahead of time, so no instructions are required. But when you want it initialized (to, say, 0), the program needs an extra instruction to do so.
One idea might be to zero out the entire heap while the program is starting, with memset, then initialize all of the static stuff, but this isn't needed for anything that is set dynamically before it's read. This would also be a problem for stack-based functions which would need to zero out their stack frame every time a function is called. In short, it's much more efficient to allow variables to default to undefined, particularly when the stack is frequently being overwritten with newly called functions.
Compile with -Wmaybe-uninitialized and find out. Those are the only places where the compiler would not be able to optimize out the primitive initialization.
As for the heap ...

Emitting LLVM bitcode from clang: 'byval' attribute for passing objects with nontrivial destructor into a function

I have C++ source code which I parse using clang, producing LLVM bitcode. From this point I want to process the file myself...
However, I encountered a problem. Consider the following scenario:
- I create a class with a nontrivial destructor or copy constructor.
- I define a function, where an object of this class is passed as a parameter, by value (no reference or pointer).
In the produced bitcode, I get a pointer instead. For classes without the destructor, the parameter is annotated as 'byval', but it is not so in this case.
As a result, I cannot distinguish if the parameter is passed by value, or really by a pointer.
Consider the following example:
Input file - cpass.cpp:
class C {
public:
int x;
~C() {}
};
void set(C val, int x) {val.x=x;};
void set(C *ptr, int x) {ptr->x=x;}
Compilation command line:
clang++ -c cpass.cpp -emit-llvm -o cpass.bc; llvm-dis cpass.bc
Produced output file (cpass.ll):
; ModuleID = 'cpass.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
%class.C = type { i32 }
define void @_Z3set1Ci(%class.C* %val, i32 %x) nounwind {
%1 = alloca i32, align 4
store i32 %x, i32* %1, align 4
%2 = load i32* %1, align 4
%3 = getelementptr inbounds %class.C* %val, i32 0, i32 0
store i32 %2, i32* %3, align 4
ret void
}
define void @_Z3setP1Ci(%class.C* %ptr, i32 %x) nounwind {
%1 = alloca %class.C*, align 8
%2 = alloca i32, align 4
store %class.C* %ptr, %class.C** %1, align 8
store i32 %x, i32* %2, align 4
%3 = load i32* %2, align 4
%4 = load %class.C** %1, align 8
%5 = getelementptr inbounds %class.C* %4, i32 0, i32 0
store i32 %3, i32* %5, align 4
ret void
}
As you can see, the parameters of both set functions look exactly the same. So how can I tell that the first function was meant to take the parameter by value, instead of a pointer?
One solution could be to somehow parse the mangled function name, but it may not be always viable. What if somebody puts extern "C" before the function?
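The mangled-name approach can at least be prototyped with the Itanium ABI demangler that gcc and clang ship in cxxabi.h; it recovers the parameter types, though, as noted, it tells you nothing for extern "C" functions:

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI symbol; returns the input unchanged when it is
// not a valid mangled name (e.g. an extern "C" function).
std::string demangle(const char *mangled) {
    int status = 0;
    char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && out != nullptr) ? out : mangled;
    std::free(out);
    return result;
}
```

For the two functions above this yields "set(C, int)" and "set(C*, int)", which is enough to tell by-value from by-pointer, but only for C++-mangled symbols.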
Is there a way to tell clang to keep the byval annotation, or to produce an extra annotation for each function parameter passed by a value?
Anton Korobeynikov suggests that I should dig into clang's LLVM IR emission. Unfortunately I know almost nothing about clang internals, the documentation is rather sparse. The Internals Manual of clang does not talk about IR emission. So I don't really know how to start, where to go to get the problem solved, hopefully without actually going through all of clang source code. Any pointers? Hints? Further reading?
In response to Anton Korobeynikov:
I know more-or-less what the C++ ABI looks like with respect to parameter passing. I found some good reading here: http://agner.org/optimize/calling_conventions.pdf. But this is very platform dependent! This approach might not be feasible on different architectures or in some special circumstances.
In my case, for example, the function is going to be run on a different device than where it is being called from. The two devices don't share memory, so they don't even share the stack. Unless the user is passing a pointer (in which case we assume he knows what he is doing), an object should always be passed within the function-parameters message. If it has a nontrivial copy constructor, it should be executed by the caller, but the object should be created in the parameter area as well.
So, what I would like to do is to somehow override the ABI in clang, without too much intrusion into their source code. Or maybe add some additional annotation, which would be ignored in a normal compilation pipeline, but I could detect when parsing the .bc/.ll file. Or somehow differently reconstruct the function signature.
Unfortunately, "byval" is not just an "annotation", it's a parameter attribute which means a lot to optimizers and backends. Basically, the rules for how to pass small structs / classes with and without non-trivial member functions are governed by the platform C++ ABI, so you cannot just always use byval here.
In fact, byval here is just the result of a minor optimization at the frontend level. When you pass something by value, a temporary object has to be constructed on the stack (via the copy ctor). When the class is POD-like, clang can deduce that the copy ctor is trivial and optimize the ctor / dtor pair away, passing just the "contents".
For non-trivial classes (like in your case) clang cannot perform this optimization and has to call both the ctor and the dtor. Thus you see a pointer to the created temporary object.
Try calling your set() functions and you'll see what's going on there.
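At the source level, the two overloads from the question differ exactly in whether the caller's object can be modified, and that is the semantics the temporary-plus-pointer lowering preserves. A self-contained version of the question's example:

```cpp
#include <cassert>

class C {
public:
    int x;
    ~C() {}  // non-trivial destructor: clang passes a pointer to a temporary
};

void set(C val, int x)  { val.x = x; }   // by value: mutates only the copy
void set(C *ptr, int x) { ptr->x = x; }  // by pointer: mutates the caller's object
```

The by-value call leaves the caller's object untouched (only the temporary is modified and then destroyed), while the by-pointer call changes it, even though both parameters look like %class.C* in the IR.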