If I have a class defined as such:
class classWithInt
{
public:
classWithInt();
...
private:
int someInt;
...
};
and someInt is the one and only member variable in classWithInt, how much slower would it be to declare a new instance of this class than to just declare a new integer?
What about when you have, say, 10 such integers in the class? 100?
With a compiler not written by drunk college students in the wee hours of the morning, the overhead is zero. At least until you start putting in virtual functions; then you must pay the cost for the virtual dispatch mechanism. Or if you have no data in the class, in which case the class is still required to take up some space (which in turn is because every object must have a unique address in memory).
Functions are not part of the data layout of the object. They are only a part of the mental concept of the object. The function is translated into code that takes an instance of the object as an additional parameter, and calls to the member function are correspondingly translated to pass the object.
The number of data members doesn't matter. Compare apples to apples; if you have a class with 10 ints in it, then it takes up the same space that 10 ints would.
Allocating things on the stack is effectively free, no matter what they are. The compiler adds up the size of all the local variables and adjusts the stack pointer all at once to make space for them. Allocating space in memory costs, but the cost will probably depend more on the number of allocations than the amount of space allocated.
Well, let's just test it all out. I can compile, with full optimizations, a more complete example like so:
void use(int &);
class classWithInt
{
public:
classWithInt() : someInt(){}
int someInt;
};
class podWithInt
{
public:
int someInt;
};
int main() {
int foo;
classWithInt bar;
podWithInt baz;
use(foo);
use(bar.someInt);
use(baz.someInt);
return 5;
}
And this is the output I get from llvm-gcc:
; ModuleID = '/tmp/webcompile/_21792_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"
%struct.classWithInt = type { i32 }
define i32 @main() {
entry:
%foo = alloca i32, align 4 ; <i32*> [#uses=1]
%bar = alloca %struct.classWithInt, align 8 ; <%struct.classWithInt*> [#uses=1]
%baz = alloca %struct.classWithInt, align 8 ; <%struct.classWithInt*> [#uses=1]
%0 = getelementptr inbounds %struct.classWithInt* %bar, i64 0, i32 0 ; <i32*> [#uses=2]
store i32 0, i32* %0, align 8
call void @_Z3useRi(i32* %foo)
call void @_Z3useRi(i32* %0)
%1 = getelementptr inbounds %struct.classWithInt* %baz, i64 0, i32 0 ; <i32*> [#uses=1]
call void @_Z3useRi(i32* %1)
ret i32 5
}
declare void @_Z3useRi(i32*)
There are some differences in each case. In the simplest case, the POD type differs from the plain int in only one way: it requires a different alignment, 8-byte instead of just 4-byte.
The other noticeable thing is that the POD and bare int do not get initialized. Their storage is used right as is in from the wilderness of the stack. The non-pod type, which has a non-trivial constructor, causes a zero to be stored before the instance can be used for anything else.
I am writing an LLVM pass where I want to print out the values of function arguments. I am only focusing on integers and pointers to integers (which will also involve char, I guess but that is not a problem for my purposes). If the argument is a pointer, I want to dereference it and print the value that is pointed to by the pointer. To guard against null pointers, in normal C code I can do:
void foo(int *bar, int **baz) {
if (bar) {
printf("bar: %d\n", *bar);
}
if (baz) {
printf("baz: %d\n", **baz);
}
}
Is there a way to do something similar in LLVM? Currently I am doing the following to actually load the value by dereferencing the pointer. I am working off a StoreInst that LLVM generates to set up the values of function parameters. For example, for the function above, there would be something like:
%bar.addr = alloca i32*, align 8
%baz.addr = alloca i32**, align 8
store i32* %bar, i32** %bar.addr, align 8
store i32** %baz, i32*** %baz.addr, align 8
So using that StoreInst, I am doing something like this:
auto *value = store->getValueOperand();
auto *type = value->getType();
int indirection = 0;
while (type->isPointerTy()) {
indirection++;
type = type->getPointerElementType();
}
Value* load = value;
unsigned int bitWidth = type->getIntegerBitWidth(); // I have already verified earlier on that
// this eventually resolves to an IntegerType
while (indirection > 0) {
load = (indirection == 1) ? irb.CreateLoad(getIntegerType(bitWidth), load)
: irb.CreateLoad(getPointerToIntegerType(indirection - 1, bitWidth), load);
--indirection;
}
value = load;
The getIntegerType is just a convenience function that returns an IntegerType with the provided bit-width and using the current LLVM context. The getPointerToIntegerType function is as follows:
Type* getPointerToIntegerType(int indirection, unsigned int bitWidth) {
Type* type = getIntegerType(bitWidth);
while (indirection > 0) {
type = PointerType::get(type, 0);
--indirection;
}
return type;
}
Using this I get the following load instructions after the store instructions (I've omitted some stuff from the call instructions for brevity):
store i32* %bar, i32** %bar.addr, align 8
%1 = load i32, i32* %bar
call void (i8, ...) @__print_argument_value(...)
store i32** %baz, i32*** %baz.addr, align 8
%2 = load i32*, i32** %baz
%3 = load i32, i32* %2
call void (i8, ...) @__print_argument_value(...)
This works perfectly as long as the pointers are not null. Is there an easy way I can generate IR to guard the call to __print_argument_value? I did write out an explicit if (bar) { ... } and an if (baz) { ... } and then looked at the generated IR to see how LLVM generates it. I saw that it performs an icmp ne of the pointer (the .addr variable) against null. Then, if the result is true, it branches to a label if.then where it calls the function, or else to the label if.end, which skips over it.
The problem is that I'm having a hard time figuring out how I can generate IR like that from my LLVM pass. Are there any examples I can look at? I did look at the documentation, but I still can't figure out how to put the pieces together using the LLVM API, even though I know what the IR should look like. I saw some documentation about PHI nodes, which I don't see in the IR from my explicit null check, but I am not entirely clear on what they are and how they would help me.
PS: Please excuse any weirdness in the C/C++ code; I usually program in Java.
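One way to generate that icmp/branch shape from a pass is the SplitBlockAndInsertIfThen helper from llvm/Transforms/Utils/BasicBlockUtils.h. The following is only a sketch, not tied to a specific LLVM version; irb is assumed to be the IRBuilder<> from the code above, and ptr the pointer value being guarded:

```cpp
// Sketch: guard the load/print with a null check on `ptr`.
Value *notNull = irb.CreateIsNotNull(ptr);   // emits: icmp ne <ty>* %ptr, null

// Splits the current block at the insertion point and inserts
//   br i1 %notNull, label %if.then, label %if.end
// returning the terminator of the new "then" block.
Instruction *thenTerm =
    SplitBlockAndInsertIfThen(notNull, &*irb.GetInsertPoint(),
                              /*Unreachable=*/false);

// Emit the loads and the call to __print_argument_value inside "then".
irb.SetInsertPoint(thenTerm);
// ... irb.CreateLoad(...) / irb.CreateCall(...) as before ...
```

No PHI node is needed here: PHI nodes merge values produced on different incoming paths, and since the guarded code only calls a void function, nothing flows out of the then-block into if.end.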
In Item 20 of Scott Meyers' Effective C++, he states:
some compilers refuse to put objects consisting of only a double into a register
When passing built-in types by value, compilers will happily place the data in registers and quickly send ints/doubles/floats/etc. along. However, not all compilers will treat small objects with the same grace. I can easily understand why compilers would treat objects differently: between the vtable and all the constructors, passing an object by value can be a lot more work than copying data members.
But still. This seems like an easy problem for modern compilers to solve: "This class is small, maybe I can treat it differently". Meyers' statement seemed to imply that compilers WOULD make this optimization for objects consisting of only an int (or char or short).
Can someone give further insight as to why this optimization sometimes doesn't happen?
I found this document online on "Calling conventions for different C++ compilers and operating systems" (updated on 2018-04-25) which has a table depicting "Methods for passing structure, class and union objects".
From the table you can see that if an object contains a long double, a copy of the entire object is transferred on the stack for all of the compilers shown there.
Also from the same resource (with emphasis added):
There are several different methods to transfer a parameter to a function if the parameter is a structure, class or union object. A copy of the object is always made, and this copy is transferred to the called function either in registers, on the stack, or by a pointer, as specified in table 6. The symbols in the table specify which method to use. S takes precedence over I and R. PI and PS take precedence over all other passing methods.
As table 6 tells, an object cannot be transferred in registers if it is too big or too complex. For example, an object that has a copy constructor cannot be transferred in registers because the copy constructor needs an address of the object. The copy constructor is called by the caller, not the callee.
Objects passed on the stack are aligned by the stack word size, even if higher alignment would be desired. Objects passed by pointers are not aligned by any of the compilers studied, even if alignment is explicitly requested. The 64bit Windows ABI requires that objects passed by pointers be aligned by 16.
An array is not treated as an object but as a pointer, and no copy of the array is made, except if the array is wrapped into a structure, class or union.
The 64 bit compilers for Linux differ from the ABI (version 0.97) in the following respects: Objects with inheritance, member functions, or constructors can be passed in registers. Objects with copy constructor, destructor or virtual are passed by pointers rather than on the stack.
The Intel compilers for Windows are compatible with Microsoft. Intel compilers for Linux are compatible with Gnu.
Here is an example showing that LLVM clang with optimization level O3 treats a class with a single double data member as if it were a double:
$ cat main.cpp
#include <stdio.h>
class MyDouble {
public:
double d;
MyDouble(double _d):d(_d){}
};
void foo(MyDouble d)
{
printf("%lg\n",d.d);
}
int main(int argc, char **argv)
{
if (argc>5)
{
double x=(double)argc;
MyDouble d(x);
foo(d);
}
return 0;
}
When I compile it and view the generated bitcode file, I see that foo behaves
as if it operates on a double type input parameter:
$ clang++ -O3 -c -emit-llvm main.cpp
$ llvm-dis main.bc
Here is the relevant part:
; Function Attrs: nounwind uwtable
define void @_Z3foo8MyDouble(double %d.coerce) #0 {
entry:
%call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([5 x i8]* @.str, i64 0, i64 0), double %d.coerce)
ret void
}
See how foo declares its input parameter as double, and moves it around for printing "as is". Now let's compile the exact same code with O0:
$ clang++ -O0 -c -emit-llvm main.cpp
$ llvm-dis main.bc
When we look at the relevant part, we see that clang uses a getelementptr instruction to access its first (and only) data member d:
; Function Attrs: uwtable
define void @_Z3foo8MyDouble(double %d.coerce) #0 {
entry:
%d = alloca %class.MyDouble, align 8
%coerce.dive = getelementptr %class.MyDouble* %d, i32 0, i32 0
store double %d.coerce, double* %coerce.dive, align 1
%d1 = getelementptr inbounds %class.MyDouble* %d, i32 0, i32 0
%0 = load double* %d1, align 8
%call = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([5 x i8]* @.str, i32 0, i32 0), double %0)
ret void
}
What is the LLVM way to extract values from an array of unions?
Unions are not directly supported and this seems to complicate things.
Background:
I am calling a function returned by the JIT execution engine and passing it one argument, namely the base address of an array of unions which contains the arguments.
The data structure is set up from C like:
std::array<union{int,float*}> arguments(5);
The sequence of occurrences of int and float* is encoded in a vector<llvm::Type*>:
i32
i32
float*
float*
float*
Right now I am trying this (this is the jitted function):
define void @main([8 x i8]* %arg_ptr) {
entrypoint:
%0 = getelementptr [8 x i8]* %arg_ptr, i32 0
%1 = getelementptr [8 x i8]* %arg_ptr, i32 1
%2 = getelementptr [8 x i8]* %arg_ptr, i32 2
%3 = getelementptr [8 x i8]* %arg_ptr, i32 3
%4 = getelementptr [8 x i8]* %arg_ptr, i32 4
}
First of all, is the function's signature correct (assuming the pointer size is 8 bytes)?
How do I get the first i32 out of the [8 x i8] stored in %0?
Do I need to cast the array [8 x i8] first to a pointer i32*,
then create another GEP to its first element?
Note that LLVM IR does not really have unions. What happens in practice is that Clang (which is aware of the target triple, so the details are platform/ABI-specific) will create a single-element struct big enough to contain your union and act on that. Here's some C code:
typedef union {
double dnum;
int inum;
float* fptr;
} my_union;
int bar(my_union* mu) {
return mu[4].inum;
}
Converting this to LLVM IR with clang (using the default target on an x86-64 machine), we get this (optimized code, to reduce clutter):
%union.my_union = type { double }
define i32 @bar(%union.my_union* nocapture readonly %mu) #0 {
entry:
%arrayidx = getelementptr inbounds %union.my_union* %mu, i64 4
%inum = bitcast %union.my_union* %arrayidx to i32*
%0 = load i32* %inum, align 4
ret i32 %0
}
A few things to note here, with answers to some of the sub-questions embedded in your question:
The union is replaced by the C type struct {double} because that's large enough to contain all union members, and it also provides the correct alignment constraints. LLVM knows absolutely nothing about unions. From this point on, it acts on a struct aggregate.
Access to array members in LLVM is done with a GEP which gets two numeric indices. For an in-depth explanation of why it works this way, see http://llvm.org/docs/GetElementPtr.html
Once you have the member, you just load the value from it. Clang knows how members are laid out within the union, so it bitcasts the member to i32* and loads an i32 directly from it.
Sutter says this:
"In the low-level efficiency tradition of C and C++ alike, the
compiler is often not required to initialize variables unless you do
it explicitly (e.g., local variables, forgotten members omitted from
constructor initializer lists)"
I have always wondered why the compiler doesn't initialize primitives like int32 and float to 0. What is the performance hit if the compiler initializes it? It should be better than incorrect code.
This argument is incomplete, actually. Uninitialized variables exist for two reasons: efficiency and the lack of a suitable default.
1) Efficiency
It is, mostly, a leftover from the old days, when C compilers were simply C-to-assembly translators and performed no optimization whatsoever.
These days we have smart compilers and Dead Store Elimination which in most cases will eliminate redundant stores. Demo:
int foo(int a) {
int r = 0;
r = a + 3;
return r;
}
Is transformed into:
define i32 @foo(i32 %a) nounwind uwtable readnone {
%1 = add nsw i32 %a, 3
ret i32 %1
}
Still, there are cases where even a smart compiler cannot eliminate the redundant store, and this may have an impact. In the case of a large array that is later initialized piecemeal, the compiler may not realize that all values will end up being initialized and thus will not remove the redundant writes:
int foo(int a) {
int* r = new int[10]();
for (unsigned i = 0; i <= a; ++i) {
r[i] = i;
}
return r[a % 2];
}
Note the call to memset in the following (which I forced by suffixing the new call with (), requesting value initialization). It was not eliminated even though the zeros are unneeded.
define i32 @_Z3fooi(i32 %a) uwtable {
%1 = tail call noalias i8* @_Znam(i64 40)
%2 = bitcast i8* %1 to i32*
tail call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 40, i32 4, i1 false)
br label %3
; <label>:3 ; preds = %3, %0
%i.01 = phi i32 [ 0, %0 ], [ %6, %3 ]
%4 = zext i32 %i.01 to i64
%5 = getelementptr inbounds i32* %2, i64 %4
store i32 %i.01, i32* %5, align 4, !tbaa !0
%6 = add i32 %i.01, 1
%7 = icmp ugt i32 %6, %a
br i1 %7, label %8, label %3
; <label>:8 ; preds = %3
%9 = srem i32 %a, 2
%10 = sext i32 %9 to i64
%11 = getelementptr inbounds i32* %2, i64 %10
%12 = load i32* %11, align 4, !tbaa !0
ret i32 %12
}
2) Default ?
The other issue is the lack of a suitable value. While a float could perfectly well be initialized to NaN, what of integers? There is no integer value that represents the absence of a value, none at all! 0 is one candidate (among others), but one could argue it's one of the worst candidates: it's a very likely number, and thus likely has a specific meaning for the use case at hand; are you sure you are comfortable with this meaning being the default?
Food for thought
Finally, there is one neat advantage of uninitialized variables: they are detectable. The compiler may issue warnings (if it's smart enough), and Valgrind will raise errors. This makes logical issues detectable, and only what is detected can be corrected.
Of course a sentinel value, such as NaN, would be as useful. Unfortunately... there is none for integers.
There are two ways in which initialisation might impact performance.
First, initialising a variable takes time. Granted, for a single variable it's probably negligible, but as others have suggested, it can add up with large numbers of variables, arrays, etc.
Second, who's to say that zero is a reasonable default? For every variable for which zero is a useful default, there's probably another one for which it isn't. In that case, if you do initialise to zero you then incur further overhead re-initialising the variable to whatever value you actually want. You essentially pay the initialisation overhead twice, rather than once if the default initialisation does not occur. Note that this is true no matter what you choose as a default value, zero or otherwise.
Given the overhead exists, it's typically more efficient to not initialise and to let the compiler catch any references to uninitialised variables.
Basically, a variable references a place in memory which can be modified to hold data. For an uninitialized variable, all the program needs to know is where this place is, and the compiler usually figures this out ahead of time, so no instructions are required. But when you want it to be initialized (to, say, 0), the program needs to use an extra instruction to do so.
One idea might be to zero out the entire heap while the program is starting, with memset, then initialize all of the static stuff, but this isn't needed for anything that is set dynamically before it's read. This would also be a problem for stack-based functions which would need to zero out their stack frame every time a function is called. In short, it's much more efficient to allow variables to default to undefined, particularly when the stack is frequently being overwritten with newly called functions.
Compile with -Wmaybe-uninitialized and find out. Those are the only places where the compiler would not be able to optimize out the primitive initialization.
As for the heap ...
I have some C++ source code which I parse using clang, producing LLVM bitcode. From this point I want to process the file myself...
However, I encountered a problem. Consider the following scenario:
- I create a class with a nontrivial destructor or copy constructor.
- I define a function, where an object of this class is passed as a parameter, by value (no reference or pointer).
In the produced bitcode, I get a pointer instead. For classes without the destructor, the parameter is annotated as 'byval', but it is not so in this case.
As a result, I cannot distinguish if the parameter is passed by value, or really by a pointer.
Consider the following example:
Input file - cpass.cpp:
class C {
public:
int x;
~C() {}
};
void set(C val, int x) {val.x=x;};
void set(C *ptr, int x) {ptr->x=x;}
Compilation command line:
clang++ -c cpass.cpp -emit-llvm -o cpass.bc; llvm-dis cpass.bc
Produced output file (cpass.ll):
; ModuleID = 'cpass.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
%class.C = type { i32 }
define void @_Z3set1Ci(%class.C* %val, i32 %x) nounwind {
%1 = alloca i32, align 4
store i32 %x, i32* %1, align 4
%2 = load i32* %1, align 4
%3 = getelementptr inbounds %class.C* %val, i32 0, i32 0
store i32 %2, i32* %3, align 4
ret void
}
define void @_Z3setP1Ci(%class.C* %ptr, i32 %x) nounwind {
%1 = alloca %class.C*, align 8
%2 = alloca i32, align 4
store %class.C* %ptr, %class.C** %1, align 8
store i32 %x, i32* %2, align 4
%3 = load i32* %2, align 4
%4 = load %class.C** %1, align 8
%5 = getelementptr inbounds %class.C* %4, i32 0, i32 0
store i32 %3, i32* %5, align 4
ret void
}
As you can see, the parameters of both set functions look exactly the same. So how can I tell that the first function was meant to take the parameter by value, instead of a pointer?
One solution could be to somehow parse the mangled function name, but it may not always be viable. What if somebody puts extern "C" before the function?
Is there a way to tell clang to keep the byval annotation, or to produce an extra annotation for each function parameter passed by a value?
Anton Korobeynikov suggests that I should dig into clang's LLVM IR emission. Unfortunately I know almost nothing about clang internals, and the documentation is rather sparse. The clang Internals Manual does not talk about IR emission. So I don't really know how to start or where to go to get the problem solved, hopefully without actually going through all of clang's source code. Any pointers? Hints? Further reading?
In response to Anton Korobeynikov:
I know more or less what the C++ ABI looks like with respect to parameter passing. I found some good reading here: http://agner.org/optimize/calling_conventions.pdf. But this is very platform-dependent! This approach might not be feasible on different architectures or in some special circumstances.
In my case, for example, the function is going to be run on a different device than where it is being called from. The two devices don't share memory, so they don't even share the stack. Unless the user is passing a pointer (in which case we assume he knows what he is doing), an object should always be passed within the function-parameters message. If it has a nontrivial copy constructor, it should be executed by the caller, but the object should be created in the parameter area as well.
So, what I would like to do is to somehow override the ABI in clang, without too much intrusion into their source code. Or maybe add some additional annotation, which would be ignored in a normal compilation pipeline, but I could detect when parsing the .bc/.ll file. Or somehow differently reconstruct the function signature.
Unfortunately, "byval" is not just an "annotation": it's a parameter attribute which means a lot to optimizers and backends. Basically, the rules for how to pass small structs/classes with and without non-trivial member functions are governed by the platform C++ ABI, so you cannot just always use byval here.
In fact, byval here is just the result of a minor optimization at the frontend level. When you're passing stuff by value, a temporary object has to be constructed on the stack (via the copy ctor). When you have a class which is POD-like, clang can deduce that the copy ctor will be trivial and will optimize the ctor/dtor pair out, passing just the contents.
For non-trivial classes (like in your case) clang cannot perform such an optimization and has to call both the ctor and the dtor. Thus you're seeing a pointer to the temporary object being created.
Try to call your set() functions and you'll see what's going on there.