Instrumenting C/C++ code using LLVM

I want to write an LLVM pass to instrument every memory access.
Here is what I am trying to do.
Given any C/C++ program (like the one below), I am trying to insert calls to some function before and after every instruction that reads from or writes to memory. For example, consider the C++ program below (Account.cpp):
#include <stdio.h>
class Account {
  int balance;
public:
  Account(int b)
  {
    balance = b;
  }
  ~Account() { }
  int read()
  {
    int r;
    r = balance;
    return r;
  }
  void deposit(int n)
  {
    balance = balance + n;
  }
  void withdraw(int n)
  {
    int r = read();
    balance = r - n;
  }
};
int main()
{
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}
So after instrumentation, my program should look like this:
#include <stdio.h>
class Account
{
  int balance;
public:
  Account(int b)
  {
    balance = b;
  }
  ~Account() { }
  int read()
  {
    int r;
    foo();
    r = balance;
    foo();
    return r;
  }
  void deposit(int n)
  {
    foo();
    balance = balance + n;
    foo();
  }
  void withdraw(int n)
  {
    foo();
    int r = read();
    foo();
    foo();
    balance = r - n;
    foo();
  }
};
int main()
{
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}
where foo() may be any function, such as one that gets the current system time or increments a counter, and so on.
Please give me examples (source code, tutorials, etc.) and steps on how to run it. I have read the tutorial on writing an LLVM pass at http://llvm.org/docs/WritingAnLLVMPass.html, but couldn't figure out how to write a pass for the above problem.

I'm not very familiar with LLVM, but I am a bit more familiar with GCC (and its plugin machinery), since I am the main author of GCC MELT (a high-level domain-specific language to extend GCC, which, by the way, you could use for your problem). So I will try to answer in general terms.
You should first know why you want to adapt a compiler (or a static analyzer). It is a worthwhile goal, but it does have drawbacks (in particular, w.r.t. redefining some operators or other constructs in your C++ program).
The main point when extending a compiler (be it GCC or LLVM or something else) is that you very probably have to handle all of its internal representation (and you probably cannot skip parts of it, unless you have a very narrowly defined problem). For GCC that means handling the more than 100 kinds of Tree-s and nearly 20 kinds of Gimple-s: in the GCC middle end, the tree-s represent the operands and declarations, and the gimple-s represent the instructions.
The advantage of this approach is that once you've done that, your extension should be able to handle any software the compiler accepts. The drawback is the complexity of compilers' internal representations (which is explained by the complexity of the definitions of the C and C++ source languages the compilers accept, by the complexity of the target machine code they generate, and by the increasing distance between source and target languages).
So hacking a general compiler (be it GCC or LLVM), or a static analyzer (like Frama-C), is quite a big task (more than a month of work, not a few days). To deal only with tiny C++ programs like the one you are showing, it is not worth it. But it is definitely worth the effort if you plan to deal with large source software bases.
Regards

Try something like this (you need to fill in the blanks and make the iterator loop work despite the fact that instructions are being inserted):
class ThePass : public llvm::BasicBlockPass {
public:
  static char ID;                  // the legacy pass machinery needs a unique ID
  ThePass() : BasicBlockPass(ID) {}
  virtual bool runOnBasicBlock(llvm::BasicBlock &bb);
};
char ThePass::ID = 0;

bool ThePass::runOnBasicBlock(BasicBlock &bb) {
  bool retval = false;
  for (BasicBlock::iterator bbit = bb.begin(), bbie = bb.end(); bbit != bbie;
       ++bbit) { // Make loop work given updates
    Instruction *i = &*bbit;
    CallInst *beforeCall = // INSERT THIS
    beforeCall->insertBefore(i);
    if (!i->isTerminator()) {
      CallInst *afterCall = // INSERT THIS
      afterCall->insertAfter(i);
    }
  }
  return retval;
}
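One hedged way to fill in those blanks, sketched against the legacy pass API of that era (BasicBlockPass has since been removed from LLVM; header paths and the return type of getOrInsertFunction vary by version, and foo here is just the instrumentation function from the question):

#include "llvm/IR/Instructions.h"  // older releases: "llvm/Instructions.h"
#include "llvm/IR/Module.h"
using namespace llvm;

bool ThePass::runOnBasicBlock(BasicBlock &bb) {
  bool changed = false;
  Module *mod = bb.getParent()->getParent();
  // Declare (or reuse) "void foo()" in the module.
  FunctionType *fooTy =
      FunctionType::get(Type::getVoidTy(bb.getContext()), false);
  Constant *foo = mod->getOrInsertFunction("foo", fooTy);
  for (BasicBlock::iterator it = bb.begin(); it != bb.end(); ) {
    Instruction *inst = &*it++;  // advance first, so the inserted calls are skipped
    // Instrument only the instructions that touch memory, per the question.
    if (isa<LoadInst>(inst) || isa<StoreInst>(inst)) {
      CallInst::Create(foo, "", inst);            // call foo() before the access
      CallInst::Create(foo)->insertAfter(inst);   // loads/stores never terminate a block
      changed = true;
    }
  }
  return changed;
}

Register the pass as the WritingAnLLVMPass document describes, then run it with opt -load over bitcode produced by clang -emit-llvm.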
Hope this helps!

Related

Polymorphism Without Virtual Functions

I am trying to optimize the run time of my code, and I was told that removing unnecessary virtual functions was the way to go. With that in mind, I would still like to use inheritance to avoid unnecessary code bloat. I thought that if I simply redefined the functions I wanted and initialized different variable values, I could get by with just downcasting to my derived class whenever I needed derived-class-specific behavior.
So I need a variable that identifies the type of class I am dealing with, so that I can use a switch statement to downcast properly. I am using the following code to test this approach:
Classes.h
#pragma once
class A {
public:
  int type;
  static int GetType() { return 0; }
  A() : type(0) {}
};
class B : public A {
public:
  static int GetType() { return 1; }
  B() { type = 1; }
};
Main.cpp
#include "Classes.h"
#include <iostream>
using std::cout;
using std::endl;
using std::getchar;
int main() {
A *a = new B();
cout << a->GetType() << endl;
cout << a->type;
getchar();
return 0;
}
I get the expected output: 0 1
Question 1: Is there a better way to store type so that I do not need to waste memory for each instance of the object created (like the static keyword would allow)?
Question 2: Would it be more effective to put the switch statement in the function to decide what it should do based on the type value, or to switch, downcast, and then use a derived-class-specific function?
Question 3: Is there a better way to handle this that I am entirely overlooking that does not use virtual functions? For example, should I just create an entirely new class that has many of the same variables?
Question 1: Is there a better way to store type so that I do not need to waste memory for each instance of the object created (like the static keyword would allow)?
There's typeid(), already enabled with RTTI; there's no need to implement that yourself in an error-prone and unreliable way.
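A minimal sketch (names are mine; note that typeid reports the dynamic type only for polymorphic classes, so this assumes at least one virtual member, such as a virtual destructor):

#include <iostream>
#include <typeinfo>

struct A { virtual ~A() {} };  // polymorphic, so typeid sees the dynamic type
struct B : A {};

int main() {
  A* a = new B();
  std::cout << (typeid(*a) == typeid(B)) << '\n';  // prints 1
  std::cout << typeid(*a).name() << '\n';          // implementation-defined name for B
  delete a;
  return 0;
}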
Question 2: Would it be more effective to put the switch statement in the function to decide what it should do based on the type value, or to switch, downcast, and then use a derived-class-specific function?
Certainly not! That's a strong indicator of a badly designed class inheritance hierarchy.
Question 3: Is there a better way to handle this that I am entirely overlooking that does not use virtual functions? For example, should I just create an entirely new class that has many of the same variables?
The typical way to realize polymorphism without usage of virtual functions is the CRTP (aka Static Polymorphism).
That's a widely used technique to avoid the overhead of virtual function tables when you don't really need them and just want to adapt it to your specific needs (e.g. on small targets, where low memory overhead is crucial).
Given your example¹, that would be something like this:
template<class Derived>
class A {
protected:
  int InternalGetType() { return 0; }
public:
  int GetType() { return static_cast<Derived*>(this)->InternalGetType(); }
};
class B : public A<B> {
  friend class A<B>;
protected:
  int InternalGetType() { return 1; }
};
All binding will be done at compile time, and there's zero runtime overhead.
Binding is also guaranteed to be safe by the static_cast, which will produce compiler errors if B doesn't actually inherit from A<B>.
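A hypothetical usage of the classes above:

#include <iostream>

int main() {
  B b;
  std::cout << b.GetType() << '\n';  // prints 1; the dispatch is resolved at compile time
  return 0;
}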
Note (almost disclaimer):
Don't use that pattern as a golden hammer! It has its drawbacks too:
It's harder to provide abstract interfaces, and without prior type-trait checks or concepts, you'll confuse your clients with hard-to-read compiler error messages at template instantiation.
It's not applicable for plugin-like architecture models, where you really want late binding and modules loaded at runtime.
If you don't have really heavy restrictions regarding executable code size and performance, it's not worth the extra work necessary. For most systems you can simply neglect the dispatch overhead of virtual function definitions.
¹ The semantics of GetType() isn't necessarily the best one, but well ...
Go ahead and use virtual functions, but make sure each of those functions is doing enough work that the overhead of an indirect call is insignificant. That shouldn't be very hard to do: a virtual call is pretty fast; it wouldn't be part of C++ if it weren't.
Doing your own pointer casting is likely to be even slower, unless you can reuse that pointer a significant number of times.
To make this a little more concrete, here's some code:
#include <cstdio>  // for getchar

class A {
public:
  int type;
  int buffer[1000000];
  A() : type(0) {}
  virtual void VirtualIncrease(int n) { buffer[n] += 1; }
  void NonVirtualIncrease(int n) { buffer[n] += 1; }
  virtual void IncreaseAll() { for (int i = 0; i < 1000000; ++i) buffer[i] += 1; }
};
class B : public A {
public:
  B() { type = 1; }
  virtual void VirtualIncrease(int n) { buffer[n] += 2; }
  void NonVirtualIncrease(int n) { buffer[n] += 2; }
  virtual void IncreaseAll() { for (int i = 0; i < 1000000; ++i) buffer[i] += 2; }
};
int main() {
  A *a = new B();

  // easy way with virtual
  for (int i = 0; i < 1000000; ++i)
    a->VirtualIncrease(i);

  // hard way with switch
  for (int i = 0; i < 1000000; ++i) {
    switch (a->type) {
    case 0:
      a->NonVirtualIncrease(i);
      break;
    case 1:
      static_cast<B*>(a)->NonVirtualIncrease(i);
      break;
    }
  }

  // fast way
  a->IncreaseAll();
  getchar();
  return 0;
}
The code that switches using a type code is not only much harder to read, it's probably slower as well. Doing more work inside a virtual function ends up being both cleanest and fastest.

Jump as an alternative to RTTI

I am learning how C++ is compiled into assembly, and I found how exceptions work under the hood very interesting. If it's okay to have more than one execution path for exceptions, why not for normal functions?
For example, let's say you have a function that can return a pointer to class A or something derived from A. The way you're supposed to do it is with RTTI.
But why not, instead, have the called function, after computing the return value, jump back into the caller at the specific location that matches the return type? Just as with exceptions: execution can flow normally or, if something throws, land in one of your catch handlers.
Here is my code:
class A
{
public:
  virtual int GetValue() { return 0; }
};
class B : public A
{
public:
  int VarB;
  int GetValue() override { return VarB; }
};
class C : public A
{
public:
  int VarC;
  int GetValue() override { return VarC; }
};
A* Foo(int i)
{
  if (i == 1) return new B;
  if (i == 2) return new C;
  return new A;
}
#include <cassert>

int main()
{
  A* a = Foo(2);
  if (B* b = dynamic_cast<B*>(a))
  {
    b->VarB = 1;
  }
  else if (C* c = dynamic_cast<C*>(a)) // Line 36
  {
    c->VarC = 2;
  }
  else
  {
    assert(a->GetValue() == 0);
  }
}
So instead of doing it with RTTI and dynamic_cast checks, why not have the Foo function just jump to the appropriate location in main? In this case, since Foo returns a pointer to C, Foo should jump to line 36 directly.
What's wrong with this? Why aren't people doing this? Is there a performance reason? I would think this would be cheaper than RTTI.
Or is this just a language limitation, regardless of whether it's a good idea or not?
First of all, there are a million different ways of defining a language. C++ is defined as it is defined; nice or not, it really does not matter. If you want to improve the language, you are free to write a proposal to the C++ committee. They will review it and maybe include it in a future standard. Sometimes this happens.
Second, although exceptions are dispatched under the hood, there is no strong reason to think that this is more efficient than handwritten code that uses RTTI. Exception dispatch still requires CPU cycles; there is no miracle there. The real difference is that with RTTI you write the dispatch code yourself, while the exception dispatch code is generated for you by the compiler.
You may want to call your function 10000 times and find out which runs faster: RTTI-based code or exception dispatch.
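To make that measurement concrete, here is a hedged micro-benchmark sketch; the raise() idiom and all names are mine, not from the question, and the exception side only works because each override throws its own dynamic type:

#include <chrono>
#include <cstdio>

struct A {
  virtual ~A() {}
  virtual void raise() const { throw *this; }   // thrown static type: A
};
struct C : A {
  void raise() const override { throw *this; }  // thrown static type: C
};

int main() {
  const int N = 10000;
  const A* a = new C;
  using clk = std::chrono::steady_clock;

  // RTTI-based dispatch
  int hits = 0;
  clk::time_point t0 = clk::now();
  for (int i = 0; i < N; ++i)
    if (dynamic_cast<const C*>(a)) ++hits;
  clk::time_point t1 = clk::now();

  // Exception-based dispatch: the catch clauses play the role of the
  // type-specific "return targets" the question asks about.
  int hits2 = 0;
  for (int i = 0; i < N; ++i) {
    try { a->raise(); }
    catch (const C&) { ++hits2; }
    catch (const A&) {}
  }
  clk::time_point t2 = clk::now();

  std::printf("dynamic_cast: %ld us (%d hits)\n",
              (long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(), hits);
  std::printf("exceptions:   %ld us (%d hits)\n",
              (long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(), hits2);
  delete a;
  return 0;
}

On typical implementations, the exception version is dramatically slower per iteration, since the unwinding machinery is optimized for the non-throwing path.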

cannot resize c++ std::vector that's a member variable of a class

Code (simplified version):
(Part of) the class definition:
struct foo {
  std::vector<int> data;
  foo(int a = 0) : data(a + 1, 0) {}
  void resize(int a) {
    data.resize(a + 1, 0);
  }
};
The a+1 part is because I want the data to be 1-indexed to simplify some operations.
In global scope:
int k;
foo bar;
In the main function:
std::cin >> k;
bar.resize(k);
Later in the main function, there is a call to another member function (in foo) that accesses the data, causing a segmentation fault (SIGSEGV).
After debugging, I found that data.size() returns 0, which is very unexpected.
After a very long session of debugging, I feel very confident that the problem is with the resizing, which shouldn't cause any problems (it's from the standard library, after all!).
P.S. Don't accuse me of putting anything in global scope or giving public access to class members. I'm not writing any "real" program; I'm just practicing for a programming competition.
After a very long session of debugging, I feel very confident that the problem is with the resize
It is almost certain that:
The issue doesn't have anything to do with resize().
You have a memory-related bug somewhere (double delete, uninitialized/dangling pointer, buffer overrun, etc.).
The thing with memory-related bugs is that they can be completely symptomless until well after the buggy code has done the damage.
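For example, here is a contrived sketch (types and names are mine) of how a buffer overrun in one place can silently corrupt a vector that only misbehaves much later:

#include <cstring>
#include <iostream>
#include <vector>

struct holder {
  int buf[4];
  std::vector<int> data;  // sits right after buf inside this object
};

int main() {
  holder h;
  h.data.assign(10, 42);
  // Buffer overrun: writes past the end of buf and may trample the vector's
  // internal pointers. Undefined behavior, yet nothing crashes right here...
  std::memset(h.buf, 0, 8 * sizeof(int));
  // ...the bogus size() or a crash shows up later, far from the real bug.
  std::cout << h.data.size() << '\n';
  return 0;
}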
My recommendation would be to run your program under valgrind (or at least show us an SSCCE that doesn't work for you).
The following works fine for me:
#include <vector>
#include <cstdlib>

struct foo {
  std::vector<int> data;
  explicit foo(int a = 0) : data(a + 1, 0) {}
  void resize(int a) {
    data.resize(a + 1, 0);
  }
};

int main() {
  foo test_foo(1);
  for (size_t i = 0; i < 1000; ++i) {
    int a = std::rand() % 65536;
    test_foo.resize(a);
    if (test_foo.data.size() != a + 1)
      return 1;
  }
  return 0;
}

build a function at runtime in C++ from functions built at compile time

I am creating a scripting language that first parses the code and then copies functions (to execute the code) into one buffer/memory region, in the order of the parsed code.
Is there a way to copy a function's binary code to a buffer and then execute the whole buffer?
I need to execute all the functions at once to get better performance.
To understand my question best, I want to do something like this:
#include <vector>
using namespace std;

class RuntimeFunction; // The buffer for my runtime function

enum ByteCodeType {
  Return,
  None
};

struct ByteCode {
  ByteCodeType type;
};

void ReturnRuntime() {
  return;
}

RuntimeFunction GetExecutableData(vector<ByteCode> function) {
  RuntimeFunction runtimeFunction = RuntimeFunction(sizeof(int)); // Returns int
  for (size_t i = 0; i < function.size(); i++) {
    if (function[i].type == Return) {
      runtimeFunction.Append(&ReturnRuntime);
    } // etc.
  }
  return runtimeFunction;
}

void* CallFunc(RuntimeFunction runtimeFunction, vector<void*> custom_parameters) {
  for (int i = (int)custom_parameters.size() - 1; i >= 0; --i) { // push parameters in reverse
    __asm {
      push custom_parameters[i]
    }
  }
  __asm {
    call runtimeFunction.pHandle
  }
}
There are a number of ways of doing this, depending on how deep you want to get into generating code at runtime, but one relatively simple way is with threaded code and a threaded-code interpreter.
Basically, threaded code consists of an array of function pointers, and the interpreter goes through the array calling each pointed-at function. The tricky part is that you generally have each function return the address of the array element containing a pointer to the next function to call, which allows you to implement things like branches and calls without any effort in the interpreter.
Usually you use something like:
typedef void *(*tc_func_t)(void *, runtime_state_t *);

void *interp(tc_func_t **entry, runtime_state_t *state) {
  tc_func_t *pc = *entry;
  while (pc && *pc)                          // the null pointer at the end stops the loop
    pc = (tc_func_t *)(*pc)(pc + 1, state);
  return entry + 1;
}
That's the entire interpreter. runtime_state_t is some kind of data structure containing some runtime state (usually one or more stacks). You call it by creating an array of tc_func_t function pointers, filling it in with function pointers (and possibly data), ending with a null pointer, and then calling interp with the address of a variable containing the start of the array. So you might have something like:
#include <iostream>
#include <stdint.h>

void *add(tc_func_t *pc, runtime_state_t *state) {
  int v1 = state->data.pop();
  int v2 = state->data.pop();
  state->data.push(v1 + v2);
  return pc;
}
void *push_int(tc_func_t *pc, runtime_state_t *state) {
  state->data.push((int)(intptr_t)*pc);  // the operand lives in the next array slot
  return pc + 1;                         // skip over the operand
}
void *print(tc_func_t *pc, runtime_state_t *state) {
  std::cout << state->data.pop();
  return pc;
}
tc_func_t program[] = {
  (tc_func_t)push_int,
  (tc_func_t)2,
  (tc_func_t)push_int,
  (tc_func_t)2,
  (tc_func_t)add,
  (tc_func_t)print,
  0
};
void run_program() {
  runtime_state_t state;
  tc_func_t *entry = program;
  interp(&entry, &state);
}
Calling run_program runs the little program that adds 2+2 and prints the result.
Now you may be confused by the slightly odd calling setup for interp, with an extra level of indirection on the entry argument. That's so you can use interp itself as a function in a threaded-code array, followed by a pointer to another array, and it will do a threaded-code call, as sketched below.
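For instance, building on the hypothetical program array above, a nested call might look like this (the sub-array pointer is type-punned into a tc_func_t slot the same way push_int's operand is, which works in practice on common platforms):

// A sub-program: push 3, print it, stop at the null terminator.
tc_func_t subroutine[] = {
  (tc_func_t)push_int, (tc_func_t)3,
  (tc_func_t)print,
  0
};

// The caller "calls" the sub-program by listing interp itself followed by
// a pointer to the sub-array. interp runs it to completion, then returns
// the address of the slot after that pointer, so execution resumes here.
tc_func_t caller[] = {
  (tc_func_t)interp, (tc_func_t)subroutine,
  (tc_func_t)push_int, (tc_func_t)2,
  (tc_func_t)print,
  0
};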
edit
The biggest problem with threaded code like this is performance: the threaded-code interpreter is extremely unfriendly to branch predictors, so performance is pretty much capped at one threaded instruction call per branch-misprediction recovery time.
If you want more performance, you pretty much have to go to full-on runtime code generation. LLVM provides a good, machine-independent interface to doing that, along with pretty good optimizers for common platforms that will produce pretty good code at runtime.

Explanation for D vs. C++ performance difference

Simple example in D:
import std.stdio, std.conv, core.memory;

class Foo {
  int x;
  this(int _x) { x = _x; }
}

void main(string[] args) {
  GC.disable();
  int n = to!int(args[1]);
  Foo[] m = new Foo[n];
  for (int i = 0; i < n; i++) {
    m[i] = new Foo(i);
  }
}
C++ code:
#include <cstdlib>
using namespace std;

class Foo {
public:
  int x;
  Foo(int _x);
};

Foo::Foo(int _x) {
  x = _x;
}

int main(int argc, char** argv) {
  int n = atoi(argv[1]);
  Foo** gx = new Foo*[n];
  for (int i = 0; i < n; i++) {
    gx[i] = new Foo(i);
  }
  return 0;
}
No compilation flags were used.
Compiling and running:
>dmd td.d
>time ./td 10000000
real 0m2.544s
The analogous example in C++ (gcc), running:
>time ./tc 10000000
real 0m0.523s
Why? Such a simple example, and such a big difference: 2.54s vs 0.52s.
You're mainly measuring three differences:
The difference between the code generated by gcc and dmd.
The extra time D takes to allocate using the GC.
The extra time D takes to allocate a class.
Now, you might think that point 2 is invalid because you used GC.disable(), but this only means that the GC won't collect as it normally does. It does not make the GC disappear entirely and automatically redirect all memory allocations to C's malloc. It still must do most of what it normally does to ensure that the GC knows about the allocated memory, and all of that takes time. Normally, this is a relatively insignificant part of program execution (even ignoring the benefits GCs give). However, your benchmark makes it the entirety of the program, which exaggerates the effect.
Therefore, I suggest you consider two changes to your approach:
Either switch to using gdc to compare against gcc, or switch to dmc to compare against dmd.
Make the programs more equivalent. Either have both D and C++ allocate structs on the heap or, at the very least, make it so that D is allocating without touching the GC. If you're optimizing a program for maximum speed, you'd be using structs and C's malloc anyway, regardless of language.
I'd even recommend a third change: since you're interested in maximum performance, you ought to try to come up with a better program entirely. Why not switch to structs and lay them out contiguously in memory? This would make allocation (which is, essentially, the entire program) as fast as possible.
Running your above code compiled with dmd and dmc on my machine results in the following times:
DMC 8.42n (no flags) : ~880ms
DMD 2.062 (no flags) : ~1300ms
Modifying the code to the following:
C++ code:
#include <cstdlib>

struct Foo {
  int x;
};

int main(int argc, char** argv) {
  int n = atoi(argv[1]);
  Foo* gx = (Foo*) malloc(n * sizeof(Foo));
  for (int i = 0; i < n; i++) {
    gx[i].x = i;
  }
  free(gx);
  return 0;
}
D code:
import std.conv;

struct Foo {
  int x;
}

void main(string[] args) {
  int n = to!int(args[1]);
  Foo[] m = new Foo[](n);
  foreach (i, ref e; m) {
    e.x = cast(int)i;  // the foreach index is size_t
  }
}
Running my code with DMD and DMC results in the following times:
DMC 8.42n (no flags) : ~95ms +- 20ms
DMD 2.062 (no flags) : ~95ms +- 20ms
Essentially identical (I'd have to start using some statistics to give you a better idea of which one is truly faster, but at this scale it's irrelevant). Notice that this approach is much, much faster than the naive one, and D is equally capable of using this strategy. In this case the run-time difference is negligible, yet we retain the benefits of using a GC, and there are definitely far fewer things that could go wrong in writing the D code (notice how your program failed to delete all of its allocations?).
Furthermore, if you absolutely wanted to, D allows you to use C's standard library via import std.c.stdlib; this would let you truly bypass the GC and achieve maximum performance using C's malloc, if necessary. In this case it's not necessary, so I erred on the side of safer, more readable code.
Try this one:
import std.stdio, std.conv, core.memory;

class Foo {
  int x = void;
  this(in int _x) { x = _x; }
}

void main(string[] args) {
  GC.disable();
  int n = to!int(args[1]);
  Foo[] m = new Foo[n];
  foreach (i; 0 .. n) {
    m[i] = new Foo(i);
  }
}