LLVM Clang C++ code injection

I am a bit confused about implementing a code injection function with LLVM/Clang. I basically want to add a call to a function before a variable or a pointer is created in the source code. Example:
#include <iostream>
int main() {
int a;
return 0;
}
to
#include <iostream>
int main() {
foo();
int a;
return 0;
}
I read the LLVM docs to find an answer but couldn't find one. Please help me.
Thank you in advance.

The first step is to decide whether you want to do this in Clang or in LLVM. Although they are "connected", they are not the same thing. In Clang you can do it at the AST level, in which case you need to write a recursive AST visitor, use it to identify the function definitions that you want to instrument, and from there insert the AST for the call to your foo function. This will only work for functions whose definitions the compiler actually sees (i.e. code you compile yourself).
There is information on how to write such a visitor here:
https://clang.llvm.org/docs/RAVFrontendAction.html
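For the AST route, a minimal sketch of such a visitor might look roughly like this (it assumes the Rewriter-based FrontendAction setup from the tutorial above; the class name and the choice to insert right after the opening brace are purely illustrative):
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Rewrite/Core/Rewriter.h"
// Sketch only: visit every function definition and splice "foo();" in right
// after the opening brace of its body, skipping foo itself.
class InsertCallVisitor : public clang::RecursiveASTVisitor<InsertCallVisitor> {
public:
  explicit InsertCallVisitor(clang::Rewriter &R) : TheRewriter(R) {}
  bool VisitFunctionDecl(clang::FunctionDecl *FD) {
    if (FD->hasBody() && FD->getNameAsString() != "foo") {
      if (auto *Body = llvm::dyn_cast<clang::CompoundStmt>(FD->getBody()))
        TheRewriter.InsertTextAfterToken(Body->getLBracLoc(), "\n  foo();");
    }
    return true;
  }
private:
  clang::Rewriter &TheRewriter;
};
You would drive this from an ASTConsumer and write out the Rewriter's buffer at the end, exactly as the tutorial does; note that this rewrites the source text rather than the AST that is actually handed to code generation.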
In LLVM you could write a function pass that inserts code into each function. This obviously works for ANY function, regardless of source language.
How to write an LLVM pass:
http://llvm.org/docs/WritingAnLLVMPass.html
However, while this may seem trivial at first, there are some interesting quirks. In an LLVM function, the alloca instructions should come first, so you would have to "skip" past those instructions. There may be functions that shouldn't be instrumented - for example, if your function foo prints something using cout << something;, it would be a rather terrible idea to insert foo into the operator<<(ostream&, ...) type functions... ;) And you obviously don't want to instrument foo itself, or any functions it calls.
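To make that concrete, here is a rough sketch of such a pass (written against a reasonably recent LLVM with the legacy pass manager; the pass name is made up, and the "don't instrument foo or its callees" rule is reduced to a simple name check):
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
using namespace llvm;

namespace {
struct InsertFoo : public FunctionPass {
  static char ID;
  InsertFoo() : FunctionPass(ID) {}
  bool runOnFunction(Function &F) override {
    if (F.isDeclaration() || F.getName() == "foo")   // skip foo itself
      return false;
    Module *M = F.getParent();
    FunctionCallee Foo =
        M->getOrInsertFunction("foo", Type::getVoidTy(M->getContext()));
    // Skip past the leading alloca instructions in the entry block.
    BasicBlock &Entry = F.getEntryBlock();
    BasicBlock::iterator IP = Entry.begin();
    while (IP != Entry.end() && isa<AllocaInst>(&*IP))
      ++IP;
    IRBuilder<> B(&Entry, IP);
    B.CreateCall(Foo);
    return true;
  }
};
} // namespace
char InsertFoo::ID = 0;
static RegisterPass<InsertFoo> X("insert-foo", "Insert a call to foo() at function entry");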
There are ways in Clang that you can determine if the source is the "main file" or some header-file - although that may not be enough in your case. It is much harder to determine "which function is this" in LLVM.

Related

Netbeans 8.2 code assistance issue with unique_ptr (not shared_ptr)

Using Netbeans 8.2 on Linux and GCC 8.1, unique_ptr::operator->() gives an erroneous warning ("Unable to resolve template based identifier {var}") and, more importantly, will not autocomplete member names.
The code compiles and runs just fine, and amazingly, autocomplete still works with no warnings if a shared_ptr is used instead. I have no idea how that is possible. Here is a problematic example, for reference:
#include <iostream>
#include <memory>
struct C { int a;};
int main () {
std::unique_ptr<C> foo (new C);
std::shared_ptr<C> bar (new C);
foo->a = 10; //Displays warning, does not auto-complete the members of C
bar->a = 20; //No warning, auto-completes members of C
return 0;
}
I've tried the following to resolve this issue with no luck:
Code Assistance > Reparse Project
Code Assistance > Clean C/C++ cache and restart IDE
Manually deleting cache in ~/.cache/netbeans/8.2
Looking through View > IDE Log for anything that may help
Setting both the C and C++ compilers to C11 and C++11 in project properties
Changing the pre-processor macro __cplusplus to both 201103L and 201402L
Creating a new Netbeans project and trying the above
A large variety of permutations of the above options in different orders
Again, everything compiles and runs just fine, it's just Code Assistance that is giving me an issue. I've run out of things that I've found in other stack overflow answers. My intuition tells me that shared_ptr working and unique_ptr not working is helpful, but I don't know enough about C++ to utilize that information. Please help me, I need to get back to work...
Edit 1
This stack overflow question, although referencing Clang and the libc++ implementation, suggests that it may be an implementation issue within GCC 8.1's libstdc++ unique_ptr.
TLDR Method 1
Adding a using pointer = {structName}* directive to your structure fixes code assistance, and the code will compile and run as intended, like so:
struct C { using pointer = C*; int a;};
TLDR Method 2
The answer referenced in my edit does in fact work for libstdc++ as well. Simply changing the return type of unique_ptr::operator->() from pointer to element_type* will fix code assistance, compile and run as expected (the same thing can be done to unique_ptr::get()). However, changing things in implementations of the standard library worries me greatly.
More Information
As someone who is relatively new to C++ and barely understands the power of template specializations, reading through unique_ptr.h was scary, but here is what I think is messing up Netbeans:
Calling unique_ptr::operator->() calls unique_ptr::get()
unique_ptr::get() calls the private implementation's (__unique_ptr_impl) pointer function __unique_ptr_impl::_M_ptr(). All these calls return the __unique_ptr_impl::pointer type.
Within the private implementation, the type pointer is defined within an even more private implementation, _Ptr. The struct _Ptr has two template definitions: one that yields a raw pointer to the initial template parameter of unique_ptr, and a second that seems to strip any reference off this template parameter and then look up its type named pointer. I think this is where Netbeans messes up.
So my understanding is when you call unique_ptr<elementType, deleterType>::operator->(), it goes to __unique_ptr_impl<elementType, deleterType>, where the internal pointer type is found by stripping elementType of any references and then getting the type named dereferenced(elementType)::pointer. So by including the directive using pointer =, Netbeans gets what it wants in finding the dereferenced(elementType)::pointer type.
Including the using directive is entirely superficial, as evidenced by the fact that things will compile without it, and by the following example:
#include <memory>
#include <iostream>
struct A{
using pointer = A*;
double memA;
A() : memA(1) {}
};
struct B {
using pointer = A*;
double memB;
B() : memB(2) {}
};
int main() {
std::unique_ptr<A> pa(new A);
std::unique_ptr<B> pb(new B);
std::cout << pa->memA << std::endl;
std::cout << pb->memB << std::endl;
};
outputs
1
2
As it should, even though struct B contains using pointer = A*. Netbeans actually tries to autocomplete pb-> to pb->memA, further evidence that Netbeans is using the logic proposed above.
This solution is contrived as all hell, but at least it works (in the wildly specific context I've used it in) without making changes to the standard library implementation. Who knows, I'm still confused about the convoluted typing system within unique_ptr.

Dead virtual function elimination

Question
Can I get clang, or perhaps some other optimizing tool shipped with LLVM, to identify unused virtual functions in a C++ program and mark them for dead code elimination? (I guess not.)
If there is no such functionality shipped with LLVM, how would one go about implementing a thing like this? What's the most appropriate layer to achieve this, and where can I find examples on which I could build this?
Thoughts
My first thought was an optimizer working on LLVM bitcode or IR. After all, a lot of optimizers are written for that representation. Simple dead code elimination is easy enough: any function which is neither called nor has its address taken and stored somewhere is dead code and can be omitted from the final binary. But a virtual function has its address taken and stored in the virtual function table of the corresponding class. In order to identify whether that function has a chance of getting called, an optimizer would not only have to identify all virtual function calls, but also identify the type hierarchy to map these virtual function calls to all possible implementations.
This makes things look quite hard to tackle at the bitcode level. It might be better to handle this somewhere closer to the front end, at a stage where more type information is available, and where calls to a virtual function might be more readily associated with implementations of these functions. Perhaps the VirtualCallChecker could serve as a starting point.
One problem is probably the fact that while it's possible to combine the bitcode of several objects into a single unit for link time optimization, one hardly ever compiles all the source code of a moderately sized project as a single translation unit. So the association between virtual function calls and implementations might have to be somehow maintained till that stage. I don't know if any kind of custom annotation is possible with LLVM; I have seen no indication of this in the language specification.
But I'm having a bit of trouble with the language specification in any case. The only references to virtual in there are the virtuality and virtualIndex properties of MDSubprogram, but so far I have found no information at all about their semantics. No documentation, nor any useful places inside the LLVM source code. I might be looking at the wrong documentation for my use case.
Cross references
eliminate unused virtual functions asked about pretty much the same thing in the context of GCC, but I'm specifically looking for an LLVM solution here. There used to be a -fvtable-gc switch in GCC, but apparently it was too buggy and got punted, and clang doesn't support it either.
Example:
struct foo {
virtual ~foo() { }
virtual int a() { return 12345001; }
virtual int b() { return 12345002; }
};
struct bar : public foo {
virtual ~bar() { }
virtual int a() { return 12345003; }
virtual int b() { return 12345004; }
};
int main(int argc, char** argv) {
foo* p = (argc & 1 ? new foo() : new bar());
int res = p->a();
delete p;
return res;
};
How can I write a tool to automatically get rid of foo::b() and bar::b() in the generated code?
clang++ -fuse-ld=gold -O3 -flto with clang 3.5.1 wasn't enough, as an objdump -d -C of the resulting executable showed.
Question focus changed
Originally I had been asking not only about how to use clang or LLVM to this effect, but possibly for third party tools to achieve the same if clang and LLVM were not up to the task. Questions asking for tools are frowned upon here, though, so by now the focus has shifted from finding a tool to writing one. I guess chances for finding one are slim in any case, since a web search revealed no hints in that direction.

Unparse the intermediate representation of C code back to C

I have a file written in the C programming language that has been preprocessed using CIL. Now there are calls to a function, say foo(), in this file. I want to modify the C code in this file so that all calls to foo() are under an #ifdef guard. I want only the calls to be guarded, not the function body, so that I have finer control over the calls. The calls can be inside an if condition or a while loop. The rule for the macro name: the name begins with MACRO_ and ends with the line number of the call to foo() in the original code.
This is to be automated inside a tool, and I am looking for a compiler that can unparse C code for doing this.
Example:
Input source file
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
Desired output
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
#ifdef MACRO_1
foo(a);
#endif
printf("doing something again");
#ifdef MACRO_2
foo(a);
#endif
return 0;
}
For SIMPLE source code, you can obviously do this with a simple script and some regexps in your favourite scripting language (perl, php, awk, python, etc). But it gets increasingly difficult if you decide to support, for example, function calls inside if-statements, member function calls, etc. [and want to end up with output code that actually compiles to a correct program].
In that case, you need something that can read (and "understand") C or C++ and produce some intermediate form that you can then process and re-emit as source code with modifications. It's far from easy to write such code, no matter where you start from. One solution may be to use Clang as a library. It has facilities to rewrite C or C++ code from its Abstract Syntax Tree (AST) form. This link shows an example of such a rewriter: http://eli.thegreenplace.net/2012/06/08/basic-source-to-source-transformation-with-clang
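As a very rough sketch of what that rewriter-based approach could look like for this particular task (recent Clang API; the class name is made up, and it deliberately handles only calls that form a whole statement):
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"
#include <string>

// Sketch: wrap every statement-level call to foo() in #ifdef MACRO_<line> ... #endif,
// where <line> is the line number in the file being rewritten.
class WrapFooCalls : public clang::RecursiveASTVisitor<WrapFooCalls> {
public:
  explicit WrapFooCalls(clang::Rewriter &R) : TheRewriter(R) {}
  bool VisitCallExpr(clang::CallExpr *CE) {
    const clang::FunctionDecl *Callee = CE->getDirectCallee();
    if (!Callee || Callee->getName() != "foo")
      return true;
    clang::SourceManager &SM = TheRewriter.getSourceMgr();
    unsigned Line = SM.getSpellingLineNumber(CE->getBeginLoc());
    std::string Macro = "MACRO_" + std::to_string(Line);
    // Find the end of the statement (just past the semicolon) so #endif lands after it.
    clang::SourceLocation AfterSemi = clang::Lexer::findLocationAfterToken(
        CE->getEndLoc(), clang::tok::semi, SM, TheRewriter.getLangOpts(),
        /*SkipTrailingWhitespaceAndNewLine=*/false);
    if (AfterSemi.isInvalid())
      return true;                 // not a plain "foo(...);" statement, leave it alone
    TheRewriter.InsertTextBefore(CE->getBeginLoc(), "#ifdef " + Macro + "\n");
    TheRewriter.InsertText(AfterSemi, "\n#endif");
    return true;
  }
private:
  clang::Rewriter &TheRewriter;
};
You would run this from an ASTConsumer and then write the Rewriter's buffer back out; the unbraced-if pitfall described below still applies, since an undefined macro can silently turn the next statement into the body of the if.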
I'm not sure exactly what you want to do if you have code like:
if (x)
foo();
bar();
Clearly, just putting an #if around the call to foo(); will cause bar() to be called only when x is true, which is probably not what you wanted...
You could customize some free software compiler. If using some recent GCC you could customize it with MELT (a Lispy domain specific language to extend gcc & g++ etc....).
You probably do not want to produce idiomatic C code. It would be much simpler to customize your compiler (e.g. GCC -or perhaps Clang/LLVM ...) to have the desired behavior.
Transforming some internal compiler representation (e.g. Gimple for GCC) is a bit simpler than outputting C code. It may still mean several weeks of work (because C and C++ are quite complex languages, and compilers have quite complex internal representations).
Notice that your question does not consider what happens when foo is called inside some macro (or inside some C++ template expansion, or perhaps even some inlined function). This shows why working on the intermediate representation(s) of your compiler is worthwhile.
BTW, you might perhaps be interested by coccinelle, a source to source free software transformer.
You could also in principle use Clang (to compile your C or C++ code to LLVM) then llvm-cbe (an experimental LLVM to C backend)
If the code is structured in such a way that the lines containing foo calls can simply be wrapped as they are, and more complex expressions (such as a foo(a) call nested inside another expression) need not be handled, you could use awk like this:
awk '/^\s*foo\(/ { print "#ifdef MACRO_" NR; print; print "#endif"; next } 1' filename.c
This will
/^\s*foo\(/ { # handle lines that begin with foo( preceded
# optionally by whitespaces specially by:
print "#ifdef MACRO_" NR # printing #ifdef MACRO_linenumber before
print
print "#endif" # and #endif after the line.
next
}
1 # all other lines are printed unchanged.
Be aware that this is a dirty, dirty hack that does not attempt to parse the C code properly. There are a number of ways you can break this, among them such things as
if(something)
foo(a);
and
foo(
a
);
That would come out as
if(something)
#ifdef MACRO_foo
foo(a);
#endif
and
#ifdef MACRO_foo
foo(
#endif
a
);
respectively. It may work for your particular case, but it is not a general C-code handling tool.
If the task is to exclude the calling of foo(int) from the code when some macro is undefined (or defined), maybe the following approach would work better:
void foo(int x){
#ifdef MACRO_foo
// do something
#endif
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
So you can just exclude the body of the function and leave the function calls in place throughout the whole program.
I think you are asking CIL to do things that CIL cannot do. Since it operates on preprocessed source code, it doesn't represent preprocessor directives, so you can't "put them into CIL representation" to be regenerated. You might be able to hack the CIL implementation itself to spit out your directives when it encountered your special circumstance, but it is hard to believe that such a hack would be general in any way.
You said you were looking for a "compiler that can unparse c code to do this". If you insist on "this" as involving specifically CIL, I think you are out of luck; there's only CIL itself to do this.
If you give up on CIL and will consider a different tool, then I think I have an answer, that will do CIL like things, can retain preprocessor directives in the representation (and/or allow you to insert them according to custom rules), and can regenerate valid C source code text.
That tool is our DMS Software Reengineering Toolkit, a general purpose program transformation engine, and its C Front End. DMS parses C code into ASTs, and unparses them back to valid source code, including retaining comments.
It can be used to do source-to-source transformations using mixtures of procedure calls on its AST manipulation library, and/or surface-syntax source-to-source rewrites.
DMS will capture preprocessor directives in that AST (they are just "more syntax!") in most cases without issues; sometimes you need to modify the source code slightly (permanently) to make preprocessor directives palatable. DMS provides symbol tables for C, and control and data flow analysis; these will need some revision to handle preprocessor conditionality.
To match what you are doing with CIL, you can ask DMS to do the preprocessing; you then end up with an AST that is preprocessor free. DMS's existing symbol tables and CF and DF machinery handle this case directly.
So you can carry out sophisticated operations on the AST using that additional information, in a way different from, but equivalent to, CIL. In addition, you can still modify the ASTs to insert preprocessor directives, which seems to be your key problem.
To achieve your specific effect of call-site specific conditionals, you can take advantage of DMS's surface syntax source-to-source transformation capability.
The following DMS transform does something like what you want:
rule wrap_function_call(i: Identifier, a:arguments ):statement -> statement
" \i(\a); "
->
" #ifdef \generate_macro_name\(\i\)
\i(\a);
#endif
"
if want_to_wrap(i);
This rule finds any syntax tree corresponding to a function call as a statement, and wraps it in a conditional. (You didn't say what you wanted to do if the function call was part of an expression; that case requires a bit more transformation but could also be handled.) A custom helper function generate_macro_name manufactures the macro name using the source position information associated with the identifier AST node matched for the function name. The transformation is conditioned on another custom helper function, want_to_wrap, which inspects each matched name to determine whether it is one that should be wrapped.
When done transforming the code, you call DMS's prettyprinter machinery to print the AST as source text.

Custom stacktrace implementation for ARM

I need to have a stacktrace in my program, which is written in C++ and runs on an ARM device. I can't find any reliable way to get a stacktrace, so I decided to write my own that will be as simple as possible, just to get something like the stacktrace gdb gives.
Here's an idea: write a macro that will push __FUNCTION__ and __PRETTY_FUNCTION__. There are several questions:
Suppose I have such a macro:
#define STACKTRACE_ENTER_FUNC \
... lock mutex
... push info into the global list
... set scope-exit handler to delete info at function exit
... unlock mutex
Now I need to place this macro in every function in my code. But there are too many of them. Is there any better way to achieve the goal or should I really change every function to include this macro:
void foo()
{
STACKTRACE_ENTER_FUNC;
...
}
void bar()
{
STACKTRACE_ENTER_FUNC;
...
}
The next question is: I can use __PRETTY_FUNCTION__ (because we use only a fixed version of gcc and the stacktrace implementation is only for debug builds on a fixed platform, so there are no cross-platform or compiler issues). I can even parse it a bit to split the string into the function name and the argument names. But how can I print all function arguments without knowing too much about them, like their types or how many there are? Like:
int foo(char x, float y)
{
PRINT_ARGS("arg1", "arg2"); // Gives me the string: "arg1 = 'A', arg2 = 13.37"
...
}
int main()
{
foo('A', 13.37);
...
}
P.S. If you know a better approach to get a stack trace in a running program on ARMv6, please let me know (compiler: arm-openwrt-linux-uclibcgnueabi-gcc 4.7.3, libc: uClibc-0.9.33.2).
Thanks in advance.
The easier solution is to drop down to assembly - stack traces don't exist at the C++ level anyway.
From an assembly perspective, you use a map of function addresses (which any linker can generate). The current instruction pointer identifies the top frame, and the return addresses identify the rest of the call stack. The tricky part is tail-call optimization, which is a bit philosophical (do you want the logical or the actual call stack?).
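If hand-rolling the frame walk feels too fragile, one commonly used middle ground (my suggestion, not something from the question, and it assumes your toolchain ships a working <unwind.h>) is to let libgcc's unwinder collect the return addresses: build with -funwind-tables and resolve the raw addresses afterwards against the linker map or with addr2line.
// Sketch: collect raw return addresses via libgcc's unwinder.
// _Unwind_Backtrace / _Unwind_GetIP come from <unwind.h>; whether _Unwind_GetIP
// is available on ARM EABI depends on the toolchain.
#include <unwind.h>
#include <cstdio>

static _Unwind_Reason_Code trace_cb(struct _Unwind_Context *ctx, void *arg) {
    int &depth = *static_cast<int *>(arg);
    std::printf("  #%d  %p\n", depth++,
                reinterpret_cast<void *>(_Unwind_GetIP(ctx)));
    return _URC_NO_REASON;
}

void print_backtrace() {
    int depth = 0;
    _Unwind_Backtrace(trace_cb, &depth);
}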

Instrumenting C/C++ codes using LLVM

I just read about the LLVM project and that it can be used to do static analysis on C/C++ code using the analyzer Clang, which is the front end of LLVM. I want to know whether it is possible to extract all accesses to memory (variables, local as well as global) in the source code using LLVM.
Is there any built-in library in LLVM that I could use to extract this information?
If not, please suggest how I should write functions to do this (existing source code, references, tutorials, examples...).
My thought is that I would first convert the source code into LLVM bitcode and then instrument it to do the analysis, but I don't know exactly how to do that.
I tried to figure out which representation I should use for my purpose (Clang's Abstract Syntax Tree (AST) or LLVM's SSA Intermediate Representation (IR)), but couldn't really decide which one to use.
Here is what I am trying to do.
Given any C/C++ program (like the one given below), I am trying to insert calls to some function before and after every instruction that reads/writes to/from memory. For example, consider the C++ program below (Account.cpp):
#include <stdio.h>
class Account {
int balance;
public:
Account(int b) {
balance = b;
}
int read() {
int r;
r = balance;
return r;
}
void deposit(int n) {
balance = balance + n;
}
void withdraw(int n) {
int r = read();
balance = r - n;
}
};
int main () {
Account* a = new Account(10);
a->deposit(1);
a->withdraw(2);
delete a;
}
So after the instrumentation my program should look like:
#include <stdio.h>
class Account {
int balance;
public:
Account(int b) {
balance = b;
}
int read() {
int r;
foo();
r = balance;
foo();
return r;
}
void deposit(int n) {
foo();
balance = balance + n;
foo();
}
void withdraw(int n) {
foo();
int r = read();
foo();
foo();
balance = r - n;
foo();
}
};
int main () {
Account* a = new Account(10);
a->deposit(1);
a->withdraw(2);
delete a;
}
where foo() may be any function, like getting the current system time or incrementing a counter, and so on. I understand that to insert calls like the above I will have to first get the IR and then run an instrumentation pass on the IR which will insert such calls into it, but I don't really know how to achieve this. Please suggest, with examples, how to go about it.
Also, I understand that once I compile the program into the IR, it would be really difficult to get a 1:1 mapping between my original program and the instrumented IR. So, is it possible to reflect the changes made in the IR (because of the instrumentation) back into the original program?
In order to get started with LLVM passes and how to write one on my own, I looked at an example of a pass that adds run-time checks to LLVM IR loads and stores, SAFECode's load/store instrumentation pass (http://llvm.org/viewvc/llvm-project/safecode/trunk/include/safecode/LoadStoreChecks.h?view=markup and http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/LoadStoreChecks.cpp?view=markup). But I couldn't figure out how to run this pass. Please give me the steps for running this pass on some program, say the Account.cpp above.
First off, you have to decide whether you want to work with clang or LLVM. They both operate on very different data structures which have advantages and disadvantages.
From your sparse description of your problem, I'll recommend going for optimization passes in LLVM. Working with the IR will make it much easier to sanitize, analyze and inject code because that's what it was designed to do. The downside is that your project will be dependent on LLVM which may or may not be a problem for you. You could output the result using the C backend but that won't be usable by a human.
Another important downside when working with optimization passes is that you also lose all symbols from the original source code. Even if the Value class (more on that later) has a getName method, you should never rely on it to contain anything meaningful. It's meant to help you debug your passes and nothing else.
You will also have to have a basic understanding of compilers. For example, it's a bit of a requirement to know about basic blocks and static single assignment form. Fortunately they're not very difficult concepts to learn or understand (the Wikipedia articles should be adequate).
Before you can start coding, you first have to do some reading, so here are a few links to get you started:
Architecture Overview: A quick architectural overview of LLVM. Will give you a good idea of what you're working with and whether LLVM is the right tool for you.
Documentation Head: Where you can find all the links below and more. Refer to this if I missed anything.
LLVM's IR reference: This is the full description of the LLVM IR which is what you'll be manipulating. The language is relatively simple so there isn't too much to learn.
Programmer's manual: A quick overview of basic stuff you'll need to know when working with LLVM.
Writing Passes: Everything you need to know to write transformation or analysis passes.
LLVM Passes: A comprehensive list of all the passes provided by LLVM that you can and should use. These can really help clean up the code and make it easier to analyze. For example, when working with loops, the lcssa, loop-simplify and indvars passes will save your life.
Value Inheritance Tree: This is the doxygen page for the Value class. The important bit here is the inheritance tree that you can follow to get the documentation for all the instructions defined in the IR reference page. Just ignore the ungodly monstrosity that they call the collaboration diagram.
Type Inheritance Tree: Same as above but for types.
Once you understand all that then it's cake. To find memory accesses? Search for store and load instructions. To instrument? Just create what you need using the proper subclass of the Value class and insert it before or after the store and load instruction. Because your question is a bit too broad, I can't really help you more than this. (See correction below)
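To make that last step concrete, here is a minimal sketch of such a transformation pass (legacy pass manager, reasonably recent LLVM; the names WrapMemAccess and wrap-mem are made up for the example):
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
using namespace llvm;

namespace {
struct WrapMemAccess : public FunctionPass {
  static char ID;
  WrapMemAccess() : FunctionPass(ID) {}
  bool runOnFunction(Function &F) override {
    Module *M = F.getParent();
    FunctionCallee Foo =
        M->getOrInsertFunction("foo", Type::getVoidTy(M->getContext()));
    bool Changed = false;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (isa<LoadInst>(&I) || isa<StoreInst>(&I)) {
          IRBuilder<> Before(&I);
          Before.CreateCall(Foo);               // call foo() before the access
          IRBuilder<> After(I.getNextNode());
          After.CreateCall(Foo);                // and again after it
          Changed = true;
        }
    return Changed;
  }
};
} // namespace
char WrapMemAccess::ID = 0;
static RegisterPass<WrapMemAccess> X("wrap-mem", "Call foo() around every load/store");
Built as a shared library, you would run it roughly like clang++ -O0 -emit-llvm -c Account.cpp -o Account.bc followed by opt -load ./WrapMemAccess.so -wrap-mem Account.bc -o Account.instr.bc (newer opt releases default to the new pass manager, so a flag such as -enable-new-pm=0 or a new-PM plugin may be needed). Note that foo itself has to be provided as an extern "C" function when linking, otherwise C++ name mangling won't match the symbol the pass inserts.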
By the way, I had to do something similar a few weeks ago. In about 2-3 weeks I was able to learn all I needed about LLVM, create an analysis pass to find memory accesses (and more) within a loop, and instrument them with a transformation pass I created. There were no fancy algorithms involved (except the ones provided by LLVM) and everything was pretty straightforward. Moral of the story: LLVM is easy to learn and work with.
Correction: I made an error when I said that all you have to do is search for load and store instructions.
The load and store instructions will only give you accesses that are made to the heap using pointers. In order to get all memory accesses you also have to look at the values, which can represent a memory location on the stack. Whether a value is written to the stack or kept in a register is determined during the register allocation phase, which occurs in an optimization pass of the backend. That means it is platform dependent and shouldn't be relied on.
Now unless you provide more information about what kind of memory accesses you're looking for, in what context, and how you intend to instrument them, I can't help you much more than this.
Since there has been no answer to your question after two days, I will offer this one, which is slightly but not completely off-topic.
As an alternative to LLVM, for static analysis of C programs, you may consider writing a Frama-C plug-in.
The existing plug-in that computes a list of inputs for a C function needs to visit every lvalue in the function's body. This is implemented in file src/inout/inputs.ml. The implementation is short (the complexity is in other plug-ins that provide their results to this one, e.g. resolving pointers) and can be used as a skeleton for your own plug-in.
A visitor for the Abstract Syntax Tree is provided by the framework. In order to do something special for lvalues, you simply define the corresponding method. The heart of the inputs plug-in is the method definition:
method vlval lv = ...
Here is an example of what the inputs plug-in does:
int a, b, c, d, *p;
main(){
p = &a;
b = c + *p;
}
The inputs of main() are computed thus:
$ frama-c -input t.c
...
[inout] Inputs for function main:
a; c; p;
More information about writing Frama-C plug-ins in general can be found here.