I have source code which looks like this,
void update();
void update()
{
}
I am trying to parse this code with Clang and modify it to this:
typedef float v4sf __attribute__((vector_size(16)));
void update(v4sf& v1, v4sf& v2);
void update(v4sf& v1, v4sf& v2)
{
}
I looked at the Rewriter classes of Clang. In the visitor function I wrote, shown below,
MyRecursiveASTVisitor::VisitFunctionDecl(FunctionDecl *f)
FunctionDecl has a setParams() method which I could use. I would have to create the parameters with this method:
static ParmVarDecl *Create(ASTContext &C, DeclContext *DC,
SourceLocation StartLoc,
SourceLocation IdLoc, IdentifierInfo *Id,
QualType T, TypeSourceInfo *TInfo,
StorageClass S, StorageClass SCAsWritten,
Expr *DefArg);
The first four arguments to the Create function can be obtained from the FunctionDecl. I am not sure what the rest of them have to be.
How do I create types and also assign values to them in Clang? The types need not be built-in and could be like the one added (v4sf) in the transformed source code.
Is this (using Clang methods) the right way to do transformations, or can I use Rewriter::InsertText() to add the parameters?
Clang is not designed to support mutation of its AST, and it does not support re-exporting the AST as source code (preserving comments, macros and preprocessor directives). Adding AST nodes manually is likely to violate AST invariants, which can lead to crashes. You should use the Rewriter to perform rewrites of the source code, based on information you extract from the AST.
If you still want to perform AST modifications, you should do so by rebuilding the part of the AST you wish to modify, rather than changing it in place. The rebuild steps should be performed by calling methods on Sema, which knows how to provide the appropriate invariants when building the AST.
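For illustration, a Rewriter-based version of the transformation in the question might look roughly like the sketch below. It assumes the usual RecursiveASTVisitor/Rewriter boilerplate (a clang::Rewriter member called TheRewriter, initialized with the SourceManager and LangOptions); the exact TypeLoc API differs a little between Clang versions, so treat this as a starting point rather than working code.
bool MyRecursiveASTVisitor::VisitFunctionDecl(clang::FunctionDecl *f) {
  // Only touch the zero-parameter declarations/definitions of update().
  if (f->getNameAsString() != "update" || f->getNumParams() != 0)
    return true;

  if (clang::TypeSourceInfo *TSI = f->getTypeSourceInfo()) {
    clang::FunctionTypeLoc FTL =
        TSI->getTypeLoc().getAs<clang::FunctionTypeLoc>();
    if (!FTL.isNull()) {
      // Splice the new parameter text just before the closing ')'.
      TheRewriter.InsertTextBefore(FTL.getRParenLoc(), "v4sf& v1, v4sf& v2");
    }
  }
  return true;
}
The typedef can be emitted once, textually as well, e.g. with TheRewriter.InsertTextBefore(SM.getLocForStartOfFile(SM.getMainFileID()), "typedef float v4sf __attribute__((vector_size(16)));\n"), where SM is the SourceManager. This sidesteps ParmVarDecl::Create entirely, which is exactly what the advice above suggests.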
Related
PROBLEM:
I currently have a traditional module instrumentation pass that
inserts new function calls into a given IR according to some logic
(inserted functions are external from a small lib that is later linked
to given program). Running experiments, my overhead is from
the cost of executing a function call to the library function.
What I am trying to do:
I would like to inline these function bodies into the IR of
the given program to get rid of this bottleneck. I assume an intrinsic
would be a clean way of doing this, since an intrinsic function would
be expanded to its function body when being lowered to ASM (please
correct me if my understanding is incorrect here, this is my first
time working with intrinsics/LTO).
Current Status:
My original library call definition:
void register_my_mem(void *user_vaddr){
... C code ...
}
So far:
I have created a def in: llvm-project/llvm/include/llvm/IR/IntrinsicsX86.td
let TargetPrefix = "x86" in {
def int_x86_register_mem : GCCBuiltin<"__builtin_register_my_mem">,
Intrinsic<[], [llvm_anyint_ty], []>;
}
Added another def in:
otwm/llvm-project/clang/include/clang/Basic/BuiltinsX86.def
TARGET_BUILTIN(__builtin_register_my_mem, "vv*", "", "")
Added my library source (*.c, *.h) to the compiler-rt/lib/test_lib
and added to CMakeLists.txt
Replaced the function insertion with trying to insert the intrinsic
instead in: llvm/lib/Transforms/Instrumentation/myModulePass.cpp
WAS:
FunctionCallee sm_func =
curr_inst->getModule()->getOrInsertFunction("register_my_mem",
func_type);
ArrayRef<Value*> args = {
builder.CreatePointerCast(sm_arg_val, currType->getPointerTo())
};
builder.CreateCall(sm_func, args);
NEW:
Intrinsic::ID aREGISTER(Intrinsic::x86_register_my_mem);
Function *sm_func = Intrinsic::getDeclaration(currFunc->getParent(),
aREGISTER, func_type);
ArrayRef<Value*> args = {
builder.CreatePointerCast(sm_arg_val, currType->getPointerTo())
};
builder.CreateCall(sm_func, args);
Questions:
If my logic for inserting the intrinsic functions shouldn't be in a module pass, where do I put it?
Am I confusing LTO with intrinsics?
Do I put my library function definitions into the following files, as mentioned in http://lists.llvm.org/pipermail/llvm-dev/2017-June/114322.html, for example as EmitRegisterMyMem()?
clang/lib/CodeGen/CodeGenFunction.cpp - define llvm::Intrinsic::ID
clang/lib/CodeGen/CodeGenFunction.h - declare llvm::Intrinsic::ID
My modified LLVM builds, so the definitions are at least well-formed, but currently, when trying to insert this function call, LLVM segfaults saying "Not a valid type for function argument!"
I'm seeing multiple issues here.
Indeed, you're confusing LTO with intrinsics. Intrinsics are special "functions" that are either expanded into special instructions by a backend or lowered to library function calls. That is not the mechanism you want here. You don't need an intrinsic at all; you just need to inline the function call in question, either by hand (from your module pass) or via LTO, indeed.
The particular error comes because you declared your intrinsic as receiving an integer argument (and that is what the declaration will look like), but you are:
asking for the declaration of an overloaded intrinsic with an invalid type (I'd assume your func_type is a non-integer type)
passing a pointer argument
Hope this makes the issue clear.
See also: https://llvm.org/docs/LinkTimeOptimization.html
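To make the "inline it by hand from your module pass" option concrete, here is a hedged sketch. The variables currFunc, builder, sm_arg_val, currType and func_type are the ones from the question; the sketch further assumes the runtime library has been compiled to bitcode and linked into the module (e.g. with llvm-link or LTO) so that register_my_mem has a body available to inline. The InlineFunction signature shown is the LLVM 11+ one; older releases take a CallSite instead.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Cloning.h" // llvm::InlineFunction, InlineFunctionInfo

// Emit the plain call, exactly as in the "WAS" snippet.
llvm::Module *M = currFunc->getParent();
llvm::FunctionCallee Callee = M->getOrInsertFunction("register_my_mem", func_type);
llvm::CallInst *Call = builder.CreateCall(
    Callee, {builder.CreatePointerCast(sm_arg_val, currType->getPointerTo())});

// If the callee's body is present in this module, inline it on the spot
// instead of relying on an intrinsic.
if (auto *F = llvm::dyn_cast<llvm::Function>(Callee.getCallee())) {
  if (!F->isDeclaration()) {
    llvm::InlineFunctionInfo IFI;
    llvm::InlineFunction(*Call, IFI);
  }
}
If you prefer to leave the inlining to LTO instead, it is enough to keep the plain call and build both the instrumented program and the runtime library with -flto, as described in the link above.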
Thank you for clearing up the issue, @Anton Korobeynikov.
After reading your explanation, I also believe that I have to use LTO to accomplish what I am trying to do. I especially found this link very useful: https://llvm.org/docs/LinkTimeOptimization.html. It seems that I am now on the right path.
I am trying to find all places in a large and old code base where certain constructors or functions are called. Specifically, these are certain constructors and member functions in the std::string class (that is, basic_string<char>). For example, suppose there is a line of code:
std::string foo(fiddle->faddle(k, 9).snark);
In this example, it is not obvious looking at this that snark may be a char *, which is what I'm interested in.
Attempts To Solve This So Far
I've looked into some of the dump features of gcc, and generated some of them, but I haven't been able to find any that tell me that a given line of code will generate a call to the string constructor taking a const char *. I've also compiled some code with -S to save the generated equivalent assembly code. But this suffers from two things: the function names are "mangled," so it's impossible to know what is being called in C++ terms; and there are no line numbers of any sort, so even finding the equivalent place in the source file would be tough.
Motivation and Background
In my project, we're porting a large, old code base from HP-UX (and their aCC C++ compiler) to RedHat Linux and gcc/g++ v.4.8.5. The HP tool chain allowed one to initialize a string with a NULL pointer, treating it as an empty string. The Gnu tools' generated code fails with some flavor of a null dereference error. So we need to find all of the potential cases of this, and remedy them. (For example, by adding code to check for NULL and using a pointer to a "" string instead.)
So if anyone out there has had to deal with the base problem and can offer other suggestions, those, too, would be welcomed.
Have you considered using static analysis?
Clang has one called the Clang Static Analyzer, which is extensible.
You can write a custom plugin that checks for this particular behavior by implementing a Clang AST visitor that looks for string variable declarations and checks whether they are initialized from a null pointer.
There is a manual for that here.
See also: https://github.com/facebook/facebook-clang-plugins/blob/master/analyzer/DanglingDelegateFactFinder.cpp
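If a full analyzer plugin feels heavyweight, a standalone LibTooling tool built on AST matchers can also report every place a std::string is constructed from a character pointer. The sketch below is illustrative only: the file name, option category and matcher choices are mine, and the CommonOptionsParser::create factory needs a reasonably recent Clang (older versions use the CommonOptionsParser constructor).
// find-string-from-charptr.cpp -- illustrative LibTooling sketch.
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/Error.h"

using namespace clang;
using namespace clang::ast_matchers;
using namespace clang::tooling;

namespace {
class Report : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    const auto *E = Result.Nodes.getNodeAs<CXXConstructExpr>("ctor");
    if (!E)
      return;
    E->getBeginLoc().print(llvm::outs(), *Result.SourceManager);
    llvm::outs() << ": std::string constructed from a char pointer\n";
  }
};
} // namespace

static llvm::cl::OptionCategory Cat("find-string-from-charptr");

int main(int argc, const char **argv) {
  auto Options = CommonOptionsParser::create(argc, argv, Cat);
  if (!Options) {
    llvm::errs() << llvm::toString(Options.takeError()) << "\n";
    return 1;
  }
  ClangTool Tool(Options->getCompilations(), Options->getSourcePathList());

  // A basic_string constructor call whose first argument has character-pointer type.
  auto M = cxxConstructExpr(
               hasDeclaration(cxxConstructorDecl(
                   ofClass(hasName("::std::basic_string")))),
               hasArgument(0, hasType(pointsTo(isAnyCharacter()))))
               .bind("ctor");

  Report R;
  MatchFinder Finder;
  Finder.addMatcher(M, &R);
  return Tool.run(newFrontendActionFactory(&Finder).get());
}
Run it over your compilation database (compile_commands.json) and you get file:line:column locations that you can then inspect for possible NULL arguments.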
First I'd create a header like this:
#include <string>
class dbg_string : public std::string {
public:
    using std::string::string;
    dbg_string(const char*) = delete;
};
#define string dbg_string
Then modify your makefile and add "-include dbg_string.h" to your CFLAGS to force-include it into each source file without modification.
You could also check how NULL is defined on your platform and add a specific deleted overload for it (e.g. dbg_string(int)), as sketched below.
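For completeness, a variant of that header with the extra deleted overloads could look like this (illustrative; which overloads you need depends on whether NULL expands to 0, 0L or nullptr on your platform):
#include <cstddef>
#include <string>

class dbg_string : public std::string {
public:
    using std::string::string;
    dbg_string(const char*) = delete;     // flag construction from any char pointer
    dbg_string(int) = delete;             // catches NULL when it expands to 0
    dbg_string(long) = delete;            // catches NULL when it expands to 0L
    dbg_string(std::nullptr_t) = delete;  // catches a nullptr-based NULL
};
#define string dbg_string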
You can try CppDepend and its CQLinq, a code query language, to detect where certain constructors/methods/fields/types are used.
from m in Methods where m.IsUsing ("CClassView.CClassView()") select new { m, m.NbLinesOfCode }
I have a file written in the C programming language that has been preprocessed using CIL. Now there are calls to a function, say foo(), in this file. I want to modify the C code in this file so that all calls to foo() are under an #ifdef guard. I want only the calls to be guarded and not the function body, so that I have finer control over the calls. The calls can be inside an if condition or a while loop. The rule for the macro name: the name begins with MACRO_ and ends with the line number of the call to foo() in the original code.
This is to be automated inside a tool, and I am looking for a compiler that can unparse C code to do this.
Example:
Input source file
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
Desired output
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
#ifdef MACRO_1
foo(a);
#endif
printf("doing something again");
#ifdef MACRO_2
foo(a);
#endif
return 0;
}
For SIMPLE source code, you can obviously do this with a simple script and some regexps in your favourite scripting language (perl, php, awk, python, etc). But it gets increasingly difficult if you decide to support, for example, function calls inside if-statements, member function calls, etc. [and want to end up with output code that actually compiles to a correct program].
In that case, you need something that can read (and "understand") C or C++ and produce some intermediate form that you can then process and reissue the source code with modifications. It's far from easy to write such code, no matter where you start from. One solution may be to use Clang as a library. It has facilities to rewrite C or C++ code from its Abstract Syntax Tree (AST) form. This link shows an example of such a rewriter: http://eli.thegreenplace.net/2012/06/08/basic-source-to-source-transformation-with-clang
I'm not sure exactly what you want to do if you have code like:
if (x)
foo();
bar();
Clearly, just inserting #if for the call to foo(); will cause the call to bar() to be called only when x is true, which is probably not what you wanted...
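For the straightforward cases, the call-wrapping part of such a Clang-based rewriter could look roughly like this. It is a sketch only: it assumes the RecursiveASTVisitor/Rewriter boilerplate from the linked tutorial, with a clang::Rewriter member named TheRewriter, and it deliberately ignores the brace-less if problem shown above as well as calls that are sub-expressions of larger statements.
#include "clang/AST/Decl.h"
#include "clang/AST/Expr.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"

bool MyVisitor::VisitCallExpr(clang::CallExpr *Call) {
  const clang::FunctionDecl *Callee = Call->getDirectCallee();
  if (!Callee || Callee->getNameAsString() != "foo")
    return true;

  clang::SourceManager &SM = TheRewriter.getSourceMgr();
  unsigned Line = SM.getSpellingLineNumber(Call->getBeginLoc());

  // #ifdef MACRO_<original line number> before the call...
  TheRewriter.InsertTextBefore(Call->getBeginLoc(),
                               "#ifdef MACRO_" + std::to_string(Line) + "\n");

  // ...and #endif after the ';' that terminates the call statement.
  clang::SourceLocation AfterSemi = clang::Lexer::findLocationAfterToken(
      Call->getEndLoc(), clang::tok::semi, SM, TheRewriter.getLangOpts(),
      /*SkipTrailingWhitespaceAndNewLine=*/false);
  if (AfterSemi.isValid())
    TheRewriter.InsertTextBefore(AfterSemi, "\n#endif");
  return true;
}
When traversal is done, TheRewriter.getEditBuffer(SM.getMainFileID()).write(llvm::outs()) emits the modified source.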
You could customize some free-software compiler. If using a recent GCC, you could customize it with MELT (a Lispy domain-specific language to extend gcc & g++).
You probably do not want to produce idiomatic C code. It would be much simpler to customize your compiler (e.g. GCC -or perhaps Clang/LLVM ...) to have the desired behavior.
Transforming some internal compiler representation (e.g. Gimple for GCC) is a bit simpler than outputting C code. It may still mean several weeks of work (because C and C++ are quite complex languages, and compilers have quite complex internal representations).
Notice that your question does not consider what happens when foo is called inside some macro (or inside some C++ template expansion, or perhaps even some inlined function). This shows why working on the intermediate representation(s) of your compiler is worthwhile.
BTW, you might perhaps be interested in Coccinelle, a free-software source-to-source transformer.
You could also in principle use Clang (to compile your C or C++ code to LLVM), then llvm-cbe (an experimental LLVM-to-C backend).
If the code is structured in such a way that the lines containing foo calls can simply be wrapped as whole lines, and more complex expressions such as bar(), foo(a) need not be handled, you could use awk like this:
awk '/^\s*foo\(/ { print "#ifdef MACRO_" NR; print; print "#endif"; next } 1' filename.c
This will
/^\s*foo\(/ { # handle lines that begin with foo( preceded
# optionally by whitespaces specially by:
print "#ifdef MACRO_" NR # printing #ifdef MACRO_linenumber before
print
print "#endif" # and #endif after the line.
next
}
1 # all other lines are printed unchanged.
Be aware that this is a dirty, dirty hack that does not attempt to parse the C code properly. There are a number of ways you can break this, among them such things as
if(something)
    foo(a);
and
foo(
    a
);
That would come out as
if(something)
#ifdef MACRO_<linenumber>
    foo(a);
#endif
and
#ifdef MACRO_<linenumber>
foo(
#endif
    a
);
respectively. It may work for your particular case, but it is not a general C-code handling tool.
If the task is to exclude the calls to foo(int) when some macro is undefined (or defined), maybe the following approach will work better:
void foo(int x){
#ifdef MACRO_foo
// do something
#endif
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
So you can just exclude the body of the function and leave the function calls in place throughout the program.
I think you are asking CIL to do things that CIL cannot do. Since it operates on preprocessed source code, it doesn't represent preprocessor directives, so you can't "put them into the CIL representation" to be regenerated. You might be able to hack the CIL implementation itself to spit out your directives when it encounters your special circumstance, but it is hard to believe that such a hack would be general in any way.
You said you were looking for a "compiler that can unparse c code to do this". If you insist on "this" as involving specifically CIL, I think you are out of luck; there's only CIL itself to do this.
If you give up on CIL and will consider a different tool, then I think I have an answer that will do CIL-like things, can retain preprocessor directives in the representation (and/or allow you to insert them according to custom rules), and can regenerate valid C source code text.
That tool is our DMS Software Reengineering Toolkit, a general-purpose program transformation engine, and its C Front End. DMS parses C code into ASTs and unparses them back to valid source code, including retaining comments.
It can be used to do source-to-source transformations using mixtures of procedure calls on its AST manipulation library, and/or surface-syntax source-to-source rewrites.
DMS will capture preprocessor directives in that AST (they are just "more syntax"!) in most cases without issues; sometimes you need to modify the source code slightly (permanently) to make the preprocessor directives palatable. DMS provides symbol tables for C, and control and data flow analysis; these will need some revision to handle preprocessor conditionality.
To match what you are doing with CIL, you can ask DMS to do the preprocessing; now you end up with an AST that is preprocessor free. DMS's existing symbol tables, CF and DF machinery handle this case directly, now.
So you can carry out sophisticated operations on the AST using that additional information, in way different than but equivalent to CIL. In addition, you can still modify the ASTs to insert preprocessor directives, which seems to be your key problem.
To achieve your specific effect of call-site specific conditionals, you can take advantage of DMS's surface syntax source-to-source transformation capability.
The following DMS transform does something like what you want:
rule wrap_function_call(i: Identifier, a:arguments ):statement -> statement
" \i(\a); "
->
" #ifdef \generate_macro_name\(\i\)
\i(\a);
#endif
"
if want_to_wrap(i);
This rule finds any syntax tree corresponding to a function call as a statement, and wraps it in a conditional. (You didn't say what you wanted to do if the function call was part of an expression; that case requires a bit more transformation but could also be handled.) A custom helper function generate_macro_name manufactures the macro name using the source position information associated with the identifier AST node matched for the function name. The transformation is conditioned on another custom helper function, want_to_wrap, which inspects each matched name to determine whether it is one that should be wrapped.
When done transforming the code, you call DMS's prettyprinter machinery to print the AST as source text.
I just read about the LLVM project and that it can be used to do static analysis on C/C++ code using Clang, which is the front end of LLVM. I wanted to know if it is possible to extract all accesses to memory (variables, local as well as global) in the source code using LLVM.
Is there any built-in library in LLVM which I could use to extract this information?
If not, please suggest how to write functions to do the same (existing source code, references, tutorials, examples...).
What I have thought of is to first convert the source code into LLVM bitcode and then instrument it to do the analysis, but I don't know exactly how to do it.
I tried to figure out which IR I should use for my purpose (Clang's Abstract Syntax Tree (AST) or LLVM's SSA Intermediate Representation (IR)), but couldn't really decide which one to use.
Here is what I am trying to do.
Given any C/C++ program (like the one given below), I am trying to insert calls to some function, before and after every instruction that reads/writes to/from memory. For example, consider the below C++ program (Account.cpp):
#include <stdio.h>
class Account {
int balance;
public:
Account(int b) {
balance = b;
}
int read() {
int r;
r = balance;
return r;
}
void deposit(int n) {
balance = balance + n;
}
void withdraw(int n) {
int r = read();
balance = r - n;
}
};
int main () {
Account* a = new Account(10);
a->deposit(1);
a->withdraw(2);
delete a;
}
So after the instrumentation my program should look like:
#include <stdio.h>
class Account {
int balance;
public:
Account(int b) {
balance = b;
}
int read() {
int r;
foo();
r = balance;
foo();
return r;
}
void deposit(int n) {
foo();
balance = balance + n;
foo();
}
void withdraw(int n) {
foo();
int r = read();
foo();
foo();
balance = r - n;
foo();
}
};
int main () {
Account* a = new Account(10);
a->deposit(1);
a->withdraw(2);
delete a;
}
where foo() may be any function, like getting the current system time or incrementing a counter, and so on. I understand that to insert calls like the ones above I will have to first get the IR and then run an instrumentation pass on it that inserts such calls, but I don't really know how to achieve this. Please suggest, with examples, how to go about it.
Also, I understand that once I compile the program into the IR, it would be really difficult to get a 1:1 mapping between my original program and the instrumented IR. So, is it possible to reflect the changes made in the IR (because of instrumentation) back into the original program?
In order to get started with LLVM passes and how to write one on my own, I looked at an example of a pass that adds run-time checks to LLVM IR loads and stores: SAFECode's load/store instrumentation pass (http://llvm.org/viewvc/llvm-project/safecode/trunk/include/safecode/LoadStoreChecks.h?view=markup and http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/LoadStoreChecks.cpp?view=markup). But I couldn't figure out how to run this pass. Please give me the steps to run this pass on some program, say the above Account.cpp.
First off, you have to decide whether you want to work with clang or LLVM. They both operate on very different data structures which have advantages and disadvantages.
From your sparse description of your problem, I'll recommend going for optimization passes in LLVM. Working with the IR will make it much easier to sanitize, analyze and inject code because that's what it was designed to do. The downside is that your project will be dependent on LLVM which may or may not be a problem for you. You could output the result using the C backend but that won't be usable by a human.
Another important downside when working with optimization passes is that you also lose all symbols from the original source code. Even if the Value class (more on that later) has a getName method, you should never rely on it to contain anything meaningful. It's meant to help you debug your passes and nothing else.
You will also have to have a basic understanding of compilers. For example, it's a bit of a requirement to know about basic blocks and static single assignment form. Fortunately they're not very difficult concepts to learn or understand (the Wikipedia articles should be adequate).
Before you can start coding, you first have to do some reading so here's a few links to get you started:
Architecture Overview: A quick architectural overview of LLVM. Will give you a good idea of what you're working with and whether LLVM is the right tool for you.
Documentation Head: Where you can find all the links below and more. Refer to this if I missed anything.
LLVM's IR reference: This is the full description of the LLVM IR which is what you'll be manipulating. The language is relatively simple so there isn't too much to learn.
Programmer's manual: A quick overview of basic stuff you'll need to know when working with LLVM.
Writing Passes: Everything you need to know to write transformation or analysis passes.
LLVM Passes: A comprehensive list of all the passes provided by LLVM that you can and should use. These can really help clean up the code and make it easier to analyze. For example, when working with loops, the lcssa, loop-simplify and indvars passes will save your life.
Value Inheritance Tree: This is the doxygen page for the Value class. The important bit here is the inheritance tree that you can follow to get the documentation for all the instructions defined in the IR reference page. Just ignore the ungodly monstrosity that they call the collaboration diagram.
Type Inheritance Tree: Same as above but for types.
Once you understand all that then it's cake. To find memory accesses? Search for store and load instructions. To instrument? Just create what you need using the proper subclass of the Value class and insert it before or after the store and load instruction. Because your question is a bit too broad, I can't really help you more than this. (See correction below)
By the way, I had to do something similar a few weeks ago. In about 2-3 weeks I was able to learn all I needed about LLVM, create an analysis pass to find memory accesses (and more) within a loop and instrument them with a transformation pass I created. There were no fancy algorithms involved (except the ones provided by LLVM) and everything was pretty straightforward. Moral of the story is that LLVM is easy to learn and work with.
Correction: I made an error when I said that all you have to do is search for load and store instructions.
The load and store instructions will only give you accesses that are made to memory through pointers. In order to get all memory accesses you also have to look at the values, which can represent a memory location on the stack. Whether a value is written to the stack or kept in a register is determined during the register allocation phase, which occurs in an optimization pass of the backend. Meaning that it's platform dependent and shouldn't be relied on.
Now unless you provide more information about what kind of memory accesses you're looking for, in what context, and how you intend to instrument them, I can't help you much more than this.
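Since the question also asks for concrete steps, here is a minimal legacy-pass-manager sketch of the kind of instrumentation described above. The pass and function names are mine; it assumes foo is an external void foo(void) that you link in later, and it only handles explicit load/store instructions, per the correction above.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct InstrumentMemOps : public FunctionPass {
  static char ID;
  InstrumentMemOps() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    Module *M = F.getParent();
    // void foo(void), resolved at link time against your runtime library.
    FunctionCallee Foo = M->getOrInsertFunction(
        "foo", FunctionType::get(Type::getVoidTy(F.getContext()), false));

    // Collect first, then modify, so we don't iterate over what we insert.
    SmallVector<Instruction *, 16> MemOps;
    for (Instruction &I : instructions(F))
      if (isa<LoadInst>(I) || isa<StoreInst>(I))
        MemOps.push_back(&I);

    for (Instruction *I : MemOps) {
      IRBuilder<> B(I);
      B.CreateCall(Foo);                  // call before the memory access
      B.SetInsertPoint(I->getNextNode()); // loads/stores always have a successor
      B.CreateCall(Foo);                  // call after the memory access
    }
    return !MemOps.empty();
  }
};
} // namespace

char InstrumentMemOps::ID = 0;
static RegisterPass<InstrumentMemOps> X("instr-mem", "Insert foo() around loads and stores");
To run it, compile the program to bitcode (clang++ -O0 -emit-llvm -c Account.cpp -o Account.bc), build the pass as a shared library, and load it into opt, e.g. opt -load ./InstrumentMemOps.so -instr-mem Account.bc -o Account.instr.bc (with the new pass manager the loading flags differ). Linking the result against a definition of foo gives you the instrumented binary. Note that, per the answer above, the instrumentation lives at the IR level; it is not reflected back into the original C++ source.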
Since there has been no answer to your question after two days, I will offer this one, which is slightly but not completely off-topic.
As an alternative to LLVM, for static analysis of C programs, you may consider writing a Frama-C plug-in.
The existing plug-in that computes a list of inputs for a C function needs to visit every lvalue in the function's body. This is implemented in file src/inout/inputs.ml. The implementation is short (the complexity is in other plug-ins that provide their results to this one, e.g. resolving pointers) and can be used as a skeleton for your own plug-in.
A visitor for the Abstract Syntax Tree is provided by the framework. In order to do something special for lvalues, you simply define the corresponding method. The heart of the inputs plug-in is the method definition:
method vlval lv = ...
Here is an example of what the inputs plug-in does:
int a, b, c, d, *p;
main(){
p = &a;
b = c + *p;
}
The inputs of main() are computed thus:
$ frama-c -input t.c
...
[inout] Inputs for function main:
a; c; p;
More information about writing Frama-C plug-ins in general can be found here.
In some of the LLVM tutorials I've seen, it's fairly easy to bind C functions into a custom language based on LLVM. LLVM hands the programmer a pointer to the function that can then be mixed in with the code being generated by LLVM.
What's the best method to do this with C++ libraries? Let's say I have a fairly complex library like Qt or Boost that I want to bind to my custom language. Do I need to create a stub library (like Python or Lua require), or does LLVM offer some sort of foreign function interface (FFI)?
In my LLVM code, I create extern "C" wrapper functions for this, and insert LLVM function declarations into the module in order to call them. Then, a good way to make LLVM know about the functions is not to let it use dlopen and search for the function name in the executing binary (this is a pain in the ass, since the function names need to be in the .dynsym section, and it is slow too), but to do the mapping manually, using ExecutionEngine::addGlobalMapping.
Just get the llvm::Function* of that declaration and the address of the function as given in C++ by &functionname converted to void* and pass these two things along to LLVM. The JIT executing your stuff will then know where to find the function.
For example, if you wanted to wrap QString you could create several functions that create, destroy and call member functions of such an object:
extern "C" void createQString(void *p, char const*v) {
new (p) QString(v); // placement-new
}
extern "C" int32_t countQString(void *p) {
QString *q = static_cast<QString*>(p);
return q->count();
}
extern "C" void destroyQString(void *p) {
QString *q = static_cast<QString*>(p);
q->~QString();
}
And create proper declarations and a mapping. Then you can call these functions, passing along a memory region suitably aligned and sized for QString (possibly alloca'ed) and an i8* pointing to the C string data for initialization.
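As a rough illustration of the "declarations and a mapping" step: the module, engine and builder variables below are assumptions, the typed-pointer IRBuilder API is shown, and the details differ between the legacy JIT, MCJIT and ORC (ORC uses symbol resolvers / absoluteSymbols instead of addGlobalMapping).
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Somewhere after creating `module` (llvm::Module*), `engine`
// (llvm::ExecutionEngine*) and an IRBuilder `builder`:
llvm::Function *countDecl = llvm::Function::Create(
    llvm::FunctionType::get(builder.getInt32Ty(),
                            {builder.getInt8PtrTy()}, /*isVarArg=*/false),
    llvm::Function::ExternalLinkage, "countQString", module);

// Tell the JIT where the C++ wrapper (countQString above) actually lives,
// instead of relying on dlopen/dlsym lookups in the running binary.
engine->addGlobalMapping(countDecl,
                         reinterpret_cast<void *>(&countQString));

// Generated code can now simply do: builder.CreateCall(countDecl, {objPtr});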
If you compile some code in C++ and some in another language to LLVM bitcode, it should be perfectly possible to link these together and let one call the other... in theory.
In practice, you will need glue code to convert between the different language's types (e.g. there is no equivalent to a Python string in C++ unless you use CPython, so for void reverse(std::string s) to be callable with a str you need a conversion - worse, the whole object model is very different). And Qt specifically has a lot of magic that may require much more effort to expose after compilations. Also, there may be further potential problems I'm not aware of.
And even if that works, it's potentially very ugly to use. There are still get* and set* functions all over PyQt despite Python's very convenient descriptors - and much effort went into PyQt; they didn't just create some stubs.