unparse the intermediate representation of c code back to c


I have a file written in C that has been preprocessed using CIL. The file contains calls to a function, say foo(). I want to modify the C code so that every call to foo() is wrapped in an #ifdef guard. I want only the calls to be guarded, not the function body, so that I have finer control over the individual calls. The calls can be inside an if condition or a while loop. The rule for the macro name: it begins with MACRO_ and ends with the line number of the call to foo() in the original code.
This is to be automated inside a tool, and I am looking for a compiler that can unparse C code to do this.
Example:
Input source file
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
Desired output
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
#ifdef MACRO_1
foo(a);
#endif
printf("doing something again");
#ifdef MACRO_2
foo(a);
#endif
return 0;
}

For SIMPLE source code, you can obviously do this with a simple script and some regexps in your favourite scripting language (perl, php, awk, python, etc). But it does get increasingly difficult if you start deciding to support for example function calls inside if-statements, member function calls, etc [and want to end up with output code that actually compiles to a correct program].
In that case, you need something that can read (and "understand") C or C++ and produce some intermediate form that you can then process and reissue as source code with modifications. It's far from easy to write such code, no matter where you start from. One solution may be to use Clang as a library. It has facilities to rewrite C or C++ code from its Abstract Syntax Tree (AST) form. This link shows an example of such a rewriter: http://eli.thegreenplace.net/2012/06/08/basic-source-to-source-transformation-with-clang
I'm not sure exactly what you want to do if you have code like:
if (x)
foo();
bar();
Clearly, just inserting #if for the call to foo(); will cause the call to bar() to be called only when x is true, which is probably not what you wanted...
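One way around that pitfall, whatever tool does the rewriting, is to emit braces around the guarded call so the if always keeps a body. A minimal sketch, assuming hypothetical counter functions foo() and bar() and a guard macro MACRO_3 that is deliberately left undefined:

```cpp
// Hypothetical counters standing in for the real side effects of foo() and bar().
int foo_calls = 0;
int bar_calls = 0;
void foo() { ++foo_calls; }
void bar() { ++bar_calls; }

// MACRO_3 is an assumed guard name; it is not defined here, so the call is compiled out.
void run(bool x) {
    if (x) {            // braces added by the rewriter
#ifdef MACRO_3
        foo();
#endif
    }                   // the if still has an (empty) body
    bar();              // stays unconditional
}
```

With the braces in place, bar() runs regardless of x even when the guard macro is undefined.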

You could customize some free software compiler. If using some recent GCC you could customize it with MELT (a Lispy domain specific language to extend gcc & g++ etc....).
You probably do not want to produce idiomatic C code. It would be much simpler to customize your compiler (e.g. GCC -or perhaps Clang/LLVM ...) to have the desired behavior.
Transforming some internal compiler representation (e.g. Gimple for GCC) is a bit simpler than outputting C code. It may still mean several weeks of work (because C and C++ are quite complex languages, and compilers have quite complex internal representations).
Notice that your question does not consider what happens when foo is called inside some macro (or inside some C++ template expansion, or perhaps even some inlined function). This shows why working on the intermediate representation(s) of your compiler is worthwhile.
BTW, you might be interested in Coccinelle, a free software source-to-source transformer.
You could also in principle use Clang (to compile your C or C++ code to LLVM) and then llvm-cbe (an experimental LLVM-to-C backend).

If the code is structured so that each foo call sits on a line of its own and can simply be guarded as a whole line, and more complex expressions such as bar(), foo(a) need not be handled, you could use awk like this:
awk '/^\s*foo\(/ { print "#ifdef MACRO_" NR; print; print "#endif"; next } 1' filename.c
This will
/^\s*foo\(/ { # handle lines that begin with foo( preceded
# optionally by whitespaces specially by:
print "#ifdef MACRO_" NR # printing #ifdef MACRO_linenumber before
print
print "#endif" # and #endif after the line.
next
}
1 # all other lines are printed unchanged.
Be aware that this is a dirty, dirty hack that does not attempt to parse the C code properly. There are a number of ways you can break this, among them such things as
if(something)
foo(a);
and
foo(
a
);
That would come out as
if(something)
#ifdef MACRO_2
foo(a);
#endif
and
#ifdef MACRO_1
foo(
#endif
a
);
respectively. It may work for your particular case, but it is not a general C-code handling tool.
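If the line-based hack has to live inside a C++ tool rather than a shell pipeline, roughly the same idea can be sketched with std::regex. This is only a sketch under assumptions: guard_foo_calls is a made-up name, and the pattern is slightly stricter than the awk one in that it only guards lines that hold a complete foo(...); statement, so the multi-line call above is left alone instead of being broken. The unbraced if case is still not handled.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: wraps every line that consists solely of a call
// `foo(...);` in #ifdef MACRO_<line number> ... #endif.
// It does not parse C; it only pattern-matches whole lines.
std::string guard_foo_calls(const std::string& source) {
    static const std::regex call_line(R"(\s*foo\s*\(.*\)\s*;\s*)");
    std::istringstream in(source);
    std::ostringstream out;
    std::string line;
    int line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        if (std::regex_match(line, call_line)) {
            out << "#ifdef MACRO_" << line_no << "\n" << line << "\n#endif\n";
        } else {
            out << line << "\n";   // everything else is copied unchanged
        }
    }
    return out.str();
}
```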

If the task is to exclude calls to foo(int) from the code when some macro is undefined (or defined), maybe the following approach will work better:
void foo(int x){
#ifdef MACRO_foo
// do something
#endif
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
So you can just exclude the body of the function and leave the function calls unchanged throughout the program.

I think you are asking CIL to do things that CIL cannot do. Since it operates on preprocessed source code, it doesn't represent preprocessor directives, so you can't "put them into CIL representation" to be regenerated. You might be able to hack the CIL implementation itself to spit out your directives when it encountered your special circumstance, but it is hard to believe that such a hack would be general in any way.
You said you were looking for a "compiler that can unparse c code to do this". If you insist on "this" as involving specifically CIL, I think you are out of luck; there's only CIL itself to do this.
If you give up on CIL and will consider a different tool, then I think I have an answer that will do CIL-like things, can retain preprocessor directives in the representation (and/or allow you to insert them according to custom rules), and can regenerate valid C source code text.
That tool is our DMS Software Reengineering Toolkit, a general purpose program transformation engine, and its C Front End. DMS parses C code into ASTs, and unparses them back to valid source code, including retaining comments.
It can be used to do source-to-source transformations using mixtures of procedure calls on its AST manipulation library, and/or surface-syntax source-to-source rewrites.
DMS will capture preprocessor directives in that AST (they are just "more syntax") in most cases without issues; sometimes you need to modify the source code slightly (permanently) to make preprocessor directives palatable. DMS provides symbol tables for C, and control and data flow analysis; these will need some revision to handle preprocessor conditionality.
To match what you are doing with CIL, you can ask DMS to do the preprocessing; now you end up with an AST that is preprocessor free. DMS's existing symbol tables, CF and DF machinery handle this case directly, now.
So you can carry out sophisticated operations on the AST using that additional information, in a way different from but equivalent to CIL. In addition, you can still modify the ASTs to insert preprocessor directives, which seems to be your key problem.
To achieve your specific effect of call-site specific conditionals, you can take advantage of DMS's surface syntax source-to-source transformation capability.
The following DMS transform does something like what you want:
rule wrap_function_call(i: Identifier, a:arguments ):statement -> statement
" \i(\a); "
->
" #ifdef \generate_macro_name\(\i\)
\i(\a);
#endif
"
if want_to_wrap(i);
This rule finds any syntax tree corresponding to a function call as a statement, and wraps it in a conditional. (You didn't say what you wanted to do if the function call was part of an expression; that case requires a bit more transformation but could also be handled.) A custom helper function generate_macro_name manufactures the macro name using the source position information associated with the identifier AST node matched for the function name. The transformation is conditioned on another custom helper function, want_to_wrap, that inspects each matched name to determine whether it is one that should be wrapped.
When done transforming the code, you call DMS's prettyprinter machinery to print the AST as source text.

Related

Using MACROs to get the 'name' of function parameters

I've implemented a log function, that eventually is being used identically all over the code.
void func(int foo, int bar){
log_api_call("foo", foo, "bar", bar);
...
}
So I've decided to make it easier and just extract the variable names.
It would be something like
log_api_call(foo,bar)
or even better
log_api_call()
and it would expand to log_api_call("foo", foo, "bar",bar) somehow.
I have no idea even where to start to 'extract' the function variable names.
Help would be much appreciated.
Edit:
I understand that what I've asked previously is outside of the C++ preprocessor capabilities, but can C MACROS expand log_api(a,b) to log_api_call("a", a, "b", b) for any number of parameters?
For a fixed number of parameters the job is trivial.
Thanks.

This isn't actually too difficult.
I'd recommend a slight change in spec though; instead of:
expand log_api(a,b) to log_api_call("a", a, "b", b)
...it's more useful to expand something like NAMED_VALUES(a,b) to "a",a,"b",b. You can then call log_api(NAMED_VALUES(a,b)), but your log_api can stay more generic (e.g., log_api(NAMED_VALUES(a,b),"entering function") is possible). This approach also avoids a lot of complications about zero-argument cases.
// A preprocessor argument counter
#define COUNT(...) COUNT_I(__VA_ARGS__, 9, 8, 7, 6, 5, 4, 3, 2, 1,)
#define COUNT_I(_9,_8,_7,_6,_5,_4,_3,_2,_1,X,...) X
// Preprocessor paster
#define GLUE(A,B) GLUE_I(A,B)
#define GLUE_I(A,B) A##B
// chained caller
#define NAMED_VALUES(...) GLUE(NAMED_VALUES_,COUNT(__VA_ARGS__))(__VA_ARGS__)
// chain
#define NAMED_VALUES_1(a) #a,a
#define NAMED_VALUES_2(a,...) #a,a,NAMED_VALUES_1(__VA_ARGS__)
#define NAMED_VALUES_3(a,...) #a,a,NAMED_VALUES_2(__VA_ARGS__)
#define NAMED_VALUES_4(a,...) #a,a,NAMED_VALUES_3(__VA_ARGS__)
#define NAMED_VALUES_5(a,...) #a,a,NAMED_VALUES_4(__VA_ARGS__)
#define NAMED_VALUES_6(a,...) #a,a,NAMED_VALUES_5(__VA_ARGS__)
#define NAMED_VALUES_7(a,...) #a,a,NAMED_VALUES_6(__VA_ARGS__)
#define NAMED_VALUES_8(a,...) #a,a,NAMED_VALUES_7(__VA_ARGS__)
#define NAMED_VALUES_9(a,...) #a,a,NAMED_VALUES_8(__VA_ARGS__)
This supports up to 9 arguments, but it should be easy to see how to expand to more.
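To show how the expansion might be consumed, here is a small C++ sketch. The macros are a trimmed copy of the ones above (up to 3 arguments) so that the example compiles on its own; the log_api function here is purely hypothetical, takes alternating name/value arguments, and returns them formatted as name=value pairs rather than writing to any real log.

```cpp
#include <sstream>
#include <string>

// Trimmed copy of the macros above, limited to 3 arguments.
#define COUNT(...) COUNT_I(__VA_ARGS__, 3, 2, 1,)
#define COUNT_I(_3,_2,_1,X,...) X
#define GLUE(A,B) GLUE_I(A,B)
#define GLUE_I(A,B) A##B
#define NAMED_VALUES(...) GLUE(NAMED_VALUES_,COUNT(__VA_ARGS__))(__VA_ARGS__)
#define NAMED_VALUES_1(a) #a,a
#define NAMED_VALUES_2(a,...) #a,a,NAMED_VALUES_1(__VA_ARGS__)
#define NAMED_VALUES_3(a,...) #a,a,NAMED_VALUES_2(__VA_ARGS__)

// End of recursion: nothing left to append.
inline void append_pairs(std::ostringstream&) {}

// Consumes one name/value pair and recurses on the remaining arguments.
template <typename T, typename... Rest>
void append_pairs(std::ostringstream& out, const char* name, const T& value,
                  const Rest&... rest) {
    out << name << '=' << value;
    if (sizeof...(rest) > 0) out << ' ';
    append_pairs(out, rest...);
}

// Hypothetical logger: returns "name=value name=value ...".
template <typename... Args>
std::string log_api(const Args&... args) {
    std::ostringstream out;
    append_pairs(out, args...);
    return out.str();
}
```

With int foo = 1, bar = 2; the call log_api(NAMED_VALUES(foo, bar)) is seen by the compiler as log_api("foo", foo, "bar", bar).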

This is not possible in standard C++11 (or standard C11, which nearly shares its preprocessor with C++). The C or C++ preprocessor does not know the AST of the code passed to the compiler (because it runs before the actual parsing of your code).
I have no idea even where to start to 'extract' the function variable names.
Notice that variable and function names are known only at compilation time (after preprocessing). So if you want them, you need to work during compilation. At execution time variable and function names are generally lost (and you could strip your executable).
You could generate your C++ code (e.g. using some other preprocessor like GPP or M4, or writing your own thing).
You could customize your C++ compiler (e.g. with an extension in GCC MELT, or a GCC plugin) to e.g. have log_api_call invoke some new magic builtin (whose processing inside the compiler would do most of the job). This would take months and is very compiler specific, I don't think it is worth the pain.
You could parse DWARF debugging info (that would also take months, so I don't think it would be wise).
(I am implicitly thinking of C++ code compiled on a Linux system)
Read more about aspect-oriented programming.
If you want such powerful meta-programming facilities, C++ is the wrong programming language. Read more about the powerful macro system of Common Lisp...
but can C MACROS expand log_api(a,b) to log_api_call("a", a, "b", b) for any number of parameters? for defined number the job is trivial.
No. You need a more powerful preprocessor to do that job (or write your own). For that specific need, you might consider customizing your source code editor (e.g. write a hundred lines of ELisp code doing that extraction & expansion job at edit time for emacs).
PS: In practice you could find some library (probably Boost) that handles this up to some reasonable maximum number of arguments.

I think the best you can achieve from inside the language is writing a macro LOG_API_CALL(foo,bar) that expands to log_api_call("foo", foo, "bar", bar):
#define LOG_API_CALL(P1,P2) log_api_call(#P1,P1,#P2,P2)
This gets pretty tricky if you want to support arbitrarily many arguments with a single macro name, but you could also have a separate macro for each number of arguments.

and it would expand to log_api_call("foo", foo, "bar",bar) somehow.
This is not possible in Standard C++.

C/C++ preprocessor directive for handling compilation errors

The title might be somewhat confusing, so I'll try to explain.
Is there a preprocessor directive that I can wrap around a piece of code, so that if this piece of code contains a compilation error, some other piece of code is compiled instead?
Here is an example to illustrate my motivation:
#compile_if_ok
int a = 5;
a += 6;
int b = 7;
b += 8;
#else
int a = 5;
int b = 7;
a += 6;
b += 8;
#endif
The above example is not the problem I am dealing with, so please do not suggest specific solutions.
UPDATE:
Thank you for all the negative comments down there.
Here is the exact problem, perhaps someone with a little less negative approach will have an answer:
I'm trying to decide during compile-time whether some variable a is an array or a pointer.
I've figured I can use the fact that, unlike a pointer, an array is not a modifiable L-value (it cannot be assigned to).
So in essence, the following code would yield a compilation error for an array but not for a pointer:
int a[10];
a = (int*)5;
Can I somehow "leverage" this compilation error in order to determine that a is an array and not a pointer, without stopping the compilation process?
Thanks

No.
It's not uncommon for large C++ (and other-language) projects to have a "configuration" stage designed into their build system to attempt compilation of different snippets of code, generating a set of preprocessor definitions indicating which ones worked, so that the compilation of the project proper can then use the preprocessor definitions in #ifdef/#else/#endif statements to select between alternatives. For many UNIX/Linux software packages, running the "./configure" script coordinates this. You can read about the autoconf tool that helps create such scripts at http://www.gnu.org/software/autoconf/

This is not supported in standard C. However, many command shells make this fairly simple. For example, in bash, you can write a script such as:
#!/bin/bash
# Try to compile the program with Code0 defined.
# "$@" passes each script argument to cc as a separate word.
if cc -o program -DCode0= "$@"; then
    # That worked, do nothing extra. (Need some command here due to bash syntax.)
    true
else
    # The first compilation failed, try without Code0 defined.
    cc -o program "$@"
fi
./program
Then your source code can test whether Code0 is defined:
#if defined Code0
foo bar;
#else
#include <stdio.h>
int main(void)
{
printf("Hello, world.\n");
return 0;
}
#endif
However, there are usually better ways to, in effect, make source code responsive to the environment or the target platform.

On the updated question:
If you're writing C++, use templates...
Specifically, to test the type of a variable you have helpers: std::enable_if, std::is_same, std::is_pointer, etc.
See the type support module: http://en.cppreference.com/w/cpp/types
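For the array-versus-pointer case specifically, a minimal C++11 sketch could look like the following. The variables samples and cursor and the IS_ARRAY macro are names made up for illustration; the only library pieces used are std::is_array and std::is_pointer from <type_traits>.

```cpp
#include <type_traits>

// Hypothetical variables standing in for the `a` of the question.
int  samples[10];
int* cursor = samples;

// decltype applied to a plain variable name yields its declared type,
// so the array does not decay to a pointer here.
#define IS_ARRAY(var) (std::is_array<decltype(var)>::value)

// Checked entirely at compile time; no compilation error is "caught".
static_assert(IS_ARRAY(samples), "samples is an array");
static_assert(!IS_ARRAY(cursor), "cursor is not an array");
static_assert(std::is_pointer<decltype(cursor)>::value, "cursor is a pointer");
```

Because the result is a compile-time constant, it can also be used to select between overloads or template specializations rather than just being asserted.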

C11 _Generic macros might be able to handle this. If not, though, you're screwed in C.
Not in the C++ preprocessor. In C++ you can easily use overload resolution or a template or even expression SFINAE or anything like that to execute a different function depending on if a is an array or not. That is still occurring after preprocessing though.
If you need one that is both valid C and valid C++, the best you can do is #ifdef __cplusplus and handle it that way. Their common subset (which is mostly C89) definitely does not have something that can handle this at any stage of compilation.

Insert text into C++ code between functions

I have following requirement:
Adding text at the entry and exit point of any function.
Not altering the source code, besides the insertions described above (so no pre-processor or anything)
For example:
void fn(param-list)
{
ENTRY_TEXT (param-list)
//some code
EXIT_TEXT
}
But not only in such a simple case; it should also work in the presence of pre-processor directives!
Example:
void fn(param-list)
#ifdef __WIN__
{
ENTRY_TEXT (param-list)
//some windows code
EXIT_TEXT
}
#else
{
ENTRY_TEXT (param-list)
//some any-os code
if (condition)
{
return; //should become EXIT_TEXT
}
EXIT_TEXT
}
#endif
So my question is: Is there a proper way doing this?
I already tried working with the parsers used by compilers, but since they all rely on running a pre-processor before parsing, they are useless to me.
Some of the token-generating parsers that do not need a pre-processor are also of little use, because they build an in-memory representation of the tokens from which a completely new source file is generated, instead of just inserting the text.
One thing I am working on is trying it with FLEX (or JFlex); if this is a valid option, I would appreciate some input on it. ;-)
EDIT:
To clarify a little bit: The purpose is to allow something like a stack trace.
I want to trace every function call, and in order to follow the call hierarchy, I need to place a macro at the entry point of a function and at the exit point of a function.
This builds a function-call trace. :-)
EDIT2: Compiler-specific options are not quite suitable since we have many different compilers to use, and many that are probably not well supported by any tools out there.

Unfortunately, your idea is not only impractical (C++ is complex to parse), it's also doomed to fail.
The main issue you have is that exceptions will bypass your EXIT_TEXT macro entirely.
You have several solutions.
As has been noted, the first solution would be to use a platform dependent way of computing the stack trace. It can be somewhat imprecise, especially because of inlining: ie, small functions being inlined in their callers, they do not appear in the stack trace as no function call was generated at assembly level. On the other hand, it's widely available, does not require any surgery of the code and does not affect performance.
A second solution would be to only introduce something on entry and use RAII to do the exit work. Although this is much better than your scheme, as it automatically deals with multiple returns and exceptions, it suffers from the same issue: how to perform the insertion automatically. For this you will probably want to operate at the AST level, and modify the AST to introduce your little gem. You could do it with Clang (look up the C++11 migration tool for examples of rewrites at large) or with gcc (using plugins).
Finally, you also have manual annotations. While it may seem underpowered (and a lot of work), I would highlight that you do not leave logging to a tool... I see 3 advantages to doing it manually: you can avoid introducing this overhead in performance sensitive parts, you can retain only a "summary" of big arguments and you can customize the summary based on what's interesting for the current function.

I would suggest using LLVM libraries & Clang to get started.
You could also leverage the C++ language to simplify your process: insert a small object into the code that is constructed on entry to the function scope, and rely on the fact that it will be destroyed on exit. That should massively simplify recording the 'exit' of the function.
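A minimal sketch of such an object, assuming an in-memory trace_log() sink and the names ScopeTrace and TRACE_FUNCTION, all of which are made up for illustration (a real tool would write to its own log):

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory trace sink.
std::vector<std::string>& trace_log() {
    static std::vector<std::string> log;
    return log;
}

// Records "enter <name>" on construction and "exit <name>" on destruction,
// so every return path (and stack unwinding) is covered by one insertion.
class ScopeTrace {
public:
    explicit ScopeTrace(const char* name) : name_(name) {
        trace_log().push_back(std::string("enter ") + name_);
    }
    ~ScopeTrace() {
        trace_log().push_back(std::string("exit ") + name_);
    }
    ScopeTrace(const ScopeTrace&) = delete;
    ScopeTrace& operator=(const ScopeTrace&) = delete;

private:
    const char* name_;
};

// The only text that has to be inserted at the top of each function body.
#define TRACE_FUNCTION() ScopeTrace scope_trace_guard(__func__)

int fn(bool early) {
    TRACE_FUNCTION();
    if (early) {
        return 1;   // the exit is still recorded
    }
    return 2;
}
```

Only one insertion per function is needed, which is why this is easier to automate than matching every return statement with an EXIT_TEXT.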

This does not really answer your question; however, for your initial need, you may use the backtrace() function from execinfo.h (provided by glibc, so typically available when you are using GCC on Linux).
How to generate a stacktrace when my gcc C++ app crashes

gcc for parsing code

I would like to know how to use GCC as a library to parse C/C++/Java/Objective C/Ada code for my program.
I want to bypass preprocessing and add the prefix My to all user-written functions,
so that Print(); becomes MyPrint();. I also wish to do this with the variables.

You can look here:
http://codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/
This is a description of how to use the gcc plugin interface to parse C++ code. Other languages should be handled in the same manner.
You can also try pork from Mozilla:
https://wiki.mozilla.org/Pork
When I tried it (pork), I spent an hour or so fixing compile problems, but after that
I could write scripts like this:
rewrite SyncPrimitiveUpgrade {
type PRLock* => Mutex*
call PR_NewLock() => new Mutex()
call PR_Lock(lock) => lock->Lock()
call PR_Unlock(lock) => lock->Unlock()
call PR_DestroyLock(lock) => delete lock
}
So it finds every use of the type PRLock and replaces it with Mutex, and it also searches for calls to functions
like PR_NewLock and replaces them with "new Mutex".

You might wish to investigate the sparse C parser. It understands a lot of C (all the C used in the Linux kernel sources, which is a fairly good subset of legal ANSI-C and GNU-C extensions) and provides a few sample compiler backends to provide a lint-like static analysis tool for type checking.
While the code looks very clean and thorough, your task might be easier done via another mechanism -- the example.c included with the sparse source that demonstrates a compiler is 1955 lines long.

For C, you cannot do that reliably. If you skip preprocessing you will -- in general -- not have valid C code to be parsed. E.g.
#define FOO
#define BAR
#define BAZ
FOO void BAR qux BAZ(void) { }
How is the parser supposed to recognize this as a function definition of qux without doing the preprocessing?

First, GCC is not a library, and is not structured to be one (in contrast to LLVM).
Why (i.e. what for) do you want to parse C, C++, Ada source code?
I would consider (assuming a GCC 4.6 version) extending GCC either thru plugins written in C, or preferably using MELT, a high level domain specific language to extend GCC (disclaimer: I am the main author of MELT).
But using GCC as a library is not realistic at all.
I really think that for what you want to achieve, MELT is the right tool. However, it is poorly documented. Please use the gcc-melt@googlegroups.com list to ask questions.
And be aware that extending GCC does take some amount of work (more than a week perhaps), because you need to partly understand the GCC internal representations.

Our DMS Software Reengineering Toolkit can parse C, C++, Java and Ada code (not Objective C at this time) in a wide variety of dialects and carry out transformations on the code. DMS's C and C++ front ends include a preprocessor, so you can cause preprocessing before you parse.
I probably don't understand what you want to do, because it seems strange to rename every function and (global?) variable with a "My...." prefix. But you could do that with some DMS rules. Here is a rough sketch of renaming user functions for GCC3:
domain C~GCC3.
rule rewrite_function_names(t: type_designator, i: IDENTIFIER, p: parameter_list, s: statements):
function_header->functionheader
"\t \i(\p) { \s } " -> "\t \renamed\(\i\) (\p) { \s }" ;
together with a helper function "renamed" that takes a tree node containing an identifier, and returns a tree node with the renamed identifier.
Because DMS patterns only match against the parse trees, you won't get any false positives.
You'd need some additional patterns to handle the various syntax cases within each language (e.g., for C, the "void" return type, because "void" isn't a type designator in the syntax, and global variable declarations), and different rules for different languages (Ada's syntax is not the same as that of C).
This might seem like a big hammer for your task, but if you really insist on doing this for a variety of languages in a reliable way, it seems hard to avoid the problem of getting decent parsers for all those languages. (And if you are really going to do this for all these languages, DMS can be taught to handle Objective C the same way we have taught it to handle the other languages.)
Your alternative is some kind of string hacking solution, which might work 95% of the time. If you can live with that, then Perl or something similar is likely your answer.

Forget about GCC; it's made as a compiler's parser, not an analysis parser. You'd do way better using something like libclang, a C interface to clang, which can process both C and C++.

How To Extract Function Name From Main() Function Of C Source

I just want to ask your ideas regarding this matter. For a certain important reason, I must extract/acquire all function names of functions that were called inside a "main()" function of a C source file (ex: main.c).
Example source code:
int main()
{
int a = functionA(); // functionA must be extracted
int b = functionB(); // functionB must be extracted
}
As you know, the only thing that I can use as a marker/sign to identify these function calls is their parentheses "()". I've already considered several factors in implementing this function name extraction. These are:
1. functions may have parameters. Ex: functionA(100)
2. Loop operators. Ex: while()
3. Other operators. Ex: if(), else if()
4. Other operator between function calls with no spaces. Ex: functionA()+functionB()
As of this moment I know what you're saying, this is a pain in the $$$... So please share your thoughts and ideas... and bear with me on this one...
Note: this is in C++ language...

You can write a Small C++ parser by combining FLEX (or LEX) and BISON (or YACC).
Take C++'s grammar
Generate a C++ program parser with the mentioned tools
Make that program count the function calls you are mentioning
Maybe a little bit too complicated for what you need to do, but it should certainly work. And LEX/YACC are amazing tools!

One option is to write your own C tokenizer (simple: just be careful enough to skip over strings, character constants and comments), and to write a simple parser, which counts the number of {s open, and finds instances of identifier + ( within. However, this won't be 100% correct. The disadvantage of this option is that it's cumbersome to implement preprocessor directives (e.g. #include and #define): there can be a function called from a macro (e.g. getchar) defined in an #include file.
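A rough sketch of that tokenizer idea, under the stated limitations: called_names is a made-up helper, it expects to be given just the body of main() (brace counting to locate that body is left out), it knows nothing about macros or #include files, and it will also report things like casts-by-function-style or function pointer calls by variable name.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Returns identifiers directly followed by '(' in `code`, skipping comments,
// string/character literals, and a few keywords that also take parentheses.
std::vector<std::string> called_names(const std::string& code) {
    static const std::set<std::string> keywords = {
        "if", "while", "for", "switch", "return", "sizeof"};
    std::vector<std::string> names;
    const std::size_t n = code.size();
    std::size_t i = 0;
    while (i < n) {
        const char c = code[i];
        if (c == '/' && i + 1 < n && code[i + 1] == '/') {         // line comment
            while (i < n && code[i] != '\n') ++i;
        } else if (c == '/' && i + 1 < n && code[i + 1] == '*') {  // block comment
            i += 2;
            while (i + 1 < n && !(code[i] == '*' && code[i + 1] == '/')) ++i;
            i = (i + 1 < n) ? i + 2 : n;
        } else if (c == '"' || c == '\'') {                        // literal
            const char quote = c;
            ++i;
            while (i < n && code[i] != quote) {
                if (code[i] == '\\') ++i;  // skip the escaped character
                ++i;
            }
            ++i;                           // closing quote
        } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            const std::size_t start = i;
            while (i < n && (std::isalnum(static_cast<unsigned char>(code[i])) ||
                             code[i] == '_')) {
                ++i;
            }
            const std::string word = code.substr(start, i - start);
            std::size_t j = i;             // look past whitespace for '('
            while (j < n && std::isspace(static_cast<unsigned char>(code[j]))) ++j;
            if (j < n && code[j] == '(' && keywords.count(word) == 0) {
                names.push_back(word);
            }
        } else {
            ++i;
        }
    }
    return names;
}
```

Names are returned in order of appearance and with duplicates, so a caller that only wants the set of called functions would deduplicate the result.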
An option that works 100% of the time is compiling your .c file to an assembly file, e.g. gcc -S file.c, and finding the call instructions in the resulting file.s. A similar option is compiling your .c file to an object file, e.g. gcc -c file.c, generating a disassembly dump with objdump -d file.o, and searching for call instructions.
Another option is finding a parser using Clang / LLVM.

GNU cflow might be helpful.
