I'd like to ask for your ideas on this. For an important reason, I must extract the names of all functions that are called inside the "main()" function of a C source file (e.g. main.c).
Example source code:
int main()
{
int a = functionA(); // functionA must be extracted
int b = functionB(); // functionB must be extracted
}
As you know, the only thing that I can use as a marker to identify these function calls is their parentheses "()". I've already considered several factors in implementing this function name extraction. These are:
1. functions may have parameters. Ex: functionA(100)
2. Loop operators. Ex: while()
3. Other operators. Ex: if(), else if()
4. Other operator between function calls with no spaces. Ex: functionA()+functionB()
As of this moment I know what you're saying, this is a pain in the $$$... So please share your thoughts and ideas... and bear with me on this one...
Note: this is in C++ language...
You can write a Small C++ parser by combining FLEX (or LEX) and BISON (or YACC).
Take C++'s grammar
Generate a C++ program parser with the mentioned tools
Make that program count the function calls you are mentioning
Maybe a little bit too complicated for what you need to do, but it should certainly work. And LEX/YACC are amazing tools!
One option is to write your own C tokenizer (simple: just be careful enough to skip over strings, character constants and comments), and to write a simple parser, which counts the number of {s open, and finds instances of identifier + ( within. However, this won't be 100% correct. The disadvantage of this option is that it's cumbersome to implement preprocessor directives (e.g. #include and #define): there can be a function called from a macro (e.g. getchar) defined in an #include file.
An option that works 100% is compiling your .c file to an assembly file, e.g. gcc -S file.c, and finding the call instructions in the resulting file.s (compile without optimization so that calls are not inlined away). A similar option is compiling your .c file to an object file, e.g. gcc -c file.c, generating a disassembly dump with objdump -d file.o, and searching for call instructions.
Another option is finding a parser using Clang / LLVM.
GNU cflow might be helpful.
I am trying to find all places in a large and old code base where certain constructors or functions are called. Specifically, these are certain constructors and member functions in the std::string class (that is, basic_string<char>). For example, suppose there is a line of code:
std::string foo(fiddle->faddle(k, 9).snark);
In this example, it is not obvious looking at this that snark may be a char *, which is what I'm interested in.
Attempts To Solve This So Far
I've looked into some of the dump features of gcc, and generated some of them, but I haven't been able to find any that tell me that the given line of code will generate a call to the string constructor taking a const char *. I've also compiled some code with -S to save the generated equivalent assembly code. But this suffers from two things: the function names are "mangled," so it's impossible to know what is being called in C++ terms; and there are no line numbers of any sort, so even finding the equivalent place in the source file would be tough.
Motivation and Background
In my project, we're porting a large, old code base from HP-UX (and their aCC C++ compiler) to RedHat Linux and gcc/g++ v.4.8.5. The HP tool chain allowed one to initialize a string with a NULL pointer, treating it as an empty string. The code generated by the GNU tools fails with some flavor of a null dereference error. So we need to find all of the potential cases of this, and remedy them. (For example, by adding code to check for NULL and using a pointer to a "" string instead.)
So if anyone out there has had to deal with the base problem and can offer other suggestions, those, too, would be welcomed.
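The remedy mentioned above (check for NULL and substitute "") could look roughly like this once a call site has been found. The helper names orEmpty and makeString are made up for this sketch and are not part of any library.

```cpp
#include <string>

// Hypothetical helper: map a null pointer to an empty C string.
inline const char* orEmpty(const char* s) {
    return s ? s : "";
}

// Hypothetical helper: build a std::string from a pointer that may be null,
// reproducing the "NULL means empty string" behavior described above.
inline std::string makeString(const char* s) {
    return std::string(orEmpty(s));
}
```

With this, the example line would become std::string foo(orEmpty(fiddle->faddle(k, 9).snark)); assuming snark really is a char *.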
Have you considered using static analysis?
Clang has one, the Clang Static Analyzer, and it is extensible.
You can write a custom plugin that checks for this particular behavior by implementing a Clang AST visitor that looks for string variable declarations and checks whether they are initialized from a null pointer.
There is a manual for that here.
See also: https://github.com/facebook/facebook-clang-plugins/blob/master/analyzer/DanglingDelegateFactFinder.cpp
First I'd create a header like this:
#include <string>
class dbg_string : public std::string {
public:
using std::string::string;
dbg_string(const char*) = delete;
};
#define string dbg_string
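Then modify your makefile and add "-include dbg_string.h" to CFLAGS to force-include it in every source file without modifying them. (Caveat: the macro also turns std::string into std::dbg_string, so for code that spells the type with the std:: qualifier the class would have to be declared inside namespace std.)
You could also check how NULL is defined on your platform and add a specific overload for it (e.g. dbg_string(int)).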
You can try CppDepend and its CQLinq, a powerful code query language, to detect where some constructors/methods/fields/types are used.
from m in Methods where m.IsUsing ("CClassView.CClassView()") select new { m, m.NbLinesOfCode }
How do I detect function definitions which are never getting called and delete them from the file and then save it?
Suppose I have only 1 CPP file as of now, which has a main() function and many other function definitions (function definition can also be inside main() ). If I were to write a program to parse this CPP file and check whether a function is getting called or not and delete if it is not getting called then what is(are) the way(s) to do it?
There are few ways that come to mind:
1. I would find out line numbers of beginning and end of main(). I can do it by maintaining a stack of opening and closing braces { and }.
2. Anything after main would be function definition. Then I can parse for function definitions. To do this I can parse it the following way:
< string >< open paren >< comma separated string(s) for arguments >< closing paren >
3. Once I have all the names of such functions as described in (2), I can make a map with its names as key and value as a bool, indicating whether a function is getting called once or not.
4. Finally parse the file once again to check for any calls for functions with their name as in this map. The function call can be from within main or from some other function. The value for the key (i.e. the function name) could be flagged according to whether a function is getting called or not.
I feel I have complicated my logic and it could be done in a smarter way. With the above logic it would be hard to find all the corner cases (there would be many). Also, there could be function pointers to make parsing logic difficult. If that's not enough, the function pointers could be typedefed too.
How do I go about designing my program? Are a map (to maintain filenames) and stack (to maintain braces) the right data structures or is there anything else more suitable to deal with it?
Note: I am not looking for any tool to do this. Nor do I want to use any library (if it exists to make things easy).
I think you should not try to build a C++ parser from scratch because, as others said in the comments, that is really hard. IMHO, you'd better start from the Clang libraries, which can do the low-level parsing for you, and work directly with the abstract syntax tree.
You could even use crange as an example of how to use them to produce a cross reference table.
Alternatively, you could directly use GNU global, because its gtags command directly generates definition and reference databases that you have to analyse.
IMHO those two ways would be simpler than creating a C++ parser from scratch.
The simplest approach for doing it yourself I can think of is:
Write a minimal parser that can identify functions. It just needs to detect the start and ending line of a function.
Programmatically comment out the first function, save to a temp file.
Try to compile the file by invoking the compiler.
Check if there are compile errors, if yes, the function is called, if not, it is unused.
Continue with the next function.
This is a comment, rather than an answer, but I post it here because it's too long for a comment space.
There are lots of issues you should consider. First of all, you should not assume that main() is a first function in a source file.
Even if it is, there should be some function declarations before main() so that the compiler can recognize their invocation in main.
Next, a function's opening and closing braces needn't be on separate lines, and they also needn't be the only characters on their lines. Generally, almost the whole C++ code can be put on a single line!
Furthermore, functions can differ in their parameters' types while having the same name (overloading), so you can't recognize which function is called if you don't parse the whole code down to the parameters' types. And even more: you will have to perform type list matching with standard conversions/casts, possibly considering inline constructor calls. Of course you should not forget default parameters. Google for resolving overloaded function call, for example see an outline here
Additionally, there may be chains of unused functions. For example if a() calls b() and b() calls c() and d(), but a() itself is not called, then all four are unused, even though there exist 'calls' to b(), c() and d().
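The chain problem above is a reachability question: a function is used only if it can be reached from main through calls. A minimal sketch, assuming the call graph has already been extracted somehow (the CallGraph alias and both function names are invented for this example, and calls through function pointers are not represented):

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Caller name -> names of the functions it calls. Every defined function
// is expected to appear as a key, even if it calls nothing.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Breadth-first search over the call graph, starting at root.
std::set<std::string> reachableFrom(const CallGraph& graph, const std::string& root) {
    std::set<std::string> seen{root};
    std::queue<std::string> pending;
    pending.push(root);
    while (!pending.empty()) {
        std::string current = pending.front();
        pending.pop();
        auto it = graph.find(current);
        if (it == graph.end()) continue;              // no recorded callees
        for (const std::string& callee : it->second)
            if (seen.insert(callee).second) pending.push(callee);
    }
    return seen;
}

// Defined functions that cannot be reached from root are unused.
std::set<std::string> unusedFunctions(const CallGraph& graph,
                                      const std::string& root = "main") {
    const std::set<std::string> reachable = reachableFrom(graph, root);
    std::set<std::string> unused;
    for (const auto& entry : graph)
        if (!reachable.count(entry.first)) unused.insert(entry.first);
    return unused;
}
```

For the a(), b(), c(), d() example, all four end up in the unused set even though b(), c() and d() each have a caller.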
There is also a possibility that functions are called through a pointer, in which case you may be unable to find a call. Example:
int (*testfun)(int) = whattotest ? TestFun1 : TestFun2; // no call
int testResult = testfun(paramToTest); // unknown function called
Finally the code can be pretty obfuscated with #defines.
Conclusion: you'll probably have to write your own C++ compiler (except the machine code generator) to achieve your goal.
This is a very rough idea and I doubt it's very efficient but maybe it can help you get started. First traverse the file once, picking out any function names (I'm not entirely sure how you would do this). But once you have those names, traverse the file again, looking for the function name anywhere in the file, inside main and other functions too. If you find more than 1 instance it means that the function is being called and should be kept.
I have a file written in the C programming language that is preprocessed using CIL. Now there are calls to a function, say foo(), in this file. I want to modify the C code in this file such that all calls to foo() are under an #ifdef guard. I want only the calls to be guarded and not the function body so that I have finer control over the calls. The calls can be inside an if condition or a while loop. The rules for the macro name: the name begins with MACRO_ and ends with the line number of the function call foo() in the original code.
This is to be automated inside a tool and I am looking for a compiler that can unparse c code for doing this.
Example:
Input source file
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
Desired output
void foo(int x){
// do something
}
int main(){
int a;
printf("doing something");
#ifdef MACRO_1
foo(a);
#endif
printf("doing something again");
#ifdef MACRO_2
foo(a);
#endif
return 0;
}
For SIMPLE source code, you can obviously do this with a simple script and some regexps in your favourite scripting language (perl, php, awk, python, etc). But it does get increasingly difficult if you start deciding to support for example function calls inside if-statements, member function calls, etc [and want to end up with output code that actually compiles to a correct program].
In that case, you need something that can read (and "understand") C or C++ and produce some intermediate form that you can then process and reissue the source code with modifications. It's far from easy to write such code, no matter where you start from. One solution may be to use Clang as a library. It has facilities to rewrite C or C++ code from its Abstract Syntax Tree (AST) form. This link shows an example of such a rewriter: http://eli.thegreenplace.net/2012/06/08/basic-source-to-source-transformation-with-clang
I'm not sure exactly what you want to do if you have code like:
if (x)
foo();
bar();
Clearly, just inserting #ifdef/#endif around the call to foo(); means that, whenever the macro is not defined, bar() becomes the body of the if and is called only when x is true, which is probably not what you wanted...
You could customize some free software compiler. If using some recent GCC you could customize it with MELT (a Lispy domain specific language to extend gcc & g++ etc....).
You probably do not want to produce idiomatic C code. It would be much simpler to customize your compiler (e.g. GCC -or perhaps Clang/LLVM ...) to have the desired behavior.
Transforming some internal compiler representation (e.g. Gimple for GCC) is a bit simpler than outputting C code. It may still mean several weeks of work (because C and C++ are quite complex languages, and compilers have quite complex internal representations).
Notice that your question does not consider what happens when foo is called inside some macro (or inside some C++ template expansion, or perhaps even some inlined function). This shows why working on the intermediate representation(s) of your compiler is worthwhile.
BTW, you might perhaps be interested in Coccinelle, a free software source-to-source transformer.
You could also in principle use Clang (to compile your C or C++ code to LLVM) then llvm-cbe (an experimental LLVM to C backend)
If the code is structured in such a way that the lines with foo calls can simply be wrapped in a guard, and more complex expressions such as bar(), foo(a) need not be handled, you could use awk like this:
awk '/^\s*foo\(/ { print "#ifdef MACRO_" NR; print; print "#endif"; next } 1' filename.c
This will
/^\s*foo\(/ { # handle lines that begin with foo( preceded
# optionally by whitespaces specially by:
print "#ifdef MACRO_" NR # printing #ifdef MACRO_linenumber before
print
print "#endif" # and #endif after the line.
next
}
1 # all other lines are printed unchanged.
Be aware that this is a dirty, dirty hack that does not attempt to parse the C code properly. There are a number of ways you can break this, among them such things as
if(something)
foo(a);
and
foo(
a
);
That would come out as
if(something)
#ifdef MACRO_2
foo(a);
#endif
and
#ifdef MACRO_1
foo(
#endif
a
);
respectively. It may work for your particular case, but it is not a general C-code handling tool.
If the task is to exclude calls of foo(int) from the code when some macro is undefined (or defined), maybe the following approach will work better:
void foo(int x){
#ifdef MACRO_foo
// do something
#endif
}
int main(){
int a;
printf("doing something");
foo(a);
printf("doing something again");
foo(a);
return 0;
}
So, you can just exclude body of function and leave function calls in the whole program.
I think you are asking CIL to do things that CIL cannot do. Since it operates on preprocessed source code, it doesn't represent preprocessor directives, so you can't "put them into CIL representation" to be regenerated. You might be able to hack the CIL implementation itself to spit out your directives when it encountered your special circumstance, but it is hard to believe that such a hack would be general in any way.
You said you were looking for a "compiler that can unparse c code to do this". If you insist on "this" as involving specifically CIL, I think you are out of luck; there's only CIL itself to do this.
If you give up on CIL and will consider a different tool, then I think I have an answer, that will do CIL like things, can retain preprocessor directives in the representation (and/or allow you to insert them according to custom rules), and can regenerate valid C source code text.
That tool is our DMS Software Reengineering Toolkit, a general purpose program transformation engine, and its C Front End. DMS parses C code into ASTs, and unparses them back to valid source code, including retaining comments.
It can be used to do source-to-source transformations using mixtures of procedure calls on its AST manipulation library, and/or surface-syntax source-to-source rewrites.
DMS will capture preprocessor directives in that AST (they are just "more syntax"!) in most cases without issues; sometimes you need to modify the source code slightly (permanently) to make preprocessor directives palatable. DMS provides symbol tables for C, and control and data flow analysis; these will need some revision to handle preprocessor conditionality.
To match what you are doing with CIL, you can ask DMS to do the preprocessing; now you end up with an AST that is preprocessor free. DMS's existing symbol tables, CF and DF machinery handle this case directly, now.
So you can carry out sophisticated operations on the AST using that additional information, in a way different from but equivalent to CIL. In addition, you can still modify the ASTs to insert preprocessor directives, which seems to be your key problem.
To achieve your specific effect of call-site specific conditionals, you can take advantage of DMS's surface syntax source-to-source transformation capability.
The following DMS transform does something like what you want:
rule wrap_function_call(i: Identifier, a:arguments ):statement -> statement
" \i(\a); "
->
" #ifdef \generate_macro_name\(\i\)
\i(\a);
#endif
"
if want_to_wrap(i);
This rule finds any syntax tree corresponding to a function call as a statement, and wraps it in a conditional. (You didn't say what you wanted to do if the function call was part of an expression; that case requires a bit more transformation but could also be handled). A custom helper function generate_macro_name manufactures the macro name using the source position information associated with that identifier AST node matched for the function name. The transformation is conditioned on another custom helper function want_to_wrap, that inspects each matched name to determine if it is one that should be wrapped.
When done transforming the code, you call DMS's prettyprinter machinery to print the AST as source text.
I would like to write a small tool that takes a C++ program (a single .cpp file), finds the "main" function and adds 2 function calls to it, one in the beginning and one in the end.
How can this be done? Can I use g++'s parsing mechanism (or any other parser)?
If you want to make it solid, use clang's libraries.
As suggested by some commenters, let me put forward my idea as an answer:
So basically, the idea is:
... original .cpp file ...
#include <yourHeader>
namespace {
SpecialClass specialClassInstance;
}
Where SpecialClass is something like:
class SpecialClass {
public:
SpecialClass() {
firstFunction();
}
~SpecialClass() {
secondFunction();
}
};
This way, you don't need to parse the C++ file. Since you are declaring a global, its constructor will run before main starts and its destructor will run after main returns.
The downside is that you don't get to know the relative order of when your global is constructed compared to others. So if you need to guarantee that firstFunction is called before any other constructor elsewhere in the entire program, you're out of luck.
I've heard the GCC parser is both hard to use and even harder to get at without invoking the whole toolchain. I would try the clang C/C++ parser (libparse), and the tutorials linked in this question.
Adding a function call at the beginning of main() and at the end of main() is a bad idea. What if someone calls return in the middle?
A better idea is to instantiate a class at the beginning of main() and let that class's destructor call the function you want called at the end. This would ensure that the function always gets called.
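A minimal sketch of that idea follows. The class name ScopeCalls, the events log and originalMain (standing in for the body of the real main) are made up for this illustration; firstFunction and secondFunction are placeholders for whatever the tool is supposed to insert.

```cpp
#include <string>
#include <vector>

// Hypothetical log so the order of calls can be observed.
std::vector<std::string> events;

void firstFunction()  { events.push_back("first"); }
void secondFunction() { events.push_back("second"); }

// Calls firstFunction on construction and secondFunction on destruction,
// so the second call happens on every path that leaves the scope.
class ScopeCalls {
public:
    ScopeCalls()  { firstFunction(); }
    ~ScopeCalls() { secondFunction(); }
    ScopeCalls(const ScopeCalls&) = delete;
    ScopeCalls& operator=(const ScopeCalls&) = delete;
};

// Stand-in for main(): the tool would only insert the first line.
int originalMain(bool leaveEarly) {
    ScopeCalls guard;                  // inserted at the very beginning
    events.push_back("body");
    if (leaveEarly) return 1;          // early return still runs ~ScopeCalls
    events.push_back("end of body");
    return 0;
}
```

The tool then only has to find the opening brace of main and insert one declaration after it. Note that the destructor does not run if the program leaves through std::exit or std::abort.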
If you have control of your main program, you can hack a script to do this, and that's by far the easiest way. Simply make sure the insertion points are obvious (odd comments, required placement of tokens, you choose) and unique (including outlawing general coding practices if you have to, to ensure the uniqueness you need is real). Then a dumb string hacking tool to read the source, find the unique markers, and insert your desired calls will work fine.
If the source of the main program comes from other sources, and you don't have control, then to do this well you need a full C++ program transformation engine. You don't want to build this yourself, as just the C++ parser is an enormous effort to get right. Others here have mentioned Clang and GCC as answers.
An alternative is our DMS Software Reengineering Toolkit with its C++ front end. DMS, using its C++ front end, can parse code (for a variety of C++ dialects), build ASTs, and carry out full name/type resolution to determine the meaning/definition/use of all symbols. It provides procedural and source-to-source transformations to enable changes to the AST, and can regenerate compilable source code complete with original comments.
So you know off the bat, this is a project I've been assigned. I'm not looking for an answer in code, but more a direction.
What I've been told to do is go through a file and count the actual lines of code while at the same time recording the function names and individual lines of code for the functions. The problem I am having is determining a way when reading from the file to determine if the line is the start of a function.
So far, I can only think of maybe having a string array of data types (int, double, char, etc.), searching for one of those in the line, then searching for the parenthesis, and then checking for the absence of a semicolon (so I know it isn't just the declaration of the function).
So my question is, is this how I should go about this, or are there other methods in which you would recommend?
The code in which I will be counting will be in C++.
Three approaches come to mind.
1. Use regular expressions. This is fairly similar to what you're thinking of. Look for lines that look like function definitions. This is fairly quick to do, but can go wrong in many ways.
char *s = "int main() {"
is not a function definition, but sure looks like one.
char
* /* eh? */
s
(
int /* comment? // */ a
)
// hello, world /* of confusion
{
is a function definition, but doesn't look like one.
Good: quick to write, can work even in the face of syntax errors; bad: can easily misfire on things that look like (or fail to look like) the "normal" case.
Variant: First run the code through, e.g., GNU indent. This will take care of some (but not all) of the misfires.
2. Use a proper lexer and parser. This is a much more thorough approach, but you may be able to re-use an open source lexer/parser (e.g., from gcc).
Good: Will be 100% accurate (will never misfire). Bad: One missing semicolon and it spews errors.
3. See if your compiler has some debug output that might help. This is a variant of (2), but using your compiler's lexer/parser instead of your own.
Your idea can work in 99% (or more) of the cases. Only a real C++ compiler can do 100%, in which case I'd compile in debug mode (g++ -g -S prog.cpp), and get the function names and line numbers from the debug information of the assembly output (prog.s).
My thoughts for the 99% solution:
Ignore comments and strings.
Document that you ignore preprocessor directives (#include, #define, #if).
Anything between a toplevel { and } is a function body, except after typedef, class, struct, union, namespace and enum.
If you have a class, struct or union, you should be looking for method bodies inside it.
The function name is sometimes tricky to find, e.g. in long (*f(int))(char);
Make sure your parser works with template functions and template classes.
For recording function names I use PCRE and the regex
"(?<=[\\s:~])(\\w+)\\s*\\([\\w\\s,<>\\[\\].=&':/*]*?\\)\\s*(const)?\\s*{"
and then filter out names like "if", "while", "do", "for", "switch". Note that the function name is (\w+), group 1.
Of course it's not a perfect solution but a good one.
I feel that doing the parsing manually is going to be quite a difficult task. I would probably use an existing tool such as RSM, redirect the output to a CSV file (assuming you are on Windows), and then parse the CSV file to gather the required information.
Find a decent SLOC count program, eg, SLOCCounter. Not only can you count SLOC, but you have something against which to compare your results. (Update: here's a long list of them.)
Interestingly, the number of non-comment semicolons in a C/C++ program is a decent SLOC count.
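The semicolon heuristic is easy to try out. A minimal sketch, assuming that semicolons inside comments, string literals and character constants should not count (the function name countCodeSemicolons is invented for this example, and raw string literals are not handled):

```cpp
#include <cstddef>
#include <string>

// Count semicolons that are outside comments, string literals and
// character constants.
int countCodeSemicolons(const std::string& src) {
    int count = 0;
    std::size_t i = 0, n = src.size();
    while (i < n) {
        if (src.compare(i, 2, "//") == 0) {
            while (i < n && src[i] != '\n') ++i;        // skip line comment
        } else if (src.compare(i, 2, "/*") == 0) {
            std::size_t end = src.find("*/", i + 2);     // skip block comment
            i = (end == std::string::npos) ? n : end + 2;
        } else if (src[i] == '"' || src[i] == '\'') {
            char quote = src[i++];
            while (i < n && src[i] != quote) {
                if (src[i] == '\\') ++i;                 // skip the escaped character
                ++i;
            }
            ++i;                                         // closing quote
        } else {
            if (src[i] == ';') ++count;
            ++i;
        }
    }
    return count;
}
```

Note that a for (;;) header contributes two semicolons, which is one reason the result is only an approximation of a line count.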
How about writing a shell script to do this? An AWK program perhaps.