C++ parser or static code analysis - c++

I'm looking for free software, a tool, a library, or anything else that can analyze C++ code.
As far as I know, static code analysis tools like Cppcheck are not helpful to me because I cannot define my own rules or output. A library that gives me an AST (Abstract Syntax Tree) of a C++ file would be best, I guess.
My goal is to program a command line tool which generates an output containing something like:
Test.cpp:
The file contains 42 global Integers.
The Class Test has the following attributes:
String name,
Int size.
The Class Test contains the following global functions:
void Test(),
int getTestSize(),
String renameTest(String newName).

You can use clang and the existing analyzer or implement your own analyzer on top of the provided APIs.

As David suggests, Clang is a good choice. You just have to implement your own ASTConsumer; you can take the existing Clang ASTConsumers, such as ASTPrinter or ASTDumpXML, as examples.

Related

How can I find all places a given member function or ctor is called in g++ code?

I am trying to find all places in a large and old code base where certain constructors or functions are called. Specifically, these are certain constructors and member functions in the std::string class (that is, basic_string<char>). For example, suppose there is a line of code:
std::string foo(fiddle->faddle(k, 9).snark);
In this example, it is not obvious looking at this that snark may be a char *, which is what I'm interested in.
Attempts To Solve This So Far
I've looked into some of the dump features of gcc, and generated some of them, but I haven't been able to find any that tell me that the given line of code will generate a call to the string constructor taking a const char *. I've also compiled some code with -S to save the generated equivalent assembly code. But this suffers from two problems: the function names are "mangled," so it's impossible to know what is being called in C++ terms; and there are no line numbers of any sort, so even finding the equivalent place in the source file would be tough.
Motivation and Background
In my project, we're porting a large, old code base from HP-UX (and their aCC C++ compiler) to RedHat Linux and gcc/g++ v.4.8.5. The HP tool chain allowed one to initialize a string with a NULL pointer, treating it as an empty string. The code generated by the GNU tools fails with some flavor of a null dereference error. So we need to find all of the potential cases of this, and remedy them. (For example, by adding code to check for NULL and using a pointer to a "" string instead.)
So if anyone out there has had to deal with the base problem and can offer other suggestions, those, too, would be welcomed.
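The remedy mentioned above (check for NULL and substitute a pointer to a "" string) could be wrapped in one small helper so that each call site changes in only one place. This is a minimal sketch, not part of the original code base; the names safe_cstr and safe_cstr_example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a null pointer to an empty C string, so that the
// std::string constructor taking const char* never receives NULL.
inline const char* safe_cstr(const char* p) {
    return p != NULL ? p : "";
}

// Usage sketch: a call site like the one in the question, guarded by the helper.
inline void safe_cstr_example() {
    const char* snark = NULL;            // the case the HP tool chain tolerated
    std::string foo(safe_cstr(snark));   // well-defined: constructs an empty string
    assert(foo.empty());
}
```

The helper only removes the crash once a call site is known; finding the call sites is still the actual problem.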
Have you considered using static analysis?
Clang has one, the Clang Static Analyzer, which is extensible.
You can write a custom plugin that checks for this particular behavior by implementing a Clang AST visitor that looks for string variable declarations and checks whether they are initialized from a null pointer.
There is a manual for that here.
See also: https://github.com/facebook/facebook-clang-plugins/blob/master/analyzer/DanglingDelegateFactFinder.cpp
First I'd create a header like this:
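#include <string>

// Drop-in replacement for std::string that refuses construction from const char*.
class dbg_string : public std::string {
public:
    using std::string::string;           // inherit the std::string constructors
    dbg_string(const char*) = delete;    // every const char* construction becomes a compile error
};

// Redirect uses of "string" to the checking class.
// Note: qualified uses are rewritten too, so std::string becomes std::dbg_string.
#define string dbg_string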
Then modify your makefile and add "-include dbg_string.h" to the compiler flags to force-include the header in each source file without modifying the sources.
You could also check how NULL is defined on your platform and add a specific overload for it (e.g. dbg_string(int)).
You can try CppDepend and its CQLinq, a powerful code query language, to detect where certain constructors/methods/fields/types are used.
from m in Methods where m.IsUsing ("CClassView.CClassView()") select new { m, m.NbLinesOfCode }

parsing a C structure in C++

I'm looking for a way to parse a C structure in order to get the name and the type of its variables.
For example I have a structure like this:
struct MyStruct {
    int anInt ;
    float aFloat ;
};
I need to get the types int and float and the 2 strings "anInt" and "aFloat".
Afterwards I have to use these values in another function:
addValue<int>("anInt") ;
addValue<float>("aFloat") ;
Do you know how to do this automatically, presumably at compile time?
Thanks.
You probably cannot do that with standard C++11 template code.
You could consider customizing your C++ compiler to get such information. For example, if compiling your code with a recent GCC, you might consider customizing it with your MELT extension (MELT is a domain specific language, implemented by a plugin, to customize GCC). You'll need to understand the details of GCC internal representation (so such an extension might take a week or more of your time).
The Qt MOC facility might perhaps be useful or inspirational. It parses a limited form of class declarations.
Alternatively you might consider generating the C++ struct or class representation from some other input; e.g. change your build procedure, perhaps your Makefile, to generate some .h header file (and perhaps some .cc C++ translation unit) from your higher level representation.
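One lightweight variant of "generate the struct from some other input" needs no external generator: an X-macro, where the field list is written once and expanded twice by the preprocessor. This is a different technique from the ones named above, shown only as a sketch. The addValue template here is a hypothetical stand-in that merely records its arguments, and the names MY_STRUCT_FIELDS, registeredFields and registerMyStructFields are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Single source of truth: one X(type, name) entry per member.
#define MY_STRUCT_FIELDS(X) \
    X(int, anInt)           \
    X(float, aFloat)

// First expansion: the struct definition itself.
struct MyStruct {
#define DECLARE_FIELD(type, name) type name;
    MY_STRUCT_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Hypothetical stand-in for the asker's addValue<T>(name): it only records
// the field name and sizeof(T) so the effect can be inspected.
static std::vector<std::pair<std::string, std::size_t> > registeredFields;

template <typename T>
void addValue(const char* name) {
    registeredFields.push_back(std::make_pair(std::string(name), sizeof(T)));
}

// Second expansion: one addValue<type>("name") call per member.
void registerMyStructFields() {
#define REGISTER_FIELD(type, name) addValue<type>(#name);
    MY_STRUCT_FIELDS(REGISTER_FIELD)
#undef REGISTER_FIELD
}

// Usage sketch: both fields are registered in declaration order.
void registerMyStructFieldsExample() {
    registeredFields.clear();
    registerMyStructFields();
    assert(registeredFields.size() == 2);
    assert(registeredFields[0].first == "anInt");
    assert(registeredFields[1].first == "aFloat");
}
```

The price is that the struct must be declared through the macro list, so this only works for structures you control.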

Making a parser to extract function name, parameters, return type

I need to parse a C++ class file (.h) and extract the following information:
Function names
Return types
List of parameter types of each function
Assume that there is a special tag using which I can recognize if I need to parse a function or not.
For example:
#include <someHeader>
class Test
{
public:
Test();
void fun1();
// *Expose* //
void fun2();
};
So I need to parse only fun2().
I read the basic grammar here, but found it too complex to comprehend.
Q1. I can't make out how complex this task is. Can someone provide a simpler grammar for a function declaration to perform this parsing?
Q2. Is my approach right or should I consider using some library rather than reinventing?
Edit: Just to clarify, I don't have a problem with parsing itself; the problem is more one of understanding the grammar I need to parse.
A C++ header may include arbitrary C++ code. Hence, parsing the header might be as hard as parsing all kinds of C++ code.
Your task becomes easier if you can make certain assumptions about your header file. For instance, if you always have an Expose tag on the line before your function and the function declarations are always on a single line, you could first grep for those lines (case-insensitively, so that both EXPOSE and Expose match):
grep -i -A1 expose <files>
And then you could apply a regular expression to filter out the information you need.
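The two steps above (find the tagged line, then apply a regular expression to the line after it) can be sketched in C++ with std::regex. This is only an illustration under the same assumptions: the tag text is "*Expose*" as in the question, each declaration fits on one line, and the return type is separated from the name by whitespace. The names FunctionInfo, parseExposed, splitParams and trim are hypothetical, and parameters are kept as raw text rather than reduced to bare types.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

struct FunctionInfo {
    std::string returnType;
    std::string name;
    std::vector<std::string> params;  // raw parameter declarations, e.g. "int a"
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Split on commas; this breaks on commas inside template arguments.
static std::vector<std::string> splitParams(const std::string& s) {
    std::vector<std::string> out;
    std::istringstream in(s);
    std::string item;
    while (std::getline(in, item, ',')) {
        std::string t = trim(item);
        if (!t.empty()) out.push_back(t);
    }
    return out;
}

// Return the declarations that directly follow a line containing "*Expose*".
std::vector<FunctionInfo> parseExposed(const std::string& header) {
    // return type, whitespace, name, "(", parameters, ")", ";"
    static const std::regex decl(R"(\s*(.+?)\s+(\w+)\s*\(([^)]*)\)\s*;\s*)");
    std::vector<FunctionInfo> result;
    std::istringstream in(header);
    std::string line;
    bool tagged = false;
    while (std::getline(in, line)) {
        if (line.find("*Expose*") != std::string::npos) {
            tagged = true;
            continue;
        }
        if (tagged) {
            std::smatch m;
            if (std::regex_match(line, m, decl)) {
                FunctionInfo f;
                f.returnType = m[1].str();
                f.name = m[2].str();
                f.params = splitParams(m[3].str());
                result.push_back(f);
            }
            tagged = false;  // only the line right after the tag counts
        }
    }
    return result;
}

// Usage sketch on the header from the question.
void parseExposedExample() {
    std::vector<FunctionInfo> r = parseExposed(
        "class Test\n{\npublic:\nTest();\nvoid fun1();\n// *Expose* //\nvoid fun2();\n};\n");
    assert(r.size() == 1);
    assert(r[0].name == "fun2");
}
```

Anything outside those assumptions (multi-line declarations, function pointers, a declarator such as char *name() with no space before the name) is silently skipped, which is the usual argument for a real parser.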
Nevertheless, I'd recommend using existing tools. This seems to be a tutorial on how to do it with clang and Python.
GCC XML is an open source tool that emits the AST (Abstract Syntax Tree). See this other answer where I posted about the usage I made of it.
You should consider using it only if you are proficient with (or keen to learn) an XML analyzer for inspecting the AST. It's a fairly complex structure.
You will still need to 'grep' for the comments identifying your required snippets, as comments are lost in the output XML.
If you are doing this just for documentation, Doxygen could be a good bet.
Either way it may give you some pointers as to how to do this.

Using LLVM to traverse AST

Are there any helper methods to traverse the AST, basic blocks, etc. generated by the LLVM compiler for C code?
If you're trying to load a module (from a .bc file compiled from a .c file by clang -emit-llvm) and traverse its functions, basic blocks, etc., then you might want to start with the llvm::Module class. It has functions for iterating through global variables and functions. Then the llvm::Function class has functions for iterating through basic blocks. Then the llvm::BasicBlock class has functions for iterating through instructions.
Or if you'd prefer, you can traverse the AST structure created by Clang. Here's some example code: http://eli.thegreenplace.net/2012/06/08/basic-source-to-source-transformation-with-clang/.
Basically, it is impossible to do full operations on the AST in LLVM, because an LLVM pass works at the bitcode level, not on the AST. I think what you want is an AST iterator.
You could refer to Chapter 3 of Artem Dergachev's Clang Static Analyzer: A Checker Developer's Guide.
Clang now has a page for checker developers. You can find more by following the link.

gcc for parsing code

I would like to know how to use GCC as a library to parse C/C++/Java/Objective C/Ada code for my program.
I want to bypass preprocessing and add the prefix My to all user-written functions.
For example, Print(); becomes MyPrint();. I also wish to do this with the variables.
You can look here:
http://codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/
This is a description of how to use the GCC plugin interface to parse C++ code. Other languages should be handled in the same manner.
You can also try Pork from Mozilla:
https://wiki.mozilla.org/Pork
When I tried it (Pork), I spent an hour or so fixing compile problems, but then
I could write scripts like this:
rewrite SyncPrimitiveUpgrade {
type PRLock* => Mutex*
call PR_NewLock() => new Mutex()
call PR_Lock(lock) => lock->Lock()
call PR_Unlock(lock) => lock->Unlock()
call PR_DestroyLock(lock) => delete lock
}
So it finds every use of the type PRLock and replaces it with Mutex; it also searches for calls of functions
like PR_NewLock and replaces them with "new Mutex".
You might wish to investigate the sparse C parser. It understands a lot of C (all the C used in the Linux kernel sources, which is a fairly good subset of legal ANSI-C and GNU-C extensions) and provides a few sample compiler backends to provide a lint-like static analysis tool for type checking.
While the code looks very clean and thorough, your task might be easier done via another mechanism -- the example.c included with the sparse source that demonstrates a compiler is 1955 lines long.
For C, you cannot do that reliably. If you skip preprocessing you will -- in general -- not have valid C code to be parsed. E.g.
#define FOO
#define BAR
#define BAZ
FOO void BAR qux BAZ(void) { }
How is the parser supposed to recognize this as a function definition of qux without doing the preprocessing?
First, GCC is not a library, and is not structured to be one (in contrast to LLVM).
Why (i.e. what for) do you want to parse C, C++, Ada source code?
I would consider (assuming a GCC 4.6 version) extending GCC either thru plugins written in C, or preferably using MELT, a high level domain specific language to extend GCC (disclaimer: I am the main author of MELT).
But using GCC as a library is not realistic at all.
I really think that for what you want to achieve, MELT is the right tool. However, it is poorly documented. Please use the gcc-melt@googlegroups.com list to ask questions.
And be aware that extending GCC does take some amount of work (more than a week perhaps), because you need to partly understand the GCC internal representations.
Our DMS Software Reengineering Toolkit can parse C, C++, Java and Ada code (not Objective C at this time) in a wide variety of dialects and carry out transformations on the code. DMS's C and C++ front ends include a preprocessor, so you can run preprocessing before you parse.
I probably don't understand what you want to do, because it seems strange to rename every function and (global?) variable with a "My...." prefix. But you could do that with some DMS rules (a rough sketch of renames of user functions for GCC3):
domain C~GCC3.
rule rewrite_function_names(t: type_designator, i: IDENTIFIER, p: parameter_list, s: statements):
function_header->functionheader
"\t \i(\p) { \s } " -> "\t \renamed\(\i\) (\p) { \s }" ;
and a helper function "renamed" that takes a tree node containing an identifier, and returns a tree node with the renamed identifier.
Because DMS patterns only match against the parse trees, you won't get any false positives.
You'd need some additional patterns to handle various different syntax cases within each language (e.g., for C, the "void" return type, because "void" isn't a type designator in the syntax, and global variable declarations), and different rules for different languages (Ada's syntax is not the same as that of C).
This might seem like a big hammer for your task, but if you really insist on doing this for a variety of languages in a reliable way, it seems hard to avoid the problem of getting decent parsers for all those languages. (And if you are really going to do this for all these languages, DMS can be taught to handle Objective C the same way we have taught it to handle the other languages.)
Your alternative is some kind of string hacking solution, which might work 95% of the time. If you can live with that, then Perl or something similar is likely your answer.
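To make the "string hacking" alternative concrete, here is a rough sketch in C++ with std::regex rather than Perl. It prefixes every identifier that is directly followed by an opening parenthesis, except for a short keyword list. The names prefixCalls and prefixCallsExample and the keyword list are hypothetical choices for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <set>
#include <string>

// Prefix every identifier that is followed by "(" with the given prefix.
// Purely textual: it cannot tell user functions from library functions, and
// it also rewrites matches inside comments and string literals.
std::string prefixCalls(const std::string& source, const std::string& prefix) {
    // An identifier at a word boundary, with a lookahead for optional
    // whitespace and an opening parenthesis.
    static const std::regex call(R"(\b([A-Za-z_]\w*)(?=\s*\())");
    static const std::set<std::string> keywords = {
        "if", "for", "while", "switch", "return", "sizeof"};

    std::string out;
    std::string::const_iterator last = source.begin();
    for (std::sregex_iterator it(source.begin(), source.end(), call), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);        // copy the text before the match
        const std::string name = m[1].str();
        if (keywords.count(name) == 0) {
            out += prefix;                   // rename everything but keywords
        }
        out += name;
        last = m[0].second;
    }
    out.append(last, source.end());          // copy the remaining tail
    return out;
}

// Usage sketch with the example from the question.
void prefixCallsExample() {
    assert(prefixCalls("Print();", "My") == "MyPrint();");
}
```

Function-like macros, calls through function pointers, and names that already carry the prefix are all handled wrongly or not at all, which is where the missing few percent come from.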
Forget about GCC; it's made as a compiler's parser, not an analysis parser. You'd do far better using something like libclang, a C interface to Clang, which can process both C and C++.