clang libTooling: How to find which header an AST item came out of? - c++

Examples found on the web for clang tools are always run on toy examples, which are usually really trivial C programs.
I am building a tool which performs source-to-source transformations on C++ code, which is obviously a very, very challenging task, but clang is up to it.
The issue I am facing now is that the AST clang generates for any C++ code that uses the STL is enormous. For example, I have some C++ code for which clang++ -ast-dump ... | wc -l is 67,018 lines of horrifying AST gobbledygook!
99% of this is standard library stuff, which I aim to ignore in my source-to-source metaprogramming task. So, to achieve this, I want to simply filter out files. Suppose I want to look at only the class definitions in the headers of the project I'm analyzing (and ignore all the standard library headers' contents): I just need to figure out which header each of my CXXRecordDecls came from!
Can this be done?
Edit: Hopefully this is a way to go about it. Trying this out now... The important bit is that it has to tell me the header that the decls came out of, not the cpp file corresponding to the translation unit.

In my experience so far, the "source" of a given AST node is best retrieved through its source locations: every node has at least a start location, and when you print it out it will contain the header file path.
You can then use this path to decide whether the node comes from a system library or from the application code you are actually interested in examining.
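A minimal sketch of that filtering (the callback name is mine; the SourceManager calls are standard libTooling API): bind CXXRecordDecls with a matcher, then drop anything whose start location resolves to a system header.

#include "clang/AST/DeclCXX.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Basic/SourceManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace clang::ast_matchers;

class HeaderFilterCallback : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    const auto *RD = Result.Nodes.getNodeAs<CXXRecordDecl>("record");
    if (!RD)
      return;
    const SourceManager &SM = *Result.SourceManager;
    SourceLocation Loc = RD->getBeginLoc();
    if (SM.isInSystemHeader(Loc)) // skip the STL noise
      return;
    llvm::outs() << RD->getQualifiedNameAsString() << " came from "
                 << SM.getFilename(Loc) << "\n";
  }
};

// Registration, e.g. inside a ClangTool driver:
//   MatchFinder Finder;
//   HeaderFilterCallback CB;
//   Finder.addMatcher(cxxRecordDecl(isDefinition()).bind("record"), &CB);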

One route I'm looking at is to narrow matches with things like hasName() (as found here). For example:
recordDecl(hasName("MyBaseClass")) // etc.
However, your comment above about using -ast-dump is something I tried as well to get the lay of the land for my own Clang tool. I found this post to be extremely helpful. Armed with its suggestion, I used clang-check to filter to a specific class name and fed it my top-level CPP file. The output was a much more manageable few hundred lines representing the class declarations and definitions of interest.
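For reference, that filtering was presumably done with clang-check's -ast-dump-filter option, along these lines (file and class names illustrative):

clang-check -ast-dump -ast-dump-filter=MyBaseClass main.cpp --

The trailing -- just avoids the need for a compilation database.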

Related

How to process c and c++ source code to calculate metrics for static code analysis?

I am extending a software tool to calculate metrics for software projects.
The metrics are then used for static code analysis.
My task is to implement the calculation of metrics for C and C++ projects.
During development I encountered problems that led me to reset and start over with a different tool or programming language.
I will describe the process, the problems, and what I tried in order to solve them, in chronological order and as clearly as possible.
Some metrics:
Lines of Code for Classes, Structs, Unions, Functions/Methods and Sourcefiles
Method Count for Classes and Structs
Complexity for Classes, Structs and Functions/Methods
Dependencies for/between Classes and Structs
Since C++ is a difficult language to parse and writing a C++ parser on my own is out of scope, I decided to use an existing C++ parser.
So I began using libraries from the LLVM project to gather syntactic and semantic information about a source file.
LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html
First I started with LibTooling, written in C++, since it promised "full control" over the Abstract Syntax Tree (AST).
I tried the RecursiveASTVisitor and the MatchFinder approaches, without success.
LibTooling was dismissed because I couldn't retrieve context information about the surroundings of a node in the AST.
I was only able to react in a callback when a specific node in the AST was visited, but I didn't know in what context I currently was.
E.g. when I visited a CXXRecordDecl (class, struct, union) I did not know whether it was a nested record or not.
But that information is needed to calculate the lines of code for a single class.
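For what it's worth, the context stack described for the second approach below can also be implemented in LibTooling, by overriding the visitor's Traverse* methods rather than only the Visit* callbacks; a minimal sketch with illustrative names:

#include "clang/AST/DeclCXX.h"
#include "clang/AST/RecursiveASTVisitor.h"

using namespace clang;

class RecordVisitor : public RecursiveASTVisitor<RecordVisitor> {
public:
  bool TraverseCXXRecordDecl(CXXRecordDecl *D) {
    ++Depth; // entering D's subtree; records visited below are nested in D
    bool Continue =
        RecursiveASTVisitor<RecordVisitor>::TraverseCXXRecordDecl(D);
    --Depth;
    return Continue;
  }

  bool VisitCXXRecordDecl(CXXRecordDecl *D) {
    // The nesting question can also be answered directly from the AST:
    bool Nested = D->getDeclContext()->isRecord();
    (void)Nested; // e.g. attribute lines of code to the enclosing class
    return true;
  }

private:
  unsigned Depth = 0;
};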
The second approach was using the LibClang interface via its Python bindings.
With the LibClang interface I was able to traverse the AST node by node recursively and store the needed context information on a stack.
Here I encountered a general problem with LibClang:
Before creating the AST for a file, the preprocessor is run and resolves all preprocessor directives, just as it is supposed to.
This is good, because if the preprocessor can't resolve all the include directives, the resulting AST will be incomplete.
This is very bad, because I won't be able to provide all the include files or directories for an arbitrary C++ project.
And this is bad because code surrounded by conditional preprocessor directives is missing from the AST depending on whether a preprocessor variable is defined or not. Parsing the same file multiple times with different combinations of defined and undefined preprocessor variables is out of scope.
This led to the third and current attempt: using a C++ parser generated by ANTLR from a C++14 grammar.
No preprocessor is run before the parser.
This is good because the full source code is parsed and preprocessor directives are simply ignored.
The bad thing is that the parser does not seem to be robust: it fails on code that compiles fine, producing a broken AST. So this solution is not sufficient either.
My questions are:
Is there an option to deactivate the preprocessor before parsing a C/C++ source or header file with libclang, so that the source code is untouched and the AST is complete and detailed?
Is there a way to parse a C/C++ source file without providing all the necessary include directories, but still get a detailed AST?
Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?
If you think this is not the right place to ask such questions feel free to redirect me to another place.
To answer your last question:
Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?
Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source or to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus and Denys Poshyvanyk. You can still collect information such as LOC and the number of methods with this approach, without needing a full parser (a toy sketch follows the drawbacks below).
This approach has drawbacks (as does any approach):
It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source; the latter is an easy step. Option 1) might even be OK, because the comments too contain information about the code, which may help determine which modules are most closely coupled, etc.
It will lump local variables, method names, parameter names, etc. all into the "bag of words" you are working with.
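To make the "treat it as text" idea concrete, here is a deliberately naive toy sketch (entirely illustrative; it only strips // line comments and ignores strings and block comments):

#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <string>

int main(int argc, char **argv) {
  if (argc < 2)
    return 1;
  std::ifstream In(argv[1]);
  std::map<std::string, int> Bag; // the "bag of words"
  std::regex Ident("[A-Za-z_][A-Za-z0-9_]*");
  std::string Line;
  while (std::getline(In, Line)) {
    Line = Line.substr(0, Line.find("//")); // drop line comments
    for (std::sregex_iterator It(Line.begin(), Line.end(), Ident), End;
         It != End; ++It)
      ++Bag[It->str()];
  }
  for (const auto &Entry : Bag)
    std::cout << Entry.first << " " << Entry.second << "\n";
}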

Ignore missing headers with clang AST parser

I'm on Windows, using MSVC to compile my project, but I need clang for its neat AST parser, which allows me to write a little code generator.
The problem is that clang cannot parse MSVC headers (a well-known and understandable problem).
I tried two options:
If I include the MSVC header folders, parsing the built-in headers included in my code ends up hitting a fatal error at some point, preventing me from parsing the parts I want correctly.
What I did before was simply not provide any built-in headers and forward declare the types I needed. That worked fine, and somehow it doesn't anymore with the latest Clang. I don't really know whether the parser policy on missing headers changed, but now it causes complete failure every time something like <string> is included, and not much gets parsed.
I am using the Python bindings (libclang), but I would consider switching to the C/C++ API if a solution exists there.
Is there any way I can alter this behavior and make clang continue parsing even when some headers are not found?
Use SetSuppressIncludeNotFoundError. Took me an hour to find! You can imagine how glad I was to find it!
https://clang.llvm.org/doxygen/classclang_1_1Preprocessor.html#ac7bafe67fc32e41460855b39d20ff6af
One way to ignore the errors due to missing headers is to set SetSuppressIncludeNotFoundError to true in your definition of ASTFrontendAction. An example is given below.
class CustomFrontendAction : public clang::ASTFrontendAction // class name illustrative
{
public:
  virtual std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
      clang::CompilerInstance &Compiler, llvm::StringRef InFile)
  {
    // Keep parsing even when an #include cannot be resolved.
    Compiler.getPreprocessor().SetSuppressIncludeNotFoundError(true);
    return std::unique_ptr<clang::ASTConsumer>(
        new CustomASTConsumer(&Compiler.getASTContext()));
  }
};
For a complete example using ASTFrontendAction, please visit at https://clang.llvm.org/docs/RAVFrontendAction.html
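With a recent libTooling, a minimal driver for the action above might look like this (CustomFrontendAction and CustomASTConsumer being the assumed names from the snippet above):

#include "clang/Tooling/Tooling.h"
#include <memory>

int main() {
  // A snippet with an unresolvable include; parsing should continue anyway.
  const char *Code = "#include \"does_not_exist.h\"\nint x;\n";
  clang::tooling::runToolOnCode(std::make_unique<CustomFrontendAction>(),
                                Code);
}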
So you want to process C++ code that uses MS headers, and you want access to ASTs so that you can generate code. And Clang won't handle MS headers.
So Clang can't be the answer unless it gets a radical upgrade.
You asked for "any solution that can make this work".
Our DMS Software Reengineering Toolkit with its C++14 Front End can do this.
DMS provides general parsing, AST construction/inspection/transformation/generation, and inverse parsing (conversion of ASTs back into compilable code), parameterized by language definitions.
The C++ front end provides a full C++14 parser, preprocessor handling, AST construction, and full name and type resolution. It has been tested with GCC and MS VS 2013 header files; we're testing with 2015 header files now.
(It also handles MS VS 2013 syntax.)
It handles the tough parsing cases completely, including C++'s famous "most vexing parse". You can see parse trees at "get human readable AST from c++ code".
DMS does not provide Python bindings, nor a direct C++ interface. Rather, it is a standalone tool designed to support the construction of custom tools (e.g., your "little code generator"). It has its own very extensive set of internal APIs, coded in the metaprogramming language PARLANSE, which is LISP-like. Other aspects of DMS are managed by using DSLs for lexers, grammars, and transformations. See below.
A word of caution: any tool that can process C++ is guaranteed to be complex. DMS is correspondingly complex, and it takes a while to learn to use it, so you're not going to get instant answers. The good news is that some things are easier to do. Your code generation problem is likely "read a skeleton file, and then replace key entries in it with problem-specific code". If that's the case, a DMS tool with the following code (simplified for presentation here) will likely do the trick:
...
(= myAST (Registry:ParseFile (. filename) (. `CppVisualStudio2013') ...)
(Registry:ApplyTransforms myAST (. `MyTransforms.rsl'))
(Registry:PrettyPrint myAST (concat filename `.modified'))
...
with a transforms file MyTransforms.rsl containing source-to-source surface-syntax (e.g., C++ syntax) transformation rules of the conceptual form
rule rulename if_you_see THIS then replace_by ("-->") THAT
An actual C++ rule might look like this (I'm making it up because I don't know your actual code generation goals):
rule replace_abstraction(s: STRING_LITERAL):
" abstraction_place_holder(\s) "
-> " my_DSL_library(\s,17); "
The ApplyTransforms call above will apply all the rules in this file until none apply any further.
Writing surface-syntax transforms, where you can do it, is way easier than making calls on a procedure library (which, like Clang, DMS offers) that hacks at the tree.
You can write more complex metaprograms using PARLANSE to apply some rules in one place, other rules someplace else, and you can mix source-to-source transforms with procedural transforms that hack directly at the tree if you want.
If you want more details on what transforms look like, ask and I'll provide a link.

building c++ from header-include information

With Haskell I can "ghc --make Main.hs" and with Ada I can just "gnatmake Main.adb" and that is it.
Isn't there anything like that for C++? Why not?
I do not want to write build scripts or makefiles for C++ projects. I have those damn #include lines there. Why isn't that information enough?
note: I vaguely remember a feature like that mentioned once in the context of Clang.
update:
It seems possible to have a C++ compiler (or to write a wrapper script) that recursively looks for included headers and expects either the source file or the object file to be in the same directory; it would compile and link everything automatically, skipping anything whose source and object files have the same timestamp. Link-time decisions would be left as a special case, necessitating a compiler flag/switch to select one of multiple source/object files for a single header, or to specify dynamic linking. E.g.: awesomecompiler Main.cpp --link-choice=DrawStuff.h-->DrawStuffGL.o.
Hence there must be another reason for using make or its alternatives. What is it?
To rephrase the question as suggested by martin:
Why can't we just get all the build-information from the header files, and a few commandline flags for special cases?
Some languages have a system whereby the "main" file specifies everything else that makes up the program as "modules" or some such. Ada certainly does; I don't know enough Haskell to comment there.
C and C++ rely on modules being compiled separately and linked at the end, and the software developer decides exactly what that process is. This has advantages, such as being able to build one module for one solution and a different module for another. That is not possible if all modules are specified by the source file (you would then have to make files appear/disappear in the filesystem instead, which of course means some other "work outside the compiler", so you end up with a makefile or some such anyway).
Say for example, we make a game, and we encapsulate all the drawing, then we can choose whether we use DirectX9, DirectX10, OpenGL or OpenGLES by simply linking with the relevant "DrawStuffDX9.o" or "DrawStuffGL.o" etc.
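Concretely (hypothetical file names, ordinary g++ invocations), the swap happens purely at link time:

g++ -c Main.cpp
g++ Main.o DrawStuffGL.o -o game_gl
g++ Main.o DrawStuffDX9.o -o game_dx9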
As always, freedom means more choice, but also a bit more work. Just like buying a ready made piece of furniture is simple, but if you want it to fit exactly to your house, floor to ceiling, you have to be lucky. A bespoke piece of furniture will cost more and require some detailed measurements, but will be a perfect fit for your home.
[gcc -MM somefile(s) will give you a rudimentary makefile for the source file(s) you specified].
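For instance, with a hypothetical Main.cpp that includes DrawStuff.h, gcc -MM prints a make-style dependency line (deliberately omitting system headers):

gcc -MM Main.cpp
Main.o: Main.cpp DrawStuff.h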

Easy way to get function prototypes?

A friend and I were discussing imaginary and real languages, and a question that came up was: if one of us wanted to generate headers for another language (perhaps D, which already has such a tool), what would be an easy and very good way to do this?
One of us suggested scanning C files and headers, ignoring function bodies and only counting the braces within them to figure out where a function ends. The counter-argument was typedefs, defines (which can contain braces, though defines were considered a trivial problem), and templates + specialization.
Another solution was to read the binaries produced: not the actual exe, but the object files the linker uses. The counter-argument there was the format and complexity; neither of us knew anything about any object format, so we couldn't estimate (we were thinking of gcc and VS C++).
What do you guys think? Which is easier? This should be backed up with reasonable logic and fact.
If someone can link to a helpful project (one that parses C files/headers and outputs prototypes, or one that reads in ELF data and displays the info in an example project), that would be useful. I tried googling but I didn't know what it would be called. I found libelf, but at the moment I can't get it to compile. I might be able to soon.
You can use the clang libraries to parse C/C++ source code and extract any information you want, in particular function prototypes.
Due to its library-based architecture it is easy to reuse the parts of clang that you need. In your case these are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than using a hand-written scanner, considering the difficulties that you mentioned (typedefs, defines, etc.).
clang can also be used as a tool to parse source code and output the AST in XML form. For example, if you have the file test.cpp:
void foo() {}

int main()
{
    foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get a file test.xml similar to the following (irrelevant parts skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think extracting this information from binaries is possible, at least for symbols with C linkage, because they don't have name mangling.
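To illustrate: for an object file defining a C++ function foo(int, double) and an extern "C" function bar, nm output looks roughly like

0000000000000000 T _Z3fooid
0000000000000010 T bar

The mangled C++ name _Z3fooid encodes the parameter types (i = int, d = double), so a prototype could be reconstructed from it, whereas the C-linkage name bar carries no type information at all.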
ctags' output is quite easy to read/parse
if you want to simply generate a binding, try SWIG
What you're talking about is compiling: the act of transforming code in one formal language to another. There's good, solid science behind this which, if followed carefully, guarantees your program halts with correct analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and define everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this. It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension or a new, generally accepted file format containing an ELF/COFF/whatever section.
This would spare the thousands of man-hours spent across the globe generating "language bindings" for dozens of languages. Dynamic languages such as Common Lisp would not need FFI libraries: it would all happen under the hood, and a shared library in that new format could be inspected automatically on load, its functions made available on the fly, without any further ado.
And all that would actually have to be stored as extra information for each (exported) function is the following (sketched as a struct below):
Generic name
Return type
Argument list
Calling convention (there are not that many in practice)
A reference to the symbol table entry referred to
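Sketched as a struct (purely illustrative; names and layout made up for this answer):

#include <cstdint>
#include <string>
#include <vector>

struct ExportedFunctionInfo {
  std::string Name;                       // generic (unmangled) name
  std::string ReturnType;
  std::vector<std::string> ArgumentTypes; // the argument list
  enum class CallingConv { C, StdCall, FastCall } CC; // few in practice
  uint64_t SymbolIndex; // reference to the symbol table entry
};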
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO, this should be part of the OS/Runtime/Loader.
Many workarounds exist, some listed here in other answers (IDL, binding generators such as SWIG, and a myriad of others, mostly ad hoc and of varying and typically insufficient quality).
The ELF guys are Unix guys and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would not have any problems, using the information to generate their beloved header files.
RUST might not have invented crates in that perfect world and you would not even care if some shared library had been written in C or C++ or Rust or Zig or Haskell - anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory there could be an infinite number of calling conventions, in practice those in active use are countably few (LLVM supports many of those actually relevant).
Another reason it has not been done yet is the danger that some committee would create a monster by trying to make it "open, flexible, ...", and no one would eventually dare to use it. DCE (the Distributed Computing Environment) gives an idea of that.

Refactoring C++ code to use forward declarations

I've got a largeish codebase that's been around for a while and I'm trying to tidy it up a bit by refactoring it. One thing I'd like to do is find all the headers where I could forward declare members, instead of including the entire header file.
This is a fairly labour-intensive process, and I'm looking for a tool that could help me figure out which headers have members that could be forward declared.
Is there a compiler setting that could emit a warning or suggestion that the following code could use a forward declaration? I'm using the following compilers icc, gcc, sun studio and HP's aCC
Is there a standalone tool that could do the same job?
#include "Foo.h"
...//more includes
class Bar {
.......
private:
Foo* m_foo;
};
Anything involving precise analysis of C++ requires essentially an entire C++ front end somewhere (otherwise you won't get answers, or they'll be wrong, and that works badly when you have "largish" applications). There aren't many practical answers available here.
Already mentioned is GCCXML, a GCC-derived package, so it has the requisite C++ front end. It produces XML, and thus will produce a LOT of output that you'll have to read back in to form "the in-memory data structure" suggested in another answer. It's unfortunate that GCCXML builds that in-memory data structure already, then exports it as XML, and forces you to build it again. Of course, you could just use GCC, which builds the in-memory data structure, but then you have to hack GCC to be what you want, and it really, really wants to be a compiler. That means you'll have a fight on your hands to bend it to your will (which explains why GCCXML exists: most people don't want that fight).
Not mentioned is the Edison Design Group (EDG) C++ front end, which builds that in-memory data structure directly. It is only a front end; you'll have to do all the analysis yourself, but your task may be simple enough that that isn't hard.
The last solution I know of is mine: the C++ FrontEnd for DMS. DMS is a foundation for building program analysis tools, and its C++ FrontEnd is a complete front end for C++ (e.g., it does everything the GCC and Edison front ends do: parsing, tree building, name/type resolution). You'll have to code your special analysis much the way you would for GCCXML and EDG, by walking over the "in memory" data structures produced by DMS.
What is really different is that DMS can then be used to actually modify your source code by updating those in-memory data structures, and to regenerate compilable code from them, including the original comments.
I'm not sure you'll find anything that does this out of the box, but one option would be to write a script using Python and the pygccxml package to do some of this analysis for you.
Basically you'd use pygccxml to build an in-memory graph of your source code, then use it to query your classes and functions to find out what they actually need to include in order to function.
So, for example, you could ask each class for its members that are pointer types; then for each of those pointer types you could work out whether a real instance (non-pointer) of the class is used in the interface, and if not, mark it as a candidate for forward declaration.
The downside is that the script would take some time to get right, so the cost might outweigh the benefit, but it would be an interesting exercise at least. You could post your code on GitHub if you got something working, and maybe others would find it useful.
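For comparison, the first step of that query is nearly a one-liner with clang's AST matchers instead of pygccxml (a hypothetical sketch; the class name and matcher binding are mine):

#include "clang/AST/Decl.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace clang::ast_matchers;

class PointerFieldReporter : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    if (const auto *FD = Result.Nodes.getNodeAs<FieldDecl>("field"))
      llvm::outs() << FD->getParent()->getName() << "::" << FD->getName()
                   << " might only need a forward declaration\n";
  }
};

// Registration: every field whose type points to a record is an initial
// forward-declaration candidate.
//   MatchFinder Finder;
//   PointerFieldReporter Reporter;
//   Finder.addMatcher(
//       fieldDecl(hasType(pointsTo(recordDecl()))).bind("field"), &Reporter);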
What you could do is call gcc with -MM. This will produce dependency files that make can read. Instead of having make use them, you could parse them (with Perl, or something) to determine which includes are needed and which could be replaced with forward declarations.