I am extending a software tool to calculate metrics for software projects.
The metrics are then used for static code analysis.
My task is to implement the calculation of metrics for C and C++ projects.
During development I encountered problems that repeatedly led to resetting and starting over with a different tool or programming language.
I will describe the process, the problems, and the things I tried to solve them, in chronological order and as accurately as possible.
Some metrics:
Lines of Code for Classes, Structs, Unions, Functions/Methods and Source Files
Method Count for Classes and Structs
Complexity for Classes, Structs and Functions/Methods
Dependencies for/between Classes and Structs
Since C++ is a difficult language to parse and writing my own C++ parser would be out of scale, I decided to use an existing parser.
Therefore I began using libraries from the LLVM project to gather syntactic and semantic information about a source file.
LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html
First I started with LibTooling, written in C++, since it promised "full control" over the Abstract Syntax Tree (AST).
I tried the RecursiveASTVisitor and the MatchFinder approaches without success.
LibTooling was dismissed because I could not retrieve context information about the surroundings of a node in the AST.
I was only able to react to a callback when a specific node in the AST was visited, but I did not know in what context I currently was.
E.g., when visiting a CXXRecordDecl (class, struct, union), I did not know whether it was a nested record or not.
But that information is needed to calculate the lines of code for a single class.
The second approach was using the LibClang interface via its Python bindings.
With the LibClang interface I was able to traverse the AST node by node recursively and store the needed context information on a stack.
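For illustration, here is a minimal sketch of that context-stack idea written against libclang's C interface (the same API the Python bindings wrap); the input file name and the printed output are just placeholders:

#include <clang-c/Index.h>
#include <cstdio>
#include <vector>

// Stack of enclosing record cursors, kept across the recursion so that at
// any node we know whether (and how deeply) it is nested in another record.
static std::vector<CXCursor> recordStack;

static CXChildVisitResult visit(CXCursor cursor, CXCursor parent,
                                CXClientData clientData) {
    (void)parent; (void)clientData;
    CXCursorKind kind = clang_getCursorKind(cursor);
    if (kind == CXCursor_ClassDecl || kind == CXCursor_StructDecl ||
        kind == CXCursor_UnionDecl) {
        CXString name = clang_getCursorSpelling(cursor);
        std::printf("record %s, nesting depth %zu\n",
                    clang_getCString(name), recordStack.size());
        clang_disposeString(name);
        recordStack.push_back(cursor);               // enter this record's context
        clang_visitChildren(cursor, visit, nullptr); // visit its children
        recordStack.pop_back();                      // and leave it again
        return CXChildVisit_Continue;                // children already visited
    }
    return CXChildVisit_Recurse;
}

int main() {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "input.cpp", nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(index);
}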
Here I encountered a general problem with LibClang:
Before the AST for a file is created, the preprocessor runs and resolves all preprocessor directives, just as it is supposed to.
This is good, because if the preprocessor cannot resolve all the include directives, the resulting AST will be incomplete.
This is very bad, because I won't be able to provide all the include files or directories for an arbitrary C++ project.
This is also bad, because code surrounded by conditional preprocessor directives is only part of the AST depending on whether a preprocessor variable is defined or not. Parsing the same file multiple times with different combinations of defined and undefined preprocessor variables is out of scope.
This led to the third and current attempt: using a C++ parser generated by ANTLR from a C++14 grammar.
No preprocessor is run before the parser.
This is good, because the full source code is parsed while preprocessor directives are ignored.
The bad thing is that the generated parser does not seem to be very robust. It fails on code that compiles, resulting in a broken AST, so this solution is not sufficient either.
My questions are:
Is there an option to deactivate the preprocessor before parsing a C/C++ source or header file with LibClang, so that the source code is untouched and the AST is complete and detailed?
Is there a way to parse a C/C++ source code file without providing all the necessary include directories, but still resulting in a detailed AST?
Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?
If you think this is not the right place to ask such questions feel free to redirect me to another place.
To answer your last question:
"Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?"
Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source, and to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus, Denys Poshyvanyk. You can still collect such information as LOC and number of methods from this approach, without needing a full parser.
This approach has drawbacks (as does any approach):
It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source, but the latter is an easy step. Parsing the comments may even be acceptable: they also contain information about the code, which may help determine which modules are more closely coupled, etc.
It will lump local variables, method names, parameter names, etc. all into the "bag of words" that you are working with.
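As a minimal sketch of the treat-it-as-text idea, here is a naive line counter that strips comments before counting. It is deliberately simplistic (it ignores corner cases such as comment markers inside string literals or several /* */ pairs on one line) and is only an illustration, not part of the cited paper:

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: loc <file>\n"; return 1; }
    std::ifstream in(argv[1]);
    int loc = 0;
    bool inBlockComment = false;
    for (std::string line; std::getline(in, line); ) {
        if (inBlockComment) {                      // still inside /* ... */
            auto end = line.find("*/");
            if (end == std::string::npos) continue;
            inBlockComment = false;
            line = line.substr(end + 2);
        }
        auto start = line.find("/*");
        if (start != std::string::npos) {
            auto end = line.find("*/", start + 2);
            if (end != std::string::npos) {        // opens and closes here
                line = line.substr(0, start) + line.substr(end + 2);
            } else {                               // runs on to later lines
                inBlockComment = true;
                line = line.substr(0, start);
            }
        }
        auto slashes = line.find("//");            // strip line comments
        if (slashes != std::string::npos)
            line = line.substr(0, slashes);
        if (line.find_first_not_of(" \t\r") != std::string::npos)
            ++loc;                                 // something real remains
    }
    std::cout << "LOC: " << loc << '\n';
}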
I'm on Windows, using MSVC to compile my project, but I need Clang for its neat AST parser, which allows me to write a little code generator.
Problem is, Clang cannot parse MSVC headers (a very well-known and understandable problem).
I tried two options:
Including the MSVC header folders: parsing the built-in headers included by my code eventually hits a fatal error at some point, preventing me from parsing the parts I want correctly.
What I did before was simply not provide any built-in headers and forward declare the types I needed. That worked fine, but somehow it doesn't anymore with the latest Clang. I don't really know if the parser policy on missing headers changed, but now it causes a complete failure every time something like <string> is included, and not much gets parsed.
I am using the python bindings (libclang), but I would consider switching to C/C++ API if there would be a solution there.
Is there any way I can alter this behavior and make Clang continue parsing even when some headers are not found?
Use SetSuppressIncludeNotFoundError. Took me an hour to find! You can imagine how glad I was to find it!
https://clang.llvm.org/doxygen/classclang_1_1Preprocessor.html#ac7bafe67fc32e41460855b39d20ff6af
One way to ignore the errors due to missing headers is to set SetSuppressIncludeNotFoundError to true in your definition of the ASTFrontendAction. An example is given below (the class names are placeholders for your own action and consumer):

#include "clang/AST/ASTConsumer.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"

class CustomFrontendAction : public clang::ASTFrontendAction
{
public:
    virtual std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
        clang::CompilerInstance &Compiler, llvm::StringRef InFile)
    {
        // Keep parsing even when an #include cannot be resolved.
        Compiler.getPreprocessor().SetSuppressIncludeNotFoundError(true);
        return std::unique_ptr<clang::ASTConsumer>(
            new CustomASTConsumer(&Compiler.getASTContext()));
    }
};
For a complete example using ASTFrontendAction, please see https://clang.llvm.org/docs/RAVFrontendAction.html
So you want to process C++ code that uses MS headers, and you want access to ASTs so that you can generate code. And Clang won't handle MS headers.
So Clang can't be the answer unless it gets a radical upgrade.
You asked for "any solution that can make this work".
Our DMS Software Reengineering Toolkit with its C++14 Front End can do this.
DMS provides general parsing, AST construction/inspection/transformation/generation, and inverse parsing (conversion of ASTs back into compilable code), parameterized by language definitions.
The C++ front end provides a full C++14 parser, preprocessor handling, AST construction, and full name and type resolution. It has been tested with GCC and MS VS 2013 header files; we're testing with 2015 header files now. (It also handles the MS VS 2013 syntax itself.)
It handles the tough parsing cases completely, including the famous C++ "most vexing parse". You can see example parse trees in my answer to "get human readable AST from c++ code".
DMS does not provide Python bindings, nor a direct C++ interface. Rather, it is a standalone tool designed to support the construction of custom tools (e.g., your "little code generator"). It has its own very extensive set of internal APIs, coded in the metaprogramming language PARLANSE, which is LISP-like. Other aspects of DMS are managed by using DSLs for lexers, grammars, and transformations. See below.
A word of caution: any tool that can process C++ is guaranteed to be complex. DMS is correspondingly complex, and it takes a while to learn to use it, so you're not going to get instant answers. The good news is that some things are easier to do. Your code generation problem is likely "read a skeleton file, and then replace key entries in it with problem-specific code". If that's the case, a DMS tool with the following code (simplified for presentation here) will likely do the trick:
...
(= myAST (Registry:ParseFile (. filename) (. `CppVisualStudio2013') ...)
(Registry:ApplyTransforms myAST (. `MyTransforms.rsl'))
(Registry:PrettyPrint myAST (concat filename `.modified'))
...
with a transforms file MyTransforms.rsl containing source-to-source surface-syntax (e.g., C++ syntax) transformation rules of the conceptual form
rule rulename if_you_see THIS then replace_by ("-->") THAT
An actual C++ rule might look like this (I am making it up because I don't know your actual code generation goals):
rule replace_abstraction(s: STRING_LITERAL):
" abstraction_place_holder(\s) "
-> " my_DSL_library(\s,17); "
The ApplyTransforms call above will apply all the rules in this file until none apply any further.
Writing surface syntax transforms, where you can do it, is way easier than making calls on a procedure library (which, like Clang, DMS offers) that hack at the tree.
You can write more complex metaprograms using PARLANSE to apply some rules in one place, other rules someplace else, and you can mix source-to-source transforms with procedural transforms that hack directly at the tree if you want.
If you want more details on what transforms look like, ask and I'll provide a link.
Examples found on the web for clang tools are always run on toy examples, which are usually all really trivial C programs.
I am building a tool which performs source-to-source transformations on C++ code, which is obviously a very, very challenging task, but clang is up to this task.
The issue I am facing now is that the AST that clang generates for any C++ code that utilizes the STL is enormous. For example I have some C++ code for which clang++ -ast-dump ... | wc -l is 67,018 lines of horrifying AST gobbledygook!
99% of this is standard library stuff, which I aim to ignore in my source-to-source metaprogramming task. So, to achieve this, I want to simply filter out files. Suppose I want to look at only the class definitions in the headers of the project that I'm analyzing (and ignore all the standard library headers' stuff): I will need to figure out which header each of my CXXRecordDecls came from!
Can this be done?
Edit: Hopefully this is a way to go about it. Trying this out now... The important bit is that it has to tell me the header that the decls came out of, not the cpp file corresponding to the translation unit.
In my experience so far, the "source" of some given AST node is best retrieved by using Locations. For example every node at least has a start location, and when you print this out it will contain the header file path.
Then it's possible to use this path to decide whether it is a system library or part of your application code that you still are interested in examining.
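For example, in a LibTooling-based tool the check might look like the following sketch (the function name is made up; isInSystemHeader and getFilename are real SourceManager calls):

#include "clang/AST/ASTContext.h"
#include "clang/AST/DeclCXX.h"
#include "clang/Basic/SourceManager.h"

// Decide whether a record declaration comes from the application code
// rather than from a system/standard library header.
bool isProjectDecl(const clang::CXXRecordDecl *D, clang::ASTContext &Ctx) {
    const clang::SourceManager &SM = Ctx.getSourceManager();
    clang::SourceLocation Loc = D->getLocation();
    if (SM.isInSystemHeader(Loc))
        return false;                  // standard library, OS headers, ...
    // The file name of the location tells you which header the decl is
    // in; you can match it against your project's directories.
    llvm::StringRef File = SM.getFilename(Loc);
    return !File.empty();
}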
One route I'm looking at is to narrow matches with things like hasName() (as found here). For example:
recordDecl(hasName("MyBaseClass")) // etc.
However, your comment above about using -ast-dump is something I tried as well to get the lay of the land with my own Clang tool. I found this post to be extremely helpful. Armed with its suggestion, I used clang-check to filter to a specific class name and fed it my top-level CPP file. The output was a much more manageable few hundred lines representing the class declarations and definitions of interest.
On Linux, what is a fast way to identify the necessary #include statements for a C++ project?
I mean, let's say someone gives you a snippet from the web but fails to provide the necessary #include statements. Is there a way to run a Linux command or compiler option to identify which functions or classes are missing and, as a bonus, to identify where on the hard drive I might find these things in a header file?
Basically you need some analyzer to parse your sources and headers and build a complete dependency graph which it spits out in the end for you to read and process further.
I'd follow john's advice on g++ and Clang for this purpose, but I highly doubt they have what it takes.
What you actually can do, at least with g++, is print out a graph of the already existing includes: use the -H option to print a tree, or -M to get a list.
I also refer you to this related topic: Tool to track #include dependencies
Not exactly what you want, but the tools mentioned there might be helpful.
I think Clang's "include-what-you-use" tool is what you want.
If, by necessary, you mean minimal (i.e., if A includes B and B includes C, then A doesn't need to include C), I don't know of a fast way.
One good approach, however, is for each cpp file to include its own header file first (after any precompiled headers). That ensures that each header file includes (directly or indirectly) all the header files it needs to define the symbols used in it.
Also, a project of reasonable size should be designed in layers, such that layer A knows about/depends on layer B, which depends on layer C, etc., but lower layers never include higher layers (i.e., C never includes anything from layer A).
In that case the includes in each cpp or hpp file should be in layer order (A, B, C). If you do this, it is fairly easy to check whether any of the layer C headers can be eliminated (comment them out temporarily), because one of the includes that comes before them has already included them. This happens quite a lot and can significantly reduce the number of #includes in each file.
Having said all of that, this is a much less critical issue than it used to be because compilers are smarter. A combination of #pragma once and precompiled headers can keep build times down without requiring that you spend a lot of time optimizing includes.
The best way I know of to find undefined identifiers in a program is just to try to compile it. Depending on exactly what compiler you’re using, you might be able simply to pipe the output of GCC or Clang into grep, looking for phrases like “undeclared identifier.”
As for determining where the symbols are defined, I would recommend as a starting point looking at Ctags to parse your system headers (best managed using a Makefile) and using the resulting tags table to look up anything grep catches from GCC.
The fastest way... that's not how you should think of it.
https://stackoverflow.com/a/18544093/2112028
I wrote a lovely (I'm quite proud :P) answer there talking about how linking works (with templates), and proving that it works, and so on; understand that first.
The goal of #include directives is to create a "translation unit" where every symbol is declared (even if not defined). There's an example in my answer where I simply copy and paste the prototype into a code file rather than use an include.
You ought not worry about the "fastest" way if you use something called "header guards" (these are mentioned briefly right at the bottom there, but not in sufficient detail). They go like this:
#ifndef WHATEVER_H
#define WHATEVER_H
/* Your code here */
#endif

So now you can include "whatever.h" AS MANY times as you like. The first time IN THE TRANSLATION UNIT will define WHATEVER_H, so for the next file that includes it (however many includes deep from the file being compiled) the header will be empty, as everything between the #ifndef and #endif will be skipped.
Hope this helps.
Also, to find unused things, use -Wextra and -Wall; GCC will tell you about unused functions, typedefs and so forth. You can use the GCC diagnostic push and pop pragmas to control this. For example, wxWidgets' header files may contain a lot of unused things, so you push the warning state onto the stack, disable the unused-warning flags, include the file, then pop the warnings stack (turning them back on), lest you get thousands of lines of warnings.
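A sketch of that pattern (the specific warning flags and the wxWidgets header are only examples):

// Temporarily silence selected warnings around a noisy third-party header.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-function"
#pragma GCC diagnostic ignored "-Wunused-variable"
#include <wx/wx.h>
#pragma GCC diagnostic pop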
I often find that the headers section of a file gets larger and larger all the time, but it never gets smaller. Throughout the life of a source file, classes may have moved and been refactored, and it's very possible that there are quite a few #includes that don't need to be there anymore. Leaving them there only prolongs the compile time and adds unnecessary compilation dependencies. Trying to figure out which are still needed can be quite tedious.
Is there some kind of tool that can detect superfluous #include directives and suggest which ones I can safely remove?
Does lint do this maybe?
Google's cppclean (links to: download, documentation) can find several categories of C++ problems, and it can now find superfluous #includes.
There's also a Clang-based tool, include-what-you-use, that can do this. include-what-you-use can even suggest forward declarations (so you don't have to #include so much) and optionally clean up your #includes for you.
Current versions of Eclipse CDT also have this functionality built in: going under the Source menu and clicking Organize Includes will alphabetize your #includes, add any headers that Eclipse thinks you're using without directly including them, and comment out any headers that it doesn't think you need. This feature isn't 100% reliable, however.
Also check out include-what-you-use, which solves a similar problem.
It's not automatic, but doxygen will produce dependency diagrams for #included files. You will have to go through them visually, but they can be very useful for getting a picture of what is using what.
The problem with detecting superfluous includes is that it can't be just a type-dependency check. A superfluous include is a file that provides nothing of value to the compilation and does not alter anything that other files depend on. There are many ways a header file can alter a compile: by defining a constant, by redefining and/or undefining a used macro, or by adding a namespace which alters the lookup of a name somewhere down the line. In order to detect items like the namespace case you need much more than a preprocessor; in fact you almost need a full compiler.
Lint is more of a style checker and certainly won't have this full capability.
I think you'll find the only way to detect a superfluous include is to remove, compile and run suites.
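To illustrate why a pure dependency check isn't enough, consider this small example (file and macro names made up): removing the include still compiles, but the program changes.

// config.h contains only:  #define BUFFER_SIZE 128
#include "config.h"

// Fallback: this file still compiles if the include above is removed,
// but the buffer then silently shrinks.
#ifndef BUFFER_SIZE
#define BUFFER_SIZE 16
#endif

char buffer[BUFFER_SIZE];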
I thought that PCLint would do this, but it has been a few years since I've looked at it. You might check it out.
I looked at this blog and the author talked a bit about configuring PCLint to find unused includes. Might be worth a look.
The CScout refactoring browser can detect superfluous include directives in C (unfortunately not C++) code. You can find a description of how it works in this journal article.
Sorry to (re-)post here, people often don't expand comments.
Check my comment to crashmstr, FlexeLint / PC-Lint will do this for you. Informational message 766. Section 11.8.1 of my manual (version 8.0) discusses this.
Also, and this is important, keep iterating until the message goes away. In other words, after removing unused headers, re-run lint, more header files might have become "unneeded" once you remove some unneeded headers. (That might sound silly, read it slowly & parse it, it makes sense.)
I've never found a full-fledged tool that accomplishes what you're asking. The closest thing I've used is IncludeManager, which graphs your header inclusion tree so you can visually spot things like headers included in only one file and circular header inclusions.
You can write a quick script that erases a single #include directive, compiles the project, and logs the name in the #include and the file it was removed from if no compilation errors occurred.
Let it run during the night, and the next day you will have a 100% correct list of include files you can remove.
Sometimes brute-force just works :-)
edit: and sometimes it doesn't :-). Here's a bit of information from the comments:
Sometimes you can remove two header files separately, but not both together. A solution is to remove the header files during the run and not bring them back. This will find a list of files you can safely remove, although there might be a solution with more files to remove which this algorithm won't find. (It's a greedy search over the space of include files to remove, so it will only find a local maximum.)
There may be subtle changes in behavior if you have some macros redefined differently depending on some #ifdefs. I think these are very rare cases, and the Unit Tests which are part of the build should catch these changes.
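A minimal sketch of the per-include test described above (the make command and the use of a zero exit status as the success criterion are assumptions about your build; it also only looks at unindented #include lines):

#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: prune-includes <file>\n"; return 1; }
    const std::string path = argv[1];

    // Read the file into memory.
    std::vector<std::string> lines;
    {
        std::ifstream in(path);
        for (std::string line; std::getline(in, line); )
            lines.push_back(line);
    }

    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (lines[i].rfind("#include", 0) != 0)    // only #include lines
            continue;
        {   // Rewrite the file with this one #include commented out.
            std::ofstream out(path);
            for (std::size_t j = 0; j < lines.size(); ++j)
                out << (j == i ? "// " + lines[j] : lines[j]) << '\n';
        }
        // If the project still builds, the include is a removal candidate.
        if (std::system("make -s > /dev/null 2>&1") == 0)
            std::cout << path << ": candidate: " << lines[i] << '\n';
    }

    // Restore the original file.
    std::ofstream out(path);
    for (const auto &line : lines)
        out << line << '\n';
}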
I've tried using Flexelint (the unix version of PC-Lint) and had somewhat mixed results. This is likely because I'm working on a very large and knotty code base. I recommend carefully examining each file that is reported as unused.
The main worry is false positives. Multiple includes of the same header are reported as an unneeded header. This is bad since Flexelint does not tell you what line the header is included on or where it was included before.
One of the ways automated tools can get this wrong:
In A.hpp:
class A {
// ...
};
In B.hpp:
#include "A.hpp
class B {
public:
A foo;
};
In C.cpp:
#include "C.hpp"
#include "B.hpp" // <-- Unneeded, but lint reports it as needed
#include "A.hpp" // <-- Needed, but lint reports it as unneeded
If you blindly follow the messages from Flexelint you'll muck up your #include dependencies. There are more pathological cases, but basically you're going to need to inspect the headers yourself for best results.
I highly recommend this article on Physical Structure and C++ from the blog Games from within. They recommend a comprehensive approach to cleaning up the #include mess:
Guidelines
Here’s a distilled set of guidelines from Lakos’ book that minimize the number of physical dependencies between files. I’ve been using them for years and I’ve always been really happy with the results.
Every cpp file includes its own header file first. [snip]
A header file must include all the header files necessary to parse it. [snip]
A header file should have the bare minimum number of header files necessary to parse it. [snip]
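A tiny sketch of the first two guidelines (file names made up): Foo.cpp includes its own header first, which forces Foo.h to be self-contained:

// Foo.h - must be parseable on its own.
#ifndef FOO_H
#define FOO_H
#include <string>   // required: the interface uses std::string

class Foo {
public:
    std::string name() const;
private:
    std::string name_;
};
#endif

// Foo.cpp - includes its own header first, so any include missing
// from Foo.h shows up here as an immediate compile error.
#include "Foo.h"

std::string Foo::name() const { return name_; }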
If you are using Eclipse CDT, you can try http://includator.com, which is free for beta testers (at the time of this writing) and automatically removes superfluous #includes or adds missing ones. For those users who have FlexeLint or PC-Lint and are using Eclipse CDT, http://linticator.com might be an option (also free for beta testing). While it uses Lint's analysis, it provides quick-fixes for automatically removing the superfluous #include statements.
This article explains a technique for removing superfluous #includes by parsing with Doxygen. It's just a Perl script, so it's quite easy to use.
CLion, the C/C++ IDE from JetBrains, detects redundant includes out-of-the-box. These are grayed-out in the editor, but there are also functions to optimise includes in the current file or whole project.
I've found that you pay for this functionality though; CLion takes a while to scan and analyse your project when first loaded.
Here is a simple brute force way of identifying superfluous header includes. It's not perfect but eliminates the "obvious" unnecessary includes. Getting rid of these goes a long way in cleaning up the code.
The scripts can be accessed directly on GitHub.
Maybe a little late, but I once found a WebKit Perl script that did just what you wanted. It'll need some adapting, I believe (I'm not well versed in Perl), but it should do the trick:
http://trac.webkit.org/browser/branches/old/safari-3-2-branch/WebKitTools/Scripts/find-extra-includes
(this is an old branch because trunk doesn't have the file anymore)
There is a free tool, Include File Dependencies Watcher, which can be integrated into Visual Studio. It shows superfluous #includes in red.
There are two types of superfluous #include files:
A header file not actually needed by the module (.c, .cpp) at all.
A header file needed by the module but included more than once, directly or indirectly.
In my experience there are two approaches that work well for detecting them:
gcc -H or cl.exe /showIncludes (solves problem 2)
In the real world, you can export CFLAGS=-H before running make, provided none of the Makefiles override the CFLAGS options. Or, as I did, you can create a cc/g++ wrapper that forcibly adds the -H option to each invocation of $(CC) and $(CXX), and prepend the wrapper's directory to the $PATH variable; your make will then use the wrapper command instead. Of course, the wrapper should invoke the real gcc compiler. This trick needs to change if your Makefiles use gcc directly instead of $(CC), $(CXX) or the implied rules.
You can also compile a single file by tweaking the command line. But if you want to clean headers for the whole project, you can capture all the output with:
make clean
make 2>&1 | tee result.txt
PC-Lint/FlexeLint (solves both problems 1 and 2)
Make sure to add the +e766 option; this warning is about unused header files.
pclint/flint -vf ...
This will cause PC-Lint to output the included header files; nested header files will be indented appropriately.
clangd is doing that for you now. Possibly clang-tidy will soon be able to do that as well.
To end this discussion: the C++ preprocessor is Turing-complete, and whether an include is superfluous is a semantic property. Hence, it follows from Rice's theorem that it is undecidable whether an include is superfluous or not. There CAN'T be a program that (always correctly) detects whether an include is superfluous.