I'm working on some custom C++ static code analysis for my PhD thesis. As part of an extension to the C++ type system, I want to take a C++ code base and enumerate its available functions, methods, and classes, along with their type signatures, with minimal effort (it's just a prototype). What's the best approach to doing something like this quickly and easily? Should I be hacking on Clang to spit out the information I need? Should I look at parsing header files with something like SWIG? Or is there an even easier thing I could be doing?
GCCXML, based on GCC, might be the ticket.
As I understand it, it collects and dumps all definitions but not the content of functions/methods.
Others will likely mention CLANG, which certainly parses code and must have access to the definitions of the symbols in a compilation unit. (I have no experience here).
For completeness, you should know about our DMS Software Reengineering Toolkit
with its C++ Front End. (The CLANG answers seem to say "walk the AST"). The DMS solution provides an enumerable symbol table containing all the type information. You can walk the AST, too, if you want.
Often a static analysis leads to a diagnosis, and a desire to change the source code.
DMS can apply source-to-source program transformations to carry out such changes conditioned
by the analysis.
I heartily recommend LLVM for static analysis (see also the Clang Static Analyzer).
I think your best bet is hacking on Clang and getting the AST. There is a good tutorial for that here. It's very easy to modify its syntax, and it also has a static analyzer.
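If all you need is the enumeration described in the question, libclang (Clang's stable C API) may get you there without modifying Clang at all. A minimal sketch, assuming libclang is installed (compile with -lclang); it walks the AST of one translation unit and prints every function, method, and class together with its type spelling:

#include <clang-c/Index.h>
#include <cstdio>

// Print functions, methods, and classes with their type signatures.
static CXChildVisitResult visit(CXCursor c, CXCursor, CXClientData) {
    CXCursorKind kind = clang_getCursorKind(c);
    if (kind == CXCursor_FunctionDecl || kind == CXCursor_CXXMethod ||
        kind == CXCursor_ClassDecl) {
        CXString name = clang_getCursorSpelling(c);
        CXString type = clang_getTypeSpelling(clang_getCursorType(c));
        printf("%s : %s\n", clang_getCString(name), clang_getCString(type));
        clang_disposeString(name);
        clang_disposeString(type);
    }
    return CXChildVisit_Recurse;
}

int main(int argc, char **argv) { // usage: dump <file.cpp> [compiler flags]
    if (argc < 2) return 1;
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, argv[1], argv + 2, argc - 2, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(index);
    return 0;
}

For a prototype this is usually far less effort than patching the compiler itself, and the cursor/type API also exposes inheritance and template information if you need it later.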
At my work, I use the API from a software package called "Understand 4 C++" by SciTools. I use this to write all my static analysis tools. I even wrote a .NET API to wrap their C API, which I put on CodePlex.
Once you have that, dumping all class types is easy:
ClassType[] allclasses = Database.GetAllClassTypes();
foreach (ClassType c in allclasses)
{
    Console.WriteLine("Class Name: {0}", c.NameLong);
}
Now for a little backstory about a task I had that is similar to yours.
In some years we have to keep our SDK binary backwards compatible with the previous year's SDK. In that case it's useful to compare the SDK code between releases to check for potential breaking changes. However, with a couple of hundred files and tens of thousands of lines of comments, that can be a big headache using a text diff tool like Beyond Compare or Araxis. So what I really need to look at is actual code changes: not re-ordering, not moving code up and down in the file, not adding comments, etc.
So I wrote a tool to dump out all the code.
In one text file I dump all the classes. For each class I print its inheritance tree and its member functions, both virtual and non-virtual. For each virtual function I print which parent class virtual methods it overrides (if any). I also print out its member variables.
Same goes with the structs.
In another file I print all the macros.
In another file I print all the typedefs.
Then using this I can diff these files with files from a previous release. It becomes instantly apparent what has changed from release to release. For instance, it's easy to see where a function parameter was changed from TCHAR* to const TCHAR*.
You might consider developing a GCC plugin for your purpose.
And GCC MELT is a high-level domain specific language (that I designed & implemented) for easily extending GCC.
The paper at the GROW09 workshop by Peter Collingbourne and Paul Kelly, A Compile-Time Infrastructure for GCC Using Haskell, might be relevant for your work.
I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has a llvm::Linker class that I can use during compilation to combine source files (llvm::Module), however I wanted to guarantee library compatibility via metadata version numbers or at least the publicly exposed interface between separate compilation runs.
Much of the information available on metadata in LLVM (the docs, the LLVM blog, the "IntrinsicsMetadataAttributes" PDF) suggests that it should only be used for extended information that would not break correctness when silently removed.
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
My language has concepts similar to C# including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (Much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Are version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units" which is meant to be a separately compiled object, but currently what I do is simply drag in the source file and compile it into the main llvm::Module - that's neither efficient nor flexible (can't use the linker to choose between the "Linux" and "Windows" version of some code, for example - not that I think there is 5% chance that my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form: then you don't depend on LLVM versions changing the IR format, and you can add whatever metadata you like. There won't be much difference between generating LLVM IR from this during your link phase and building the IR at compile time, then reading the IR back to figure out if the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way.]
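For the "add whatever metadata you like" part: module-level named metadata is a reasonable place for a version stamp, and it survives the bitcode round-trip through parseIRFile. A minimal sketch with the LLVM C++ API; the key name mylang.abi.version is invented for illustration:

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Stamp a module with a library version string.
void stampVersion(llvm::Module &M, llvm::StringRef Version) {
    llvm::LLVMContext &Ctx = M.getContext();
    llvm::NamedMDNode *NMD = M.getOrInsertNamedMetadata("mylang.abi.version");
    NMD->addOperand(llvm::MDNode::get(Ctx, llvm::MDString::get(Ctx, Version)));
}

// Read the stamp back after loading a module with parseIRFile.
llvm::StringRef readVersion(const llvm::Module &M) {
    if (const llvm::NamedMDNode *NMD = M.getNamedMetadata("mylang.abi.version"))
        if (NMD->getNumOperands() > 0)
            if (auto *S = llvm::dyn_cast<llvm::MDString>(
                    NMD->getOperand(0)->getOperand(0)))
                return S->getString();
    return "";
}

The "silently removable" caveat from the question still applies: a check like this can refuse to link mismatched modules, but it doesn't make the IR itself self-describing.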
Like I started out, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...
Are there good tools to automatically check C++ projects for coding conventions like e.g.:
all thrown objects have to be classes derived from std::exception (i.e. throw 42; or throw "runtime error"; would be flagged as errors, just like throw std::string("another runtime error"); or throwing any other type not derived from std::exception)
In the end I'm looking for something like Cppcheck but with a simpler way to add new checks than hacking the source code of the check tool... Maybe even something with a nice little GUI which allows you to set up the rules, write them to disk and use the rule set in an IDE like Eclipse or a continuous integration server like Jenkins.
I ran a number of static analysis tools on my current project and here are some of the key takeaways:
I used Visual Lint as a single entry point for running all these tools. VL is a plug-in for VS to run third-party static analysis tools and allows a single click route from the report to the source code. Apart from supporting a GUI for selecting between the different levels of errors reported it also provides automated background analysis (that tells you how many errors have been fixed as you go), manual analysis for a single file, color coded error displays and charting facility. The VL installer is pretty spiffy and extremely helpful when you're trying to add new static analysis tools (it even helps you download Python from ActiveState should you want to use Google cpplint and don't have Python pre-installed!). You can learn more about VL here: http://www.riverblade.co.uk/products/visual_lint/features.html
Of the numerous tools that can be run with VL, I chose three that work with native C++ code: cppcheck, Google cpplint and Inspirel Vera++. These tools have different capabilities.
Cppcheck: This is probably the most common one and we have all used it. So, I'll gloss over the details. Suffice to say that it catches errors such as using postfix increment for non-primitive types, warns about using size() when empty() should be used, scope reduction of variables, incorrect name qualification of members in class definition, incorrect initialization order of class members, missing initializations, unused variables, etc. For our codebase cppcheck reported about 6K errors. There were a few false positives (such as unused function) but these were suppressed. You can learn more about cppcheck here: http://cppcheck.sourceforge.net/manual.pdf
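For a flavour of the kinds of constructs cppcheck flags, a few illustrative snippets (the messages are paraphrased, not verbatim cppcheck output):

#include <vector>

void examples(std::vector<int> &v) {
    if (v.size() > 0) {}   // style: checking size() > 0; !v.empty() is clearer and cheaper
    for (auto it = v.begin(); it != v.end(); it++) {} // performance: prefer prefix ++it for non-primitives
    int unused = 42;       // style: variable is assigned but never used
}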
Google cpplint: This is a python based tool that checks your source for style violations. The style guide against which this validation is done can be found here: http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml (which is basically Google's C++ style guide). Cpplint produced ~ 104K errors with our codebase of which most errors are related to whitespaces (missing or extra), tabs, brace position etc. A few that are probably worth fixing are: C-style casts, missing headers.
Inspirel Vera++: This is a programmable tool for verification, analysis and transformation of C++ source code. It is similar to cpplint in functionality. A list of the available rules can be found here: http://www.inspirel.com/vera/ce/doc/rules/index.html and a similar list of available transformations can be found here: http://www.inspirel.com/vera/ce/doc/transformations/index.html. Details on how to add your own rule can be found here: http://www.inspirel.com/vera/ce/doc/tclapi.html. For our project, Vera++ found about 90K issues (for the 20-odd rules).
Coming up: Manuel Klimek, from Google, is integrating into the Clang mainline a tool that has been developed at Google for querying and transforming C++ code.
The tooling infrastructure has been laid out; it may fill up, but it is already functional. The main idea is that it allows you to define actions and will run those actions on the selected files.
Google has created a simple set of C++ classes and methods to allow querying the AST in a friendly way: the AST Matcher framework. It is still being developed and will allow very precise matching in the end.
It requires creating an executable at the moment, but the code is provided as libraries so it's not necessary to edit it, and one-off transformation tools can be dealt with in a single source file.
Example of the Matcher (found in this thread): the goal is to find calls to the constructor overload of std::string formed from the result of std::string::c_str() (with the default allocator), because it can be replaced by a simple copy instead.
ConstructorCall(
    HasDeclaration(Method(HasName(StringConstructor))),
    ArgumentCountIs(2),
    // The first argument must have the form x.c_str() or p->c_str()
    // where the method is string::c_str(). We can use the copy
    // constructor of string instead (or the compiler might share
    // the string object).
    HasArgument(
        0,
        Id("call", Call(
            Callee(Id("member", MemberExpression())),
            Callee(Method(HasName(StringCStrMethod))),
            On(Id("arg", Expression()))
        ))
    ),
    // The second argument is the alloc object which must not be
    // present explicitly.
    HasArgument(1, DefaultArgument())
)
It is very promising compared to ad-hoc tools because it uses the Clang compiler's AST library: not only is it guaranteed that, no matter how complicated the macros and templates used, your code can be analyzed as long as it compiles; it also means that intricate queries that depend on the result of overload resolution can be expressed.
This code returns actual AST nodes from within the Clang library, so the programmer can locate the matched constructs precisely in the source file and tweak them according to her needs.
There has been talk about using a textual matching specification, however it was deemed better to start with the C++ API as it would have added much complexity (and bike-shedding). I hope a Python API will emerge.
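For comparison, here is roughly what the OP's "only throw classes derived from std::exception" check looks like against the matcher API as it ships today (matcher names such as cxxThrowExpr differ from the pre-release syntax quoted above; this is a sketch, not a polished tool):

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/Error.h"

using namespace clang;
using namespace clang::ast_matchers;

static llvm::cl::OptionCategory Cat("throw-check options");

struct Report : MatchFinder::MatchCallback {
  void run(const MatchFinder::MatchResult &R) override {
    const auto *E = R.Nodes.getNodeAs<CXXThrowExpr>("bad");
    E->getBeginLoc().print(llvm::errs(), *R.SourceManager);
    llvm::errs() << ": thrown object is not derived from std::exception\n";
  }
};

int main(int argc, const char **argv) {
  auto Options = tooling::CommonOptionsParser::create(argc, argv, Cat);
  if (!Options) {
    llvm::errs() << llvm::toString(Options.takeError()) << "\n";
    return 1;
  }
  // Match `throw e;` where e's type is not a class derived from
  // std::exception; bare rethrows (`throw;`) are left alone.
  auto BadThrow = cxxThrowExpr(has(expr(unless(hasType(
      cxxRecordDecl(isSameOrDerivedFrom("::std::exception"))))))).bind("bad");
  tooling::ClangTool Tool(Options->getCompilations(),
                          Options->getSourcePathList());
  Report R;
  MatchFinder Finder;
  Finder.addMatcher(BadThrow, &R);
  return Tool.run(tooling::newFrontendActionFactory(&Finder).get());
}

The matcher expression alone can also be pasted into clang-query for interactive experimentation before committing to a standalone tool.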
The key problem with "style checkers" is that style is like art: everybody has a different opinion about what is good style and what is not. The implication is that style checkers will always need to be customized to the local art tastes.
To do this right, one needs a full C++ parser with access to symbol definitions, scoping rules and ideally various kinds of flow analyses. AFAIK, CppCheck does not provide accurate parsing or symbol table definitions, so its error checking can't be both deep and correct. I think Coverity and Fortify offer something along these lines using the EDG front end; I don't know if their tools offer access to symbol tables or data flow analyses. Clang is coming along.
You also need a way to write the style checks. I think all the tools offer access to an AST and perhaps symbol tables, and you can hand code your own checks, at the cost of knowing the AST intimately, which is hard for a big language like C++. I think Coverity and Fortify have some DSL-like scheme for specifying some of the checks.
If you want to fix code that is style incorrect, you need something that can modify the code representation. Coverity and Fortify do not offer this AFAIK. I believe Clang does offer the ability to modify the AST and regenerate code; you still have to have pretty intimate knowledge of the AST structure to code the tree hacking logic and get it right.
Our DMS Software Reengineering Toolkit and its C++ front end provide most of these capabilities. Using its C++ front end, DMS can parse ANSI C++11, GCC4 (with C++11 extensions) and MSVS 2010 (with its C++11 extensions) [update May 2021: now full C++17 and most of C++20] build ASTs and symbol tables with full type information. One can also ask for the type of an arbitrary expression AST node. At present, DMS computes control flow but not data flow for C++.
An AST API lets you procedurally code arbitrary checks, or make changes to the AST to fix problems, and then DMS's prettyprinter can regenerate complete, compilable source text with comments and preserved literal format information (e.g., radix of numbers). You have to know the AST structure to do this, just like with other tools, but it is a lot easier to know, because it is isomorphic to the DMS C++ grammar rules. The C++ front end comes with our C++ grammar. [DMS uses GLR parsers to make this possible.]
In addition, one can write patterns and transformations using DMS's Rule Specification Language, using the surface syntax of C++ itself. One might code the OP's "don't throw non-STL exceptions" check as
pattern nonSTLexception(i: IDENTIFIER):statement
= " throw \i; " if ~derived_from_STD_exception(i);
The stuff inside the (meta)quotes is C++ source code with some pattern-matching escapes, e.g., "\i" refers to the placeholder variable "i", which must be a C++ IDENTIFIER according to the rule; the entire "throw \i;" clause must be a C++ "statement" (a nonterminal in the C++ grammar). The rule itself mainly expresses syntax to be matched, but can invoke semantic checks (such as "~derived_from_STD_exception") applied to matched subtrees (in this case, whatever "\i" matched).
In writing such patterns, you don't have to know the shape of the AST; the pattern knows it, and it is automatically matched. If you've ever coded AST walkers, you will appreciate how convenient this is.
A match knows the AST node and therefore the precise position (file/line/column), which makes it easy to generate reports with precise location information.
You need to add a custom routine to DMS, "derived_from_STD_exception", to verify that the identifier tree node passed to that routine is (as the OP desired) a class derived from std::exception. This requires finding "std::exception" in the symbol table, and verifying that the symbol table entry for the identifier tree node is a class declaration that transitively inherits from other class declarations (by following symbol table links) until the std::exception symbol table entry is found.
A DMS transformation rule is a pair of patterns stating in essence, "if you see this, then replace it by that".
We've built several custom style checkers with DMS for both COBOL and C++. It's still a fair amount of work, mostly because C++ is a pretty complex language and you have to think carefully about the precise meaning of your check.
Trickier checks and those tests that start to fall into deep static analysis require access to control and data flow information. DMS computes control flow for C++ now, and we're working on data flow analysis (we've already done this for Java, IBM Enterprise COBOL and a variety of C dialects). Analysis results are tied back to the AST nodes so that one can use patterns to look for elements of the style check, and then follow the data flows to tie the elements together if needed.
When all is said and done with DMS (or indeed with any of the other tools that deal with C++ in any halfway accurate way), coding additional or complex style checks is unlikely to be "convenient". You should hope for "possible with good technical background."
Rationale: In my day-to-day C++ code development, I frequently need to answer basic questions such as who calls what in a very large C++ code base that is frequently changing. But I also need some automated way to identify exactly what the code is doing around a particular area. "grep" tools such as Cscope are useful (and I use them heavily already), but are not C++-language-aware: they don't give any way to identify the types and kinds of lexical environment of a given use of a type or function in such a way that is conducive to automation (even if said automation is limited to "read-only" operations such as code browsing and navigation, but I'm asking for much more than that below).
Question: Does there already exist an open-source C/C++-based library (native, not managed, not Microsoft- or Linux-specific) that can statically scan or analyze a large tree of C++ code, and produce result sets that answer detailed questions such as:
What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.
The result set should provide some sort of "handle". I should be able to feed that handle back to the library to perform the following types of introspection:

What is the byte offset into the file where the reference was made?

What is the reference into the abstract syntax tree (AST) of that reference, so that I can inspect surrounding code constructs? Each AST entity would also have file path, byte-offset, and type-info data associated with it, so that I could recursively walk up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:

API: The API exposed must be one of the following:

C or C++, probably "C handle" or C++-class-instance-based (and if it is, it must be generic C or C++ code, not Microsoft- or Linux-specific code constructs, unless that is needed to meet specifics of the given platform), or

Command-line standard input and standard output based.
C++ aware: Is not limited to C code, but understands C++ language constructs in minute detail, including awareness of inter-class inheritance relationships and C++ templates.
Fast: Should scan large code bases significantly faster than compiling the entire code base from scratch. This probably needs to be relaxed, but only if the "Incremental result retrieval" and "Resilient to small code changes" requirements below are fully met.
Provide result counts: I should be able to ask "How many results would you provide for some request (and no, don't send me all of the results)?" and get an answer in less than 3 seconds, versus having to retrieve all results for any given question. If it takes too long to get that answer, development time is wasted. This is coupled with the next requirement.
Incremental result retrieval: I should be able to then ask "Give me just the next N results of this request", getting a handle to the result set so that I can ask the question repeatedly, thus incrementally pulling out the results in stages. This means I should not have to wait for the entire result set before seeing some subset of all of the results, and that I can cancel the operation safely if I have seen enough results. Reason: I need to answer the question "What is the build or development impact of changing some particular function signature?"
Resilient to small code changes: If I change a header or source file, I should not have to wait for the entire code base to be rescanned; only that header or source file should be rescanned, and rescanning should be quick. E.g., don't do what cscope requires you to do, which is to rescan the entire code base for small changes. It is understood that if you change a header, scanning can take longer, since other files that include that header would have to be rescanned.
IDE Agnostic: Is text editor agnostic (don't make me use a specific text editor; I've made my choice already, thank you!).
Platform Agnostic: Is platform-agnostic (don't make me only use it on Linux or only on Windows, as I have to use both of those platforms in my daily grind and need the tool to be useful on both, as I have code sandboxes on both platforms).
Non-binary: Should not cost me anything other than the time to download and compile the library and all of its dependencies. Not trial-ware.
Actively Supported: Sending help requests to mailing lists or associated forums is likely to get a response in less than 2 days.
Network agnostic: Databases the library builds should be usable directly on a network from 32-bit and 64-bit systems, both Linux and Windows, interchangeably and at the same time, and should not embed hardcoded paths to filesystems that would otherwise "root" the database to a particular network.
Build environment agnostic: Does not require intimate knowledge of my build environment, with the notable exception of possibly requiring knowledge of compiler-supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that Clang Index is a close fit. However, I don't think that it stores data in a database.
Anyway, the Clang framework offers what you actually need to build a tool tailored to your needs, if only because of its C, C++ and Objective-C parsing / indexing capabilities. And since it's provided as a set of reusable libraries... it was crafted to be developed on!
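As a taste of what building on those libraries looks like, here is a sketch of one of the requested queries ("what functions are called by some supplied function?") on top of libclang. It re-parses a single translation unit per query, so it meets none of the speed, database, or incremental requirements, but it shows the raw material is all there:

#include <clang-c/Index.h>
#include <cstdio>
#include <cstring>

// Print the callee of every call expression inside the matched function.
static CXChildVisitResult listCalls(CXCursor c, CXCursor, CXClientData) {
    if (clang_getCursorKind(c) == CXCursor_CallExpr) {
        CXString name = clang_getCursorSpelling(clang_getCursorReferenced(c));
        printf("  calls %s\n", clang_getCString(name));
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;
}

// Find the function named in `target` and list its outgoing calls.
static CXChildVisitResult findFunction(CXCursor c, CXCursor, CXClientData target) {
    if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
        CXString name = clang_getCursorSpelling(c);
        bool match = strcmp(clang_getCString(name), (const char *)target) == 0;
        clang_disposeString(name);
        if (match) {
            clang_visitChildren(c, listCalls, nullptr);
            return CXChildVisit_Break;
        }
    }
    return CXChildVisit_Recurse;
}

int main(int argc, char **argv) { // usage: calls <file.cpp> <function-name>
    if (argc < 3) return 1;
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (!tu) return 1;
    printf("%s:\n", argv[2]);
    clang_visitChildren(clang_getTranslationUnitCursor(tu), findFunction, argv[2]);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(idx);
    return 0;
}

Persisting the results to satisfy the database, counting, and incremental requirements would be your own layer on top.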
I have to admit that I haven't used either, because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that I don't expect them to understand, but the two open source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for results of code analysis (metrics, graphs, ...) why not use a tool (instead of API) to do that? If you can, I suggest you to take a look at Understand.
It's not free (there's a trial version) but I found it very useful.
Maybe Doxygen with GraphViz could be the answer to some of your constraints, but not all; for example, Doxygen's analysis is not incremental.
A friend and I were discussing imaginary and real languages, and a question that came up was: if one of us wanted to generate headers for another language (perhaps D, which already has a tool), what would be an easy and very good way to do this?
One of us said to scan C files and headers, ignore function bodies, and only count the braces within to figure out when a function is finished. The counter to that was typedefs, defines (which can contain braces, though defines were considered a trivial problem) and templates + specialization.
Another solution was to read the binaries produced: not the actual exe, but the object files the linker uses. The counter to that was the format and complexity. Neither of us knew anything about any object format, so we couldn't estimate (we were thinking of gcc and VS C++).
What do you guys think? Which is easier? This should be backed up with reasonable logic and fact.
If someone can link to a helpful project, one that parses C files/headers and outputs the result, or an example project that reads in ELF data and displays info, that would be useful. I tried googling, but I didn't know what it would be called. I found libelf, but at this moment I couldn't get it to compile. I might be able to soon.
You can use clang libraries to parse C/C++ source code and extract any information you want in particular function prototypes.
Due to its library-based architecture it is easy to reuse the parts of clang that you need. In your case these are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than using a hand-written scanner, considering the difficulties that you mentioned (typedefs, defines, etc.).
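A sketch of that library route using Clang's C++ API (a RecursiveASTVisitor driven through libTooling, which sits on top of those frontend libraries); it prints every function prototype it sees, methods included. Treat it as a starting point for a header generator, not a finished one:

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace clang;

class ProtoVisitor : public RecursiveASTVisitor<ProtoVisitor> {
public:
  // Called for every function and method declaration in the AST.
  bool VisitFunctionDecl(FunctionDecl *FD) {
    llvm::outs() << FD->getReturnType().getAsString() << " "
                 << FD->getQualifiedNameAsString() << "(";
    for (unsigned i = 0; i < FD->getNumParams(); ++i)
      llvm::outs() << (i ? ", " : "")
                   << FD->getParamDecl(i)->getType().getAsString();
    llvm::outs() << ")\n";
    return true;
  }
};

class ProtoConsumer : public ASTConsumer {
  void HandleTranslationUnit(ASTContext &Ctx) override {
    ProtoVisitor V;
    V.TraverseDecl(Ctx.getTranslationUnitDecl());
  }
};

class ProtoAction : public ASTFrontendAction {
  std::unique_ptr<ASTConsumer>
  CreateASTConsumer(CompilerInstance &, StringRef) override {
    return std::make_unique<ProtoConsumer>();
  }
};

int main() {
  // Prints "void foo(int)" and "int main()".
  tooling::runToolOnCode(std::make_unique<ProtoAction>(),
                         "void foo(int x); int main() { return 0; }");
}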
clang can also be used as a tool to parse the source code and output AST in XML form, for example if you have the file test.cpp:
void foo() {}

int main()
{
    foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get the file test.xml, similar to the following (irrelevant parts skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think that extracting this information from binaries is possible, at least for symbols with C linkage, because they don't have name mangling and thus carry no type information.
ctags' output is quite easy to read/parse
if you want to simply generate a binding, try swig
What you're talking about is compiling: the act of transforming code in one formal language to another. There's good solid science behind this that, if followed carefully, will guarantee that your program halts with correct analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and define everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this. It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension or a new generally accepted file format, which contains an ELF/COFF/whatever section.
This would spare the (across the globe) thousands of man hours to generate "language bindings" for dozens of languages. Dynamic languages, such as Common Lisp would not need FFI libraries - as it all would happen under the hood and loading a shared library in that new format could automatically be inspected and functions in it could be made available on the fly, without any further ado.
And all that would actually have to be stored as extra information for each (exported) function is:
Generic name
Return type
Argument list
Calling convention (there are not that many in practice)
A reference to the symbol table entry being referred to.
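Sketched as a plain C struct (field names and type encoding invented for illustration), such a record would be small:

// One record per exported function in the hypothetical ELF/COFF section.
struct ExportedFunction {
    const char  *generic_name;  // unmangled, language-neutral name
    const char  *return_type;   // in some agreed-upon type encoding
    const char **arg_types;     // argument list, same encoding
    unsigned     arg_count;
    unsigned     calling_conv;  // small enum: cdecl, stdcall, sysv, ...
    unsigned     symtab_index;  // the symbol table entry referred to
};

A binding generator, or a dynamic language's loader, would only need to walk an array of these.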
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO, this should be part of the OS/Runtime/Loader.
Many workarounds exist, some listed here in other answers (IDL, binding generators, such as SWIG and a myriad of others, mostly ad hoc and of varying and typically insufficient quality).
The ELF guys are Unix guys and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would not have any problems, using the information to generate their beloved header files.
Rust might not have invented crates in that perfect world, and you would not even care if some shared library had been written in C or C++ or Rust or Zig or Haskell; anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory there could be an infinite number of calling conventions, in fact those in active use are a countable few (LLVM supports many of those actually relevant).
Another reason it has not been done yet is that there is a danger some committee would create a monster, trying to make it "open, flexible, ...", and no one would eventually dare use it. DCE (the Distributed Computing Environment) gives an idea of that.
I've got a largeish codebase that's been around for a while and I'm trying to tidy it up a bit by refactoring it. One thing I'd like to do is find all the headers where I could forward declare members, instead of including the entire header file.
This is a fairly labour intensive process and I'm looking for a tool that could help me figure out which headers have members that could be forward declared.
Is there a compiler setting that could emit a warning or suggestion that the following code could use a forward declaration? I'm using the following compilers icc, gcc, sun studio and HP's aCC
Is there a standalone tool that could do the same job?
#include "Foo.h"
... // more includes

class Bar {
    .......
private:
    Foo* m_foo;
};
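For reference, the fixed-up version such a tool should suggest: since Foo is only used through a pointer here, a forward declaration suffices and the #include can move to Bar's .cpp file:

class Foo;  // forward declaration replaces #include "Foo.h"

class Bar {
    .......
private:
    Foo* m_foo;
};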
Anything involving the precise analysis of C++ requires essentially an entire C++ front end somewhere (otherwise you won't get answers, or they'll be wrong, and that works badly when you have "largish" applications). There aren't many practical answers available here.
Already mentioned is GCCXML, a GCC-derived package, so it has the requisite C++ front end. It produces XML, thus it will produce a LOT of output that you'll have to read back in to form "the in-memory data structure" suggested in another answer. It's unfortunate that GCCXML builds that memory data structure already, then exports it as XML, and forces you to build it again. Of course, you could just use GCC, which builds the in-memory data structure, but then you have to hack GCC to be what you want, and it really, really wants to be a compiler. That means you'll have a fight on your hands to bend it to your will (which explains why GCCXML exists: most people don't want that fight).
Not mentioned is the Edison Design Group C++ (EDG) front end, which builds that in memory data structure directly. It is a front end; you'll have to do all the analysis stuff yourself but your task may be simple enough so that isn't hard.
The last solution I know is mine: the C++ FrontEnd for DMS. DMS is a foundation for building program analysis tools, and its C++ FrontEnd is a complete front end for C++ (e.g., it does everything the GCC and Edison front ends do: parsing, tree building, name/type resolution). You'll have to code your special analysis much the way you would for GCCXML and EDG, by walking over the "in memory" data structures produced by DMS.
What is really different is that DMS could then be used to actually modify your source code by updating those in memory data structures, and regenerate compilable code from those memory structures, including the original comments.
I'm not sure you'll find anything that does this out-of-the box, but one option would be to write a script using Python and the pygccxml package that could do some of this analysis for you.
Basically you'd use pygccxml to build an in-memory graph of your source code, then use it to query your classes and functions to find out what they actually need to include to function.
So for example you could ask each class, give me the members that are pointer types: then for each of those pointer types you could work out if a real instance (non pointer) of the class was used in the interface, and if not you could mark it as a candidate for forward declaration.
The downside is that the script would take some time to get right so the cost might outweigh the benefits but it would be an interesting exercise at least. You could post your code to Github if you got something that worked and maybe others would find it useful.
What you could do is call gcc with -MM. This will produce dependency files that Make can read. Instead of having make use them, you could parse them (with perl, or something) to determine which includes are needed and which could be replaced with forward declarations.
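A sketch of that parsing step, assuming one dependency rule per .d file (which is what gcc -MM emits for a single source file); it joins the backslash-continued lines of the Make rule and lists the header dependencies that are candidates for the forward-declaration audit:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv) { // usage: deps <file.d>
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::string line, all;
    while (std::getline(in, line)) {          // join "\"-continued lines
        if (!line.empty() && line.back() == '\\') line.pop_back();
        all += line + ' ';
    }
    std::string::size_type colon = all.find(':');
    if (colon == std::string::npos) return 1; // not a Make rule
    std::istringstream deps(all.substr(colon + 1));
    std::string dep;
    while (deps >> dep)                       // every prerequisite after ':'
        if (dep.size() > 2 && dep.compare(dep.size() - 2, 2, ".h") == 0)
            std::cout << dep << '\n';         // a header this file depends on
    return 0;
}

Diffing that list before and after replacing an include with a forward declaration tells you whether the dependency actually went away.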