Options for parsing/processing C++ files


So I have a need to be able to parse some relatively simple C++ files with annotations and generate additional source files from that.
As an example, I may have something like this:
//# service
struct MyService
{
int getVal() const;
};
I will need to find the //# service annotation, and get a description of the structure that follows it.
I am looking at possibly leveraging LLVM/Clang since it seems to have library support for embedding compiler/parsing functionality in third-party applications. But I'm really pretty clueless as far as parsing source code goes, so I'm not sure what exactly I would need to look for, or where to start.
I understand that ASTs are at the core of language representations, and that Clang has library support for generating an AST from source files. But comments would not really be part of an AST, right? So what would be a good way of finding the representation of a structure that follows a specific comment annotation?
I'm not too worried about handling cases where the annotation would appear in an inappropriate place as it will only be used to parse C++ files that are specifically written for this application. But of course the more robust I can make it, the better.

One way I've been doing this is annotating identifiers of:
classes
base classes
class members
enumerations
enumerators
E.g.:
class /* #ann-class */ MyClass
: /* #ann-base-class */ MyBaseClass
{
int /* #ann-member */ member_;
};
Such annotations make it easy to write a Python or Perl script that reads the header line by line and extracts each annotation and its associated identifier.
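The line-by-line extraction described above can be sketched as follows. It is written in C++ with std::regex rather than Python or Perl, only to stay in the language used on the rest of this page. The names Annotation and extract_annotations are made up for illustration, and the sketch assumes at most one annotation per line and plain, unqualified identifiers.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// One extracted pair: the annotation tag and the identifier that follows it.
// (Hypothetical type, not part of any existing tool.)
struct Annotation {
    std::string tag;         // e.g. "class", "base-class", "member"
    std::string identifier;  // e.g. "MyClass"
};

// Scans the input line by line for the form "/* #ann-<tag> */ <identifier>".
// Assumption: at most one annotation per line, unqualified identifiers only.
std::vector<Annotation> extract_annotations(std::istream& in) {
    static const std::regex pattern(
        R"(/\*\s*#ann-([a-z]+(?:-[a-z]+)*)\s*\*/\s*([A-Za-z_]\w*))");
    std::vector<Annotation> found;
    std::string line;
    while (std::getline(in, line)) {
        std::smatch match;
        if (std::regex_search(line, match, pattern)) {
            found.push_back({match[1].str(), match[2].str()});
        }
    }
    return found;
}
```

Feeding it the MyClass header above should yield three pairs: (class, MyClass), (base-class, MyBaseClass) and (member, member_). A generator would then group them by class and emit the reflect function.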
The annotation and the associated identifier make it possible to generate C++ reflection in the form of function templates that traverse objects, passing base classes and members to a functor, e.g.:
template<class Functor>
void reflect(MyClass& obj, Functor f) {
f.on_object_start(obj);
f.on_base_subobject(static_cast<MyBaseClass&>(obj));
f.on_member(obj.member_);
f.on_object_end(obj);
}
It is also handy to generate numeric ids (enumeration) for each base class and member and pass that to the functor, e.g:
f.on_base_subobject(static_cast<MyBaseClass&>(obj), BaseClassIndex<MyClass>::MyBaseClass);
f.on_member(obj.member_, MemberIndex<MyClass>::member_);
Such reflection code lets you write functors that serialize and de-serialize any object type to/from a number of different formats. Functors use function overloading and/or type deduction to treat different types appropriately.

Parsing C++ code is an extremely complex task. Leveraging a C++ compiler might help, but it could be more practical to restrict yourself to a less powerful, domain-specific format: generate both the source and the additional C++ files from a simpler representation, something like protobuf's .proto files, SOAP's WSDL, or something even simpler for your specific case.

I did some very similar work recently. My research indicated that there weren't any out-of-the-box solutions available, so I ended up hand-rolling one.
The other answers are dead-on regarding parsing C++ code. I needed something that could get ~90% of C++ code parsed correctly; I ended up using srcML. This tool takes C++ or Java source code and converts it to an XML document, which is easier to process. It keeps the comments intact. Furthermore, if you need to do a source code transformation, it comes with a reverse tool that takes the XML document and produces source code.
It works correctly in about 90% of cases, but it trips on complicated template metaprogramming and the darkest corners of C++ parsing. Fortunately, my input source code is fairly consistent in design (not a lot of C++ trickery), so it works for us.
Other items to look at include GCC-XML and Reflex (which actually uses GCC-XML). I'm not sure whether GCC-XML preserves comments, but it does preserve GCC attributes and pragmas.
One last item to look at is this blog on writing GCC plugins, written by the author of the CodeSynthesis ODB tool.
Good luck!

Related

TMP: how to write template code which converts any struct into a tuple?

Is it possible to use template meta-programming to convert any struct or class into a tuple?
For instance:
struct Foo
{
char c;
int i;
std::string s;
};
typedef std::tuple< char, int, std::string > Foo_Tuple;
It would be nice to have some template code which will generate Foo_Tuple automagically for me.
ANSWER
This is overkill for such a simple case, but for more elaborate cases (e.g. ORM, or any time you need to write a lot of boilerplate code and a mere template or macro is inadequate for the task), Boost Mirror looks like it may be extremely useful.
I have dug into Boost Mirror a bit more: the basic reflection functionality (in Mirror and Puddle) is not hard to understand, is quite easy to set up and seems to be quite extensive (it can handle many constructs, including C++11 enum classes, etc.). I find this basic functionality to be more than adequate: I can use the macros just to the extent that I want to expose my classes to reflection, so that I don't have to write boilerplate code.
The factory generators also seem to be very powerful (with the same initial macro setup, you can swap in any factory generator you like to output JSON, SOCI, or to a stream, etc.), but they have a steeper learning curve and more setup if you want to write your own.
A couple of last notes: with some minor tweaks, I was able to get it to work with C++11 on gcc 4.7.2; also, the documentation is well covered by Doxygen and there seem to be more than enough examples to get going quickly.

I don't think that there's a way to do this in C++.
I don't know of a way to enumerate the fields/types in a struct. If you could do that, I would think that constructing such a tuple would be fairly straightforward.
I believe that Boost.Fusion has a macro that helps with this, called BOOST_FUSION_ADAPT_STRUCT, but you still have to list the members manually.
The technical term for this is "reflection", and you can find lots of information about it by searching for "C++ reflection".
Here's one such article: How can I add reflection to a C++ application?
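To show what "manual" amounts to without any library, here is a minimal standard-library sketch (C++14): the member list is written once by hand in as_tuple, and the value tuple type is derived from it instead of being typed out a second time. The names as_tuple and decay_tuple are made up for the example; this is not what Boost.Fusion does internally.

```cpp
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

struct Foo {
    char c;
    int i;
    std::string s;
};

// Written by hand once per struct: the member list the language cannot enumerate.
// Returns std::tuple<char&, int&, std::string&> referring to foo's members.
inline auto as_tuple(Foo& foo) {
    return std::tie(foo.c, foo.i, foo.s);
}

// Turns a tuple of references into the corresponding tuple of values.
template <class Tuple>
struct decay_tuple;

template <class... Ts>
struct decay_tuple<std::tuple<Ts...>> {
    using type = std::tuple<std::decay_t<Ts>...>;
};

// Same type as std::tuple<char, int, std::string>, derived from as_tuple.
using Foo_Tuple = decay_tuple<decltype(as_tuple(std::declval<Foo&>()))>::type;
```

Because as_tuple returns references, it can be used both to copy a Foo into a Foo_Tuple and to assign to the members through the tuple.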

Parsing different xml messages. Versions

Say we want to parse XML messages into business objects. We split the process into two parts, namely:
- Parsing the XML messages into XML grammar objects.
- Transforming the XML objects into business objects.
The first part is done automatically, generating a grammar object for each node.
The second part is done following the XML architecture so far. Example:
If we have the XML Message(Simplified):
<Main>
<ChildA>XYZ</ChildA>
<ChildB att1="0">
<InnerChild>YUK</InnerChild>
</ChildB>
</Main>
We could find the following classes:
DecodeMain(Calls DecodeChildA and B)
DecodeChildA
DecodeChildB(Calls DecodeInnerChild)
DecodeInnerChild
The main problem arises when we need to handle versions of the same messages. Say we have a new version where only DecodeInnerChild changes (e.g. we need to add an "a" at the end of the value).
It is really important that the solution is agile enough for further versions and as clean as possible. I considered the following options:
1) Simple inheritance: Create two DecodeInnerChild classes, one for each version.
Shortcoming: I will need to create different classes for every parent class so that each calls the right one.
2) Version parameter: Add an object carrying the version as a parameter to each method. This way each method knows what to do for each version.
Shortcoming: Not clean at all. The code of different versions is mixed.
3) Inheritance + version parameter: For the nodes that change directly (like InnerChild), create two classes with a base class for the common code, and add the version as a parameter to each method. When a node calls another class to decode a child object, it picks one class or the other depending on the version parameter.
4) Some kind of executor pattern (I do not know how to do it): Define some kind of specification object at the start that lists all the methods to be used, and pass this object to a class that is in charge of executing them.
How would you do it? Other ideas are welcomed.
Thanks in advance. :)

How would you do it? Other ideas are welcomed.
Rather than parse XML myself, I would as a first step let something like CodeSynthesis XSD generate all the needed classes for me and work on those. Later, when performance or something else becomes an issue, I would start to look around for more efficient parsers, and only if that is not fruitful would I start to design and write my own parser for the specific case.
Edit:
Sorry, I should have been more specific :P, the first part is done
automatically, the whole code is generated from the XML schema.
OK, let's discuss then how to handle the usual situation where, as software evolves, its input eventually evolves too. I put all the silver bullets and magic wands on the table here. Whether and which of them you implement is totally up to you.
I put a version attribute into most things that I create anyway. It is sensible to have one in place before you hit a backward-compatibility issue that cannot be solved elegantly. Most importantly, it ensures that when old software fails to parse newer input, it produces a complaint that immediately makes sense to everybody.
I usually also add an interface for a converter. Old software can then be equipped with a converter from a newer version of the input when it fails to parse it. New software can use the same converter to parse older input. It is also the place to plug in a converter from totally "alien" input. Win-win-win situation. ;)
In the special case of a minor change, I would consider whether it is cheap to make the new DecodeInnerChild more flexible internally, so that it accepts the value with or without that "a" at the end as valid. In the converter I still have to strip that "a" when converting for older versions.
Often what actually happens is that InnerChild splits and both versions are used side by side. If there is sufficient behavioral difference between the two InnerChilds, then there is no point in avoiding polymorphic InnerChilds. When polymorphism is added then indeed, as you say in your 1), all containing classes that now have such polymorphic members have to be altered. In such cases the converter should usually either produce a crippled InnerChild or tell the older version that the input is beyond its capabilities.
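A minimal sketch of the polymorphic InnerChild idea, assuming the decoders work on plain strings and that versions are numbered 1 and 2 (both are assumptions; all class and function names here are illustrative, not generated code). The containing decoder is altered once, to receive the child decoder it should use, and only the factory function knows about version numbers.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Interface for the one node whose decoding differs between versions.
struct InnerChildDecoder {
    virtual ~InnerChildDecoder() = default;
    virtual std::string decode(const std::string& raw) const = 0;
};

struct InnerChildDecoderV1 : InnerChildDecoder {
    std::string decode(const std::string& raw) const override { return raw; }
};

// Version 2 appends "a" to the value, as in the question's example.
struct InnerChildDecoderV2 : InnerChildDecoder {
    std::string decode(const std::string& raw) const override { return raw + "a"; }
};

// The only place that maps a version number to a concrete decoder.
// An unknown version fails with a message that names the version.
std::unique_ptr<InnerChildDecoder> make_inner_child_decoder(int version) {
    switch (version) {
        case 1: return std::make_unique<InnerChildDecoderV1>();
        case 2: return std::make_unique<InnerChildDecoderV2>();
        default:
            throw std::invalid_argument("unsupported message version " +
                                        std::to_string(version));
    }
}

// The parent decoder is version-agnostic: it is handed the child decoder.
class ChildBDecoder {
public:
    explicit ChildBDecoder(std::unique_ptr<InnerChildDecoder> inner)
        : inner_(std::move(inner)) {}

    std::string decode(const std::string& inner_raw) const {
        return inner_->decode(inner_raw);
    }

private:
    std::unique_ptr<InnerChildDecoder> inner_;
};
```

With this shape, adding a third version touches the factory function and one new class, while ChildBDecoder and anything above it stay unchanged.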

How do I get a list of all declared classes inheriting from a particular class in C++

I know that isn't exactly possible in C++, but maybe a toolchain that can generate code which has a function, which when called gives me a list of all those classes. For example, across multiple files I have stuff like:
class MyClass : public ParticularClass {
....
};
class MyClass2 : public ParticularClass {
....
};
Then, during runtime, I just want a pointer to single instances of the class. Let's say my generated code looks something like this:
void __populate_classes() {
superList.append(new MyClass());
superList.append(new MyClass2());
}
Also, superList would be of type List<ParticularClass*>. Plus, I'll be using Qt and ParticularClass will be QObject derived, so I can fetch the name of the class anyways. I need to basically introspect the class, so my internal code doesn't really bother much about the newly defined type.
So, is there a way to generate this code with some toolchain? If it is possible with qmake alone, that'd be like icing on the freaking cake :)
Thanks a lot for your time.

Doxygen does a nice job at doing this -- offline. Various IDEs do a nice job at this -- offline. The compiler does not do this. Such knowledge is not needed or used by the compiler.

Here at work I use a tool called Understand 4 C++. It is a tool that helps you analyze your code. It will do this quite easily.
But my favorite part is it comes with a C and Perl API which allows you to take advantage of the abstract syntax tree that 'understand' encapsulates and write your own static analysis tools. I have written tons of tools using this API.
Anyways, it's written by SciTools. http://scitools.com and I don't work for them. I just wholeheartedly like their product. In fact I wrote a C# API that wraps their C API and posted it on CodePlex a few years ago. Sure beats using C or Perl to write static analysis tools.

I don't think what you're trying to do is a good idea. Those who maintain the code after you will have a hard time understanding it.
Maybe you could instead look at how to do it in plain C++. One possible solution that comes to mind is to implement the factory design pattern. Then you can iterate over all the data types in the factory and add them to superList.
Anyway, ack (a simple grep replacement) can do the job if you always declare the inheritance on one line:
ack ": *public ParticularClass" *.h
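The factory idea could look roughly like the sketch below, in plain standard C++ without Qt. Factory, Registrar and the name() method are assumptions made for the example; each class adds itself to the factory through one static Registrar object, so no list has to be generated.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ParticularClass {
    virtual ~ParticularClass() = default;
    virtual std::string name() const = 0;
};

// Registry of creator functions, one per registered subclass.
class Factory {
public:
    using Creator = std::function<std::unique_ptr<ParticularClass>()>;

    static Factory& instance() {
        static Factory factory;  // constructed on first use, so usable during static initialization
        return factory;
    }

    void add(Creator creator) { creators_.push_back(std::move(creator)); }

    // Builds one instance of every registered class.
    std::vector<std::unique_ptr<ParticularClass>> create_all() const {
        std::vector<std::unique_ptr<ParticularClass>> result;
        for (const auto& create : creators_) {
            result.push_back(create());
        }
        return result;
    }

private:
    std::vector<Creator> creators_;
};

// Defining one static Registrar<T> per class adds T to the factory at startup.
template <class T>
struct Registrar {
    Registrar() {
        Factory::instance().add([] { return std::unique_ptr<ParticularClass>(new T()); });
    }
};

struct MyClass : ParticularClass {
    std::string name() const override { return "MyClass"; }
};
static Registrar<MyClass> register_my_class;

struct MyClass2 : ParticularClass {
    std::string name() const override { return "MyClass2"; }
};
static Registrar<MyClass2> register_my_class2;
```

Factory::instance().create_all() then plays the role of the generated populate function. Within one source file the classes are registered in declaration order; across several source files the order is unspecified.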

Going through members of a C++ class

As far as I know, if I have a class such as the following:
class TileSurface{
public:
Tile * tile;
enum Type{
Top,
Left,
Right
};
Type type;
Point2D screenverts[4]; // it's a rectangle.. so..
TileSurface(Tile * thetile, Type thetype);
};
There's no easy way to programmatically (using templates or whatever) go through each member and do things like print their types (for example, typeinfo's typeid(Tile).name()).
Being able to loop through them would be a useful and easy way to generate class size reports, etc. Is this impossible to do, or is there a way (even using external tools) for this?

Simply not possible in C++. You would need something like Reflection to implement this, which C++ doesn't have.
As far as your code is concerned after it is compiled, the "class" doesn't exist -- the names of the variables as well as their types have no meaning in assembly, and therefore they aren't encoded into the binary.
(Note: When I say "Not possible in C++" I mean "not possible to do built into the language" -- you could of course write a C++ parser in C++ which could implement this sort of thing...)

No, there is no easy way. If you put "easy" aside, then with C++ you can do anything imaginable.
If you just want to dump your data contents at run time, the simplest way is to implement operator<<(ostream&, YourClass const&) for each YourClass you are interested in. A bit more complex is to implement the visitor pattern, but with it you can have different reports produced by different visitors, and the visitors can also do things other than generating reports.
If you want it as static analysis (the program is not running, you just want to generate reports), then you can use the debugger database. Alternatively, you can analyze the AST generated by some compilers (g++ and Clang have options to generate it) and produce reports from that.
If you really need run-time reflection then you have to build it into your classes. That involves overhead. For example you may use common base-classes and put all data members of classes into array too. It is often done to communicate with applications written in languages that have reflection on more equal grounds (oldest example is Lisp).
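A minimal sketch of the visitor variant for the TileSurface class from the question, assuming the class is allowed to carry one hand-maintained member list (visit_members) and leaving the constructor out for brevity. The Tile and Point2D definitions and the SizeReport visitor are made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <typeinfo>
#include <vector>

struct Tile {};                    // stand-in definitions for the example
struct Point2D { float x, y; };

class TileSurface {
public:
    enum Type { Top, Left, Right };

    Tile* tile = nullptr;
    Type type = Top;
    Point2D screenverts[4] = {};

    // Hand-maintained member list: the one piece of "reflection" built into the class.
    template <class Visitor>
    void visit_members(Visitor& visitor) const {
        visitor("tile", tile);
        visitor("type", type);
        visitor("screenverts", screenverts);
    }
};

// One report entry per member.
struct MemberInfo {
    std::string name;
    std::string type_name;  // implementation-defined text (mangled on GCC and Clang)
    std::size_t size;
};

// A visitor that records name, typeid name and size of every member it is shown.
struct SizeReport {
    std::vector<MemberInfo> members;

    template <class T>
    void operator()(const char* name, const T&) {
        members.push_back({name, typeid(T).name(), sizeof(T)});
    }
};
```

Other visitors (a printer, a serializer) can reuse the same visit_members function, which is the advantage over a single operator<<.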

I beg to differ from the conventional wisdom. C++ does have it; it's not part of the C++ standard, but every single C++ compiler I've seen emits metadata of this sort for use by the debugger.
Moreover, two formats for the debug database cover almost all modern compilers: pdb (the Microsoft format) and dwarf2 (just about everything else).

Our DMS Software Reengineering Toolkit is what you call an "external tool" for extracting/transforming arbitrary code. DMS is generalized compiler technology parameterized by explicit language definitions. It has language definitions for C, C++, Java, COBOL, PHP, ...
For the C, C++, Java and COBOL versions, it provides complete access to parse trees and symbol table information. That symbol table information includes the kind of data you are likely to want from "reflection". If your goal is to enumerate some set of fields or methods and do something with them, DMS can be used to transform the code (or generate derived code) in arbitrary ways according to what you find in the symbol tables.

If you derive the types of all member variables from a common typeinfo-provider base class of your own, then you can get that. It is a bit more work than in Java, but possible.

External tools: you mentioned that you need reports like class size, etc.
Doxygen (http://www.doxygen.nl/manual/features.html) could help by generating class member lists (including inherited members).

C++ functions exposed to scripting system - self-describing parameter types

A C++ rules engine defines rules in XML where each rule boils down to "if X, then Y" where X is a set of tests and Y a set of actions.
In C++ code, 'functions' usable in tests/actions are created as a class for each 'function', each having a "run(args)" method... each takes its own set of parameters.
This works fine.
But, a separate tool is wanted to save users hand-crafting XML; the rules engine is aimed at non-programmers. The tool needs to know all the 'functions' available, as well as their required input parameters. What's the best way to consider doing this? I considered a couple of possibilities:
1) A config file describes the 'functions' and their parameters, and is read by the tool. This is pretty easy, and the actual C++ code can use it to perform argument validation, but still the C++ and XML are not guaranteed to be in sync: a programmer could modify the C++ and forget to update the XML, leading to validation bugs.
2) Each 'function' class has methods which describe it. Somehow the tool loads the C++ classes... this would be easy in a language supporting reflection but messier in C++; probably you'd have to build a special DLL with all 'functions' or something. Which means extra overhead.
What makes sense given the nature of C++ specifically?
EDIT: is the title descriptive? I can't think of a better one.

There's a 3rd way - IDL.
Imagine you have a client-server app, and you have a code generator that produces wrapper classes that you can deploy on client and server so the user can write an app using the client API and the processing occurs on the server... this is a typical RPC scenario and is used in DCE-RPC, ONC-RPC, CORBA, COM and others.
The trick here is to define the signatures of the methods the client can call, which is done in an Interface Definition Language. This doesn't have to be difficult, but it is the source for the client/server API, you run it through a generator and it produces the C++ classes that you compile up for the client to use.
In your case, it sounds like the XML is the IDL, so you can create a tool that takes the XML and produces the C++ headers describing the functions that your code exposes. You don't really have to generate the cpp files (you could), but it's easier to just generate the headers, so that the programmer who adds a new function/parameter cannot forget to update the implementation: it just won't compile once the headers have been re-generated.
You can generate a header that is #included into the existing C++ headers if there is more there than just the function definitions.
So - that's my suggestion, #3: generate the definitions from your definitive XML signatures.
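To make the "it just won't compile" point concrete, here is a sketch of what such a generated header and the matching hand-written class might look like. The function name IsOlderThan, its parameters, and the base-class layout are all hypothetical; the actual shape depends on the generator you write.

```cpp
#include <string>

// ---- hypothetical generator output ----
// Generated from an XML entry describing a function "IsOlderThan"
// with the parameters (string name, int years).
class IsOlderThanBase {
public:
    virtual ~IsOlderThanBase() = default;
    virtual bool run(const std::string& name, int years) = 0;
};

// ---- hand-written code ----
class IsOlderThan : public IsOlderThanBase {
public:
    // 'override' turns a stale signature into a compile error as soon as
    // the header is regenerated from changed XML.
    bool run(const std::string& name, int years) override {
        return !name.empty() && years > 18;  // placeholder logic for the example
    }
};
```

If the XML later gains a third parameter, the regenerated pure virtual no longer matches, so IsOlderThan fails to compile until the programmer updates it.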

There is one other way:
Add a constraint that the argument types be uniform in a function call.
define some max number of arguments
describe the types and the precedence, i.e. double converts to String but not vice versa
then you have
void f(int a1) .. f(int a1 .. int aN)
void f(double a1) .. f(double a1 .. double aN)
..
void f(T a1) ..
And other concrete data types like String, Date, etc.
Advantages:
Variations in signature fixed and regular
it's possible to only provide the "biggest" type signature (T)
works well with templates and language bridges
can warn with messages such as "action f with 2 Integer parameters is undefined"
