I'm playing with libclang to parse small C++ files. I've seen examples of how to parse ASTs.
As I understand it, an AST consists of AST nodes, each of which is either a Decl or a Stmt. To traverse the tree I can use either an ASTConsumer, which visits AST nodes, or CXCursors.
What is the difference between these two traversal methods?
Both are part of the same method of AST traversal, since cursors are pointers to AST nodes. If you are looking for a different method of AST traversal, you should look into AST matchers. With that method you define a model of the AST that you want to match against the AST of a source file. It can be a powerful method.
Here is an introduction to matching with Clang: https://clang.llvm.org/docs/LibASTMatchers.html
After reading some pages, I found some information on this subject. According to the Clang Tooling documentation there are two interfaces (among other Clang tools):
- LibClang
- LibTooling
CXCursor belongs to the LibClang library. LibClang provides cursors to traverse the AST, although control over the AST is limited.
The classes Decl and Stmt belong to Clang's C++ AST API, which LibTooling gives access to. By defining a RecursiveASTVisitor and its corresponding visit functions, one can traverse these nodes. This method gives full control over the AST.
The RecursiveASTVisitor class implements a stateless visitor for the AST of some source. I would like to process the AST into some JSON that describes the parent classes, methods and their arguments, etc. RecursiveASTVisitor doesn't really seem to have facilities for returning stuff up to parent node visits or passing stuff down to child node visitors, so I'm not sure how I'm meant to maintain the context of the tree traversal.
What's the right way to traverse the AST while building up my own tree?
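One common pattern for carrying context through a stateless-looking traversal is to keep an explicit stack of "current parent" output nodes inside the visitor. The sketch below uses a hypothetical toy Node type, not Clang's Decl/Stmt classes; with RecursiveASTVisitor the push/pop would, I assume, go around the call to the base class's TraverseDecl in an overridden TraverseDecl.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an AST node; Clang's classes are not used here.
struct Node {
    std::string kind;            // e.g. "class", "method", "param"
    std::string name;
    std::vector<Node> children;
};

// Node of the output tree that the traversal builds up.
struct Out {
    std::string label;
    std::vector<Out> children;
};

class TreeBuilder {
public:
    Out build(const Node& root) {
        Out top{"root", {}};
        stack_.push_back(&top);   // the context for top-level nodes
        traverse(root);
        stack_.pop_back();
        return top;
    }

private:
    void traverse(const Node& n) {
        // The top of the stack is the output node of the enclosing AST node.
        Out* parent = stack_.back();
        parent->children.push_back(Out{n.kind + ":" + n.name, {}});
        // Push the new output node so that children attach to it.
        stack_.push_back(&parent->children.back());
        for (const Node& child : n.children) traverse(child);
        stack_.pop_back();        // restore the context for the next sibling
    }

    std::vector<Out*> stack_;
};
```

Calling TreeBuilder().build(root) returns an Out tree whose shape mirrors the input; turning that into JSON is then an ordinary recursive print.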
To be more specific: Microsoft.CodeAnalysis.Editing.SyntaxGenerator and Microsoft.CodeAnalysis.CSharp.SyntaxFactory.
1) My understanding is that they are both part of Roslyn, and both used for generating code, is that correct? If yes, are there others in that category?
2) Is one better and/or newer than the other? I only suspect that because I found this.
Each SyntaxFactory (there's one for VB and one for C#) is tightly coupled to its language. In a sense the SyntaxFactory is a "lower level" API which you can use to create syntax trees. You can control every syntax node, token and trivia in your tree when using a SyntaxFactory. This comes at a cost: It's harder to read and understand source code that uses the SyntaxFactory because it's very verbose.
The SyntaxGenerator provides a language agnostic API which can be used to create pieces of either C# or VB .NET syntax. It provides an API that is easier to use and less verbose than the SyntaxFactory. Note that to create a SyntaxGenerator we need either a Document, Project, or Workspace.
If you read through the implementations of the SyntaxGenerator you'll see that they are using the SyntaxFactory under the hood.
We can compare the two APIs when generating the following code:
void Method()
{
}
SyntaxFactory
var method = SyntaxFactory.MethodDeclaration(
        SyntaxFactory.PredefinedType(
            SyntaxFactory.Token(SyntaxKind.VoidKeyword)),
        SyntaxFactory.Identifier("Method"))
    .WithBody(SyntaxFactory.Block());
SyntaxGenerator
SyntaxGenerator generator = SyntaxGenerator.GetGenerator(...); //We need a Document, Project or Workspace
var method = generator.MethodDeclaration("Method");
The generator is just a cleaner way to build up syntax trees.
I need to add logging to a legacy C++ project, which contains hundreds of user-defined structs/classes. These structs only contain primitive types such as int, float, char[], and enum.
The contents of objects need to be logged, preferably in a human-readable way, but that is not a must as long as the object can be reconstructed.
Instead of writing different serialization methods for each class, is there any alternative method?
What you want is a Program Transformation System (PTS). These are tools that can read source code, build compiler data structures (usually ASTs) that represent the source code, and allow you to modify the ASTs and regenerate source code from the modified AST.
These are useful because they "step outside" the language, and thus have no language-imposed limitations on what you can analyze or transform. So it doesn't matter if your language doesn't have reflection for everything; a good PTS will give you full access to every detail of the language, including such arcana as comments and radix on numeric literals.
Some PTSes are specific to a targeted language (e.g., "Jackpot" is only usable for Java). A really good PTS is provided a description of an arbitrary programming language, and can then manipulate that language. That description has to enable the PTS to parse the code, analyze it (build symbol tables at least) and prettyprint the parsed/modified result.
Good PTSes will allow you to write the modifications you want to make using source-to-source transformations. These are rules specifying changes written in roughly the following form:
if you see *this*, replace it by *that* when *condition*
where this and that are patterns using the syntax of the target language being processed, and condition is a predicate (test) that must be true to enable the rule to be applied. The patterns represent well-formed code fragments, and typically allow metavariables to represent placeholders for arbitrary subfragments.
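To make the shape of such a rule concrete, here is a minimal C++ sketch over a toy expression tree. It is a much cruder stand-in for what a real PTS does (patterns here are hand-written predicates, not surface-syntax patterns), and Expr, Rule, apply, drop_add_zero and render are all hypothetical names invented for the illustration.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy expression tree: a leaf has no args, an operator node has two.
struct Expr {
    std::string op;
    std::vector<Expr> args;
};

// "if you see *this*, replace it by *that* when *condition*"
struct Rule {
    std::function<bool(const Expr&)> matches;    // this
    std::function<Expr(const Expr&)> replace;    // that
    std::function<bool(const Expr&)> condition;  // when
};

// Apply the rule bottom-up: rewrite the children first, then the node itself.
Expr apply(const Rule& rule, Expr e) {
    for (Expr& arg : e.args) arg = apply(rule, arg);
    if (rule.matches(e) && rule.condition(e)) return rule.replace(e);
    return e;
}

// Hypothetical rule: "\x + 0" -> "\x" when \x is a plain leaf.
Rule drop_add_zero() {
    return Rule{
        [](const Expr& e) {
            return e.op == "+" && e.args.size() == 2 &&
                   e.args[1].op == "0" && e.args[1].args.empty();
        },
        [](const Expr& e) { return e.args[0]; },
        [](const Expr& e) { return e.args[0].args.empty(); }
    };
}

// Print an expression with full parentheses.
std::string render(const Expr& e) {
    if (e.args.empty()) return e.op;
    return "(" + render(e.args[0]) + " " + e.op + " " + render(e.args[1]) + ")";
}
```

With this rule, (a + 0) * b is rewritten to (a * b), while (a * b) + 0 is left alone because the condition rejects a non-leaf \x.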
You can use PTSes for a huge variety of program manipulation tasks. For OP's case, what he wants is to enumerate all the structs in the program, pick out the subset of interest, and then generate a serializer for each selected struct as a modification to the original program.
To be practical for this particular task, the PTS must be able to parse and name resolve (build symbol tables) C++. There are very few tools that can do this: Clang, our DMS Software Reengineering Toolkit, and the Rose compiler.
A solution using DMS looks something like this:
domain Cpp~GCC5; -- specify the language and specific dialect to process
pattern log_members( m: member_declarations ): statements = TAG;
-- declares a marker we can place on a subtree of struct member declarations
rule serialize_typedef_struct(s: statement, m: member_declarations, i: identifier):
statements->statements
= "typedef struct { \m } \i;" ->
"typedef struct { \m } \i;
void \make_derived_name\(serialize,\i) ( *\i argument, s: stream )
{ s << "logging" << \toString\(\i\);
\log_members\(\m\)
}"
if selected(i); -- make sure we want to serialize this one
rule generate_member_log_list(m: member_declarations, t: type_specification, n: identifier): statements -> statements
" \log_members\(\t \n; \m\)" -> " s << \n; \log_members\(\m\) ";
rule generate_member_log_base(t: type_specification, n: identifier): statements -> statements
" \log_members\(\t \n; \)" -> " s << \n; ";
ruleset generate_logging {
serialize_typedef_struct,
generate_member_log_list,
generate_member_log_base
}
The domain declaration tells DMS which specific language front-end to use. Yes, GCC5 as a dialect is different than VisualStudio2013, and DMS can handle either.
The pattern log_members is used as a kind of transformational pointer, to remember that there is some work to do. It wraps a sequence of struct member_declarations as an agenda (tag). What the rules do is first mark structs of interest with log_members to establish the need to generate the logging code, and then generate the member logging actions. The log_members pattern acts as a list; it is processed one element at a time until a final element is processed, and then the log_members tag vanishes, having served its purpose.
The rule serialize_typedef_struct is essentially used to scan the code looking for suitable structs to serialize. When it finds a typedef for a struct, it checks that struct is one that OP wants serialized (otherwise one can just leave off the if conditional). The meta-function selected is custom-coded (not shown here) to recognize the names of structs of interest. When a suitable typedef statement is found, it is replaced by the typedef (thus preserving it), and by the shell of a serializing routine containing the agenda item log_members holding the entire list of members of the struct. (If the code declares structs in some other way, e.g., as a class, you will need additional rules to recognize the syntax of those cases). Processing the agenda item by rewriting it repeatedly produces the log actions for the individual members.
The rules are written in DMS rule-syntax; the C++ patterns are written inside metaquotes " ... " to enable DMS to distinguish rule syntax from C++ syntax. Placeholder variables v are declared in the rule header according to their syntactic categories, and show up in the meta-quoted patterns using an escape notation \v. [Note the unescaped i in the selected function call: it isn't inside metaquotes]. Meta-function and pattern references inside the metaquotes are similarly escaped, hence the initially odd-looking \log_members\( ... \), including the escaped pattern name and escaped meta-parentheses.
The two rules generate_member_log_xxx handle the general and final cases of log generation. The general case handles one member with more members to do; the final case handles the last member. (A slight variant would be to process an empty members list by rewriting to the trivial null statement ;). This is essentially walking down a list until you fall off the end. We "cheat" and write rather simple logging code, counting on overloading of stream writes to handle the different datatypes that OP claims he has. If he has more complex types requiring special treatment (e.g., pointer to...) he may want to write specialized rules that recognize those cases and produce different code.
The ruleset generate_logging packages these rules up into a neat bundle. You can trivially ask DMS to run this ruleset over entire files, applying rules until no rules can be further applied. The serialize_typedef_struct rule finds the structs of interest, generating the serializing function shell and the log_members agenda item, which are repeatedly re-written to produce the serialization of the members.
This is the basic idea. I haven't tested this code, and there are usually some surprising syntax variations you end up having to handle, which means writing a few more rules along the same lines.
But once implemented, you can run this rule over the code to get serialized results. (One might implement selected to reject named structs that already have a serialization routine, or alternatively, add rules that replace any existing serialization code with newly generated code, ensuring that the serialization procedures always match the struct definition). There's the obvious extension to generating a serialized struct reader.
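The list-walking behaviour of the two generate_member_log rules (one member plus the rest, then the end of the list) can be mimicked in plain C++ as a small text generator. This is only a sketch of the recursion, not of DMS: Member, log_members and make_serializer are hypothetical names, and the emitted function signature is an assumption, not what the rules above would produce verbatim.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One struct member as (type, name); assumed to be extracted elsewhere.
struct Member {
    std::string type;
    std::string name;
};

// General case: emit the log action for member i, then handle the rest.
// Base case: past the last member, nothing is left to emit.
std::string log_members(const std::vector<Member>& members, std::size_t i = 0) {
    if (i == members.size()) return "";
    return "    s << obj->" + members[i].name + ";\n" + log_members(members, i + 1);
}

// Emit the shell of a serializer and fill it with the member log actions.
std::string make_serializer(const std::string& struct_name,
                            const std::vector<Member>& members) {
    return "void serialize_" + struct_name + "(const " + struct_name +
           "* obj, std::ostream& s)\n{\n" +
           "    s << \"logging " + struct_name + "\";\n" +
           log_members(members) + "}\n";
}
```

Like the rules, the generated code counts on operator<< overloads to deal with the different member types.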
You can arguably implement these same ideas with Clang and/or the Rose Compiler. However, those systems do not offer you source-to-source rewrite rules, so you have to write procedural code to climb up and down trees, inspect individual nodes, etc. It is IMHO a lot more work and a lot less readable.
And when you run into your next "C++ doesn't reflect that", you can tackle the problem with the same tool :-}
Since C++ does not have reflection there is no way for you to dynamically inspect the members of an object at runtime. Thus it follows that you need to write a specific serialization/streaming/logging function for each type.
If all the different types had members of the same name, then you could write a template function to handle them, but I assume that is not the case.
As C++ does not have reflection this is not that easy.
If you want to avoid a verbose solution you can use a variadic template.
E.g.
class MyStruct {
private:
    int a;
    float f;
public:
    void log()
    {
        log_fields(a, f);
    }
};
where log_fields() is the variadic template. It would need to be specialized for all the basic types found on those user defined types and also for a recursive case.
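A minimal sketch of what log_fields() could look like follows. It deviates from the snippet above in two labeled ways: it takes the output stream as an explicit first argument (an assumption made so the result can be inspected), and instead of one specialization per basic type it relies on the existing operator<< overloads.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Base case: no fields left to log.
inline void log_fields(std::ostream&) {}

// Recursive case: log the first field, then the remaining ones.
template <typename T, typename... Rest>
void log_fields(std::ostream& out, const T& first, const Rest&... rest) {
    out << first;
    if (sizeof...(rest) > 0) out << ", ";
    log_fields(out, rest...);
}

class MyStruct {
public:
    MyStruct(int a, float f) : a(a), f(f) {}

    // Returns the logged text instead of printing it (assumption for the demo).
    std::string log() const {
        std::ostringstream out;
        log_fields(out, a, f);
        return out.str();
    }

private:
    int a;
    float f;
};
```

Each class still has to list its own members once in log(); the template only removes the per-type formatting code.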
So I have a need to be able to parse some relatively simple C++ files with annotations and generate additional source files from that.
As an example, I may have something like this:
//# service
struct MyService
{
int getVal() const;
};
I will need to find the //# service annotation, and get a description of the structure that follows it.
I am looking at possibly leveraging LLVM/Clang since it seems to have library support for embedding compiler/parsing functionality in third-party applications. But I'm really pretty clueless as far as parsing source code goes, so I'm not sure what exactly I would need to look for, or where to start.
I understand that ASTs are at the core of language representations, and there is library support for generating an AST from source files in Clang. But comments would not really be part of an AST right? So what would be a good way of finding the representation of a structure that follows a specific comment annotation?
I'm not too worried about handling cases where the annotation would appear in an inappropriate place as it will only be used to parse C++ files that are specifically written for this application. But of course the more robust I can make it, the better.
One way I've been doing this is annotating identifiers of:
classes
base classes
class members
enumerations
enumerators
E.g.:
class /* #ann-class */ MyClass
: /* #ann-base-class */ MyBaseClass
{
int /* #ann-member */ member_;
};
Such annotation makes it easy to write a python or perl script that reads the header line by line and extracts the annotation and the associated identifier.
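The extraction step does not need Python or Perl specifically; as a sketch, the same line-by-line scan can be written in C++ with std::regex. The function name extract_annotations and the exact pattern are assumptions that fit the example above, not a general C++ parser.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Returns (annotation, identifier) pairs such as ("ann-class", "MyClass").
std::vector<std::pair<std::string, std::string>>
extract_annotations(const std::string& header) {
    // Matches: /* #annotation */ identifier
    static const std::regex pattern(
        R"(/\*\s*#([A-Za-z0-9_-]+)\s*\*/\s*(\w+))");

    std::vector<std::pair<std::string, std::string>> result;
    std::istringstream lines(header);
    std::string line;
    while (std::getline(lines, line)) {
        // A line may carry more than one annotation, so iterate over all matches.
        for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
             it != end; ++it) {
            result.emplace_back((*it)[1].str(), (*it)[2].str());
        }
    }
    return result;
}
```

Because it works per line, an annotation and its identifier have to sit on the same line, as they do in the example.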
The annotation and the associated identifier make it possible to generate C++ reflection in the form of function templates that traverse objects passing base classes and members to a functor, e.g:
template<class Functor>
void reflect(MyClass& obj, Functor f) {
    f.on_object_start(obj);
    f.on_base_subobject(static_cast<MyBaseClass&>(obj));
    f.on_member(obj.member_);
    f.on_object_end(obj);
}
It is also handy to generate numeric ids (enumeration) for each base class and member and pass that to the functor, e.g:
f.on_base_subobject(static_cast<MyBaseClass&>(obj), BaseClassIndex<MyClass>::MyBaseClass);
f.on_member(obj.member_, MemberIndex<MyClass>::member_);
Such reflection code makes it possible to write functors that serialize and de-serialize any object type to/from a number of different formats. Functors use function overloading and/or type deduction to treat different types appropriately.
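As an illustration, here is a small functor that writes an object as text using reflect() functions of the shape shown above. The member values, the TextWriter functor and the to_text helper are hypothetical, and the functor is passed by reference here (rather than by value) because it owns a stream.

```cpp
#include <sstream>
#include <string>

struct MyBaseClass { int base_value = 1; };
struct MyClass : MyBaseClass { int member_ = 42; };

// Reflection functions of the kind the generator would emit (hand-written here).
template <class Functor>
void reflect(MyBaseClass& obj, Functor& f) {
    f.on_object_start(obj);
    f.on_member(obj.base_value);
    f.on_object_end(obj);
}

template <class Functor>
void reflect(MyClass& obj, Functor& f) {
    f.on_object_start(obj);
    f.on_base_subobject(static_cast<MyBaseClass&>(obj));
    f.on_member(obj.member_);
    f.on_object_end(obj);
}

// Functor that serializes to text; overloads pick the treatment per member type.
struct TextWriter {
    std::ostringstream out;

    template <class T> void on_object_start(T&) { out << "{ "; }
    template <class T> void on_object_end(T&) { out << "} "; }
    // Base subobjects are serialized by recursing into their own reflect().
    template <class Base> void on_base_subobject(Base& base) { reflect(base, *this); }
    void on_member(int value) { out << value << ' '; }
    void on_member(double value) { out << value << ' '; }
};

std::string to_text(MyClass& obj) {
    TextWriter writer;
    reflect(obj, writer);
    return writer.out.str();
}
```

A de-serializing functor would have the same member functions but read into the references it is handed instead of writing them out.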
Parsing C++ code is an extremely complex task. Leveraging a C++ compiler might help, but it could be beneficial to restrict yourself to a less powerful, more domain-specific format, i.e., to generate both the source and the additional C++ files from a simpler representation: something like protobuf's .proto files, SOAP's WSDL, or something even simpler in your specific case.
I did some very similar work recently. The research I did indicated that there weren't any out-of-the-box solutions available already, so I ended up hand-rolling one.
The other answers are dead-on regarding parsing C++ code. I needed something that could get ~90% of C++ code parsed correctly; I ended up using srcML. This tool takes C++ or Java source code and converts it to an XML document, which makes it easier for you to parse. It keeps the comments intact. Furthermore, if you need to do a source code transformation, it comes with a reverse tool which will take the XML document and produce source code.
It works in 90% of the cases correctly, but it trips on complicated template metaprogramming and the darkest corners of C++ parsing. Fortunately, my input source code is fairly consistent in design (not a lot of C++ trickery), so it works for us.
Other items to look at include gcc-xml and reflex (which actually uses gcc-xml). I'm not sure if GCC-XML preserves comments or not, but it does preserve GCC attributes and pragmas.
One last item to look at is this blog on writing GCC plugins, written by the author of the CodeSynthesis ODB tool.
Good luck!
As far as I know, if I have a class such as the following:
class TileSurface{
public:
    Tile * tile;
    enum Type{
        Top,
        Left,
        Right
    };
    Type type;
    Point2D screenverts[4]; // it's a rectangle.. so..
    TileSurface(Tile * thetile, Type thetype);
};
There's no easy way to programmatically (using templates or whatever) go through each member and do things like print their types (for example, typeinfo's typeid(Tile).name()).
Being able to loop through them would be a useful and easy way to generate class size reports, etc. Is this impossible to do, or is there a way (even using external tools) for this?
Simply not possible in C++. You would need something like Reflection to implement this, which C++ doesn't have.
As far as your code is concerned after it is compiled, the "class" doesn't exist -- the names of the variables as well as their types have no meaning in assembly, and therefore they aren't encoded into the binary.
(Note: When I say "Not possible in C++" I mean "not possible to do built into the language" -- you could of course write a C++ parser in C++ which could implement this sort of thing...)
No. There is no easy way. Setting "easy way" aside, with C++ you can do anything imaginable.
If you just want to dump your data contents at run time, the simplest way is to implement operator<<(ostream&,YourClass const&) for each YourClass you are interested in. A bit more complex is to implement the visitor pattern, but with the visitor pattern you can have different reports produced by different visitors, and the visitors may also do other things besides generating reports.
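A sketch of the operator<< approach, using simplified hypothetical types (Point2D here is a stand-in with two ints, and Segment and dump are made up for the example):

```cpp
#include <ostream>
#include <sstream>
#include <string>

struct Point2D {
    int x;
    int y;
};

// One hand-written overload per class you want to dump.
std::ostream& operator<<(std::ostream& os, const Point2D& p) {
    return os << '(' << p.x << ", " << p.y << ')';
}

struct Segment {
    Point2D from;
    Point2D to;
};

// Overloads compose: Segment reuses the Point2D overload for its members.
std::ostream& operator<<(std::ostream& os, const Segment& s) {
    return os << "Segment{" << s.from << " -> " << s.to << '}';
}

// Helper that renders any streamable value to a string.
template <class T>
std::string dump(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

The cost is exactly what the question wants to avoid: every overload has to be written and kept in sync with its class by hand.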
If you want it as static analysis (the program is not running; you just want to generate reports), then you can use the debugger database. Alternatively you can analyze the AST generated by some compilers (g++ and Clang have options to dump it) and generate reports from it.
If you really need run-time reflection then you have to build it into your classes. That involves overhead. For example you may use common base classes and also put all data members of a class into an array. This is often done to communicate on more equal grounds with applications written in languages that have reflection (the oldest example is Lisp).
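A minimal sketch of reflection built into the classes, assuming a hypothetical common base class Reflectable that makes every class describe its own fields by hand (FieldInfo, Tile's members and size_report are invented for the example):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Description of one data member, filled in by hand by each class.
struct FieldInfo {
    std::string name;
    std::string type;
    std::size_t size;
};

// Common base class: anything derived from it can be inspected at run time.
class Reflectable {
public:
    virtual ~Reflectable() = default;
    virtual std::string class_name() const = 0;
    virtual std::vector<FieldInfo> fields() const = 0;
};

// Hypothetical class that opts in to reflection.
class Tile : public Reflectable {
public:
    int id = 0;
    float height = 0.0f;

    std::string class_name() const override { return "Tile"; }
    std::vector<FieldInfo> fields() const override {
        return {{"id", "int", sizeof(id)}, {"height", "float", sizeof(height)}};
    }
};

// Works for every Reflectable without knowing the concrete class.
std::string size_report(const Reflectable& obj) {
    std::string report = obj.class_name() + ":";
    for (const FieldInfo& field : obj.fields()) {
        report += " " + field.name + "(" + field.type + ", " +
                  std::to_string(field.size) + ")";
    }
    return report;
}
```

The overhead mentioned above shows up as the virtual table pointer in every object and as the field lists that must be maintained alongside the members.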
I beg to differ from the conventional wisdom. C++ does have it; it's not part of the C++ standard, but every single C++ compiler I've seen emits metadata of this sort for use by the debugger.
Moreover, two formats for the debug database cover almost all modern compilers: pdb (the Microsoft format) and dwarf2 (just about everything else).
Our DMS Software Reengineering Toolkit is what you call an "external tool" for extracting/transforming arbitrary code. DMS is generalized compiler technology parameterized by explicit language definitions. It has language definitions for C, C++, Java, COBOL, PHP, ...
For C, C++, Java and COBOL versions, it provides complete access to parse trees, and symbol table information. That symbol table information includes the kind of data you are likely to want from "reflection". If your goal is to enumerate some set of fields or methods and do something with them, DMS can be used to transform the code (or generate derived code) according to what you find in the symbol tables in arbitrary ways.
If you derive all types of the member variables from your common typeinfo-provider base class, then you can get that. It is a bit more work than in Java, but possible.
External tools: you mentioned that you need reports like class size, etc.
Doxygen (http://www.doxygen.nl/manual/features.html) could help by generating class member lists (including inherited members).