ParseTree traversal in antlr4 - c++

I am using antlr4 c++.
I have a ParseTree and I am trying to recreate the tree structure.
To to this I am using a Visitor my_Visitor and my own node object(s).
My problem is that visitChildren(tree::RuleNode*) calls the visit functions of all children, so I am losing the information when one child tree is traversed and the next child tree is visited.
Assume a tree like this:
A
/ \
B C
When I call visitChildren(A) (using overloaded visitExpression(ExpressionContext*) functions for B and C), I can extract the information that the visit sequence is A,B,C.
This sequence could also result from:
A
|
B
|
C
To recreate the tree I think I would need something like
antlrcpp::Any my_Visitor::my_visitChildren(tree::RuleNode* A){
for(int i=0;i<A->children.size();i++){
//create a new node in my own tree representation as child of A
visit(A->children[i]);
}
}
and call my_visitChildren in my overloaded visitExpression functions.
The problem here is that A->children[i] is a Tree and visit(.) needs a ParseTree.
Can I somehow create a ParseTree from children[i] or is there a better approach to do this?
I am also thinking about using a map from tree->parent to my objects and just attach my new node there, but this is not optimal if I wanted to omit some nodes (e.g. for an AST).

The differentiation between a ParseTree and a Tree is purely artificial and in fact of no real use. The children of a tree node are in fact all ParseTree instances. There are a few tree classes that are never used in the runtime, except for building the base classes of ParseTree. Hence later on I have removed Tree, SyntaxTree and RuleNode and put all that together into the ParseTree class.
To answer your question: you can safely cast the child nodes to ParseTree for your tree walk.

Related

Translating an AST with clang RecursiveASTVisitor

The RecursiveASTVisitor class implements a stateless visitor for the AST of some source. I would like to process the AST into some JSON that describes the parent classes, methods and their arguments, etc. RecursiveASTVisitor doesn't really seem to have facilities for returning stuff up to parent node visits or passing stuff down to child node visitors, so I'm not sure how I'm meant to maintain the context of the tree traversal.
What's the right way to traverse the AST while building up my own tree?

Can I somehow marry enum type with struct of strings?

I am currently writing code that shall interprete xml and extract relevant data for later processing. For this task I've included the pugixml module.
As my code will grow with time surely requiring further data I have no idea of yet I consider my stump xml structure as pretty preliminary and likely to evolve. For that reason I would like my code to be flexible enough so I can easily manage changes to my xml file. To make it short: my xml file is not fixed and I may like to change node names and node structure later.
I am not yet fiddling with flexibilizing node structure (though I have some ideas) but I concentrate on the more simple task to flexibilize node names that are accepted by my code.
To give an example to highlight this with code: I've created a class that shall contain the node "shape" which is child of a node "case_definition" which is child of the document root node "case". I later may introduce a child of case_definition that will have the child "shape" so that "shape" descended one step in the hierarchy of nodes. I also might change the name of the shape node to something else that may proove to be more striking than "shape".
So here's the method to take the node "shape" using pugixml:
//returns the node shape with a given shape_id number
pugi::xml_node nXml::tXmlShape::GetShapeNode(const int shape_id, const pugi::xml_node& root_node)
{
std::string tmp = std::to_string(shape_id);
char const* char_shape_id = tmp.c_str();
nXml::sXmlNodeNames sNodeNames;
nXml::sXmlShapeTypes sShapeTypes;
return root_node.child(sNodeNames.case_def_name.c_str()).find_child_by_attribute(sNodeNames.shape_name.c_str(),"item_id", char_shape_id);
}
;
In the return value you can see the structure of my xml file and this coding makes my structure currently pretty rigid. I have some ideas to flexibilize this my using recursive functions but I am not sure how to implement this at this moment.
But my core question right now is how more elegantly flexibilize the node names. At the moment I use a structure that contains all possible node names as strings so I can simply put different names later on.
const struct sXmlNodeNames
{
std::string root_name = "case";
std::string case_def_name = "case_definition";
std::string shape_name = "shape";
std::string shape_name = "somethingelse";
}
;
However, for some reason I'd prefer enum but I cannot figure out how I then could pass it to the pugixml method that wants c_strings. I would like to have less hard-coded node extraction methods where I simply could pass an enum instance that has been initialized with the proper enum option so it's either a shape or something else and then passes the correct string to the pugixml method. So in short I'd like to marry enum with the above struct functionality but I have not the slightest clue how to approach this.

AST traversal in visitor or in the nodes?

Update accepted Ira Baxter's answer since it pointed me into the right direction: I first figured out what I actually needed by starting the implementation of the compiling stage, and it became obvious pretty soon that traversal within the nodes made thie an impossible approach. Not all nodes should be visited, and some of them in reverse order (for example, first the rhs of an assignment so the compiler can check if the type matches with the rhs/operator). Putting traversal in the visitor makes this all very easy.
I'm playing around with ASTs and the likes before deciding a major refactory of the handling of a mini-language used in an applicaiton.
I've built a Lexer/Parser and can get the AST just fine. There's also a Visitor and as concrete implementation I made an ASTToOriginal which just recreates the original source file. Eventually there's goin to be some sort of compiler that also implements the Vsisitor and creates the actual C++ code at runtime so I want to make sure everything is right from the start.
While everything works fine now, there is some similar/duplicate code since the traversal order is implemented in the Visitor itself.
When looking up more information, it seems that some implementations prefer keeping the traversal order in the visited objects themselves instead, in order not to repeat this in each concrete visitor.
Even the GoF only talks briefly about this, in the same way. So I wanted to give this approach a try as well but got stuck pretty soon.. Let me explain.
Sample source line and corresponding AST nodes:
if(t>100?x=1;sety(20,true):x=2)
Conditional
BinaryOp
left=Variable [name=t], operator=[>], right=Integer [value=100]
IfTrue
Assignment
left=Variable [name=x], operator=[=], right=Integer [value=1]
Method
MethodName [name=sety], Arguments( Integer [value=20], Boolean [value=true] )
IfFalse
Assignment
left=Variable [name=x], operator=[=], right=Integer [value=1]
Some code:
class BinaryOp {
void Accept( Visitor* v ){ v->Visit( this ); }
Expr* left;
Op* op;
Expr* right;
};
class Variable {
void Accept( Visitor* v ){ v->Visit( this ); }
Name* name;
};
class Visitor { //provide basic traversal, terminal visitors are abstract
void Visit( Variable* ) = 0;
void Visit( BinaryOp* p ) {
p->left->Accept( this );
p->op->Accept( this );
p->right->Accept( this );
}
void Visit( Conditional* p ) {
p->cond->Accept( this );
VisitList( p->ifTrue ); //VisitList just iterates over the array, calling Accept on each element
VisitList( p->ifFalse );
}
};
Implementing ASTToOriginal is pretty straightforward: all abstract Visitor methods just print out the name or value member of the terminal.
For the non-terminals it depends; printing an Assignment works ok with the default Visitor traversal, for a Conditional extra code is needed:
class ASTToOriginal {
void Visit( Conditional* p ) {
str << "if(";
p->cond->Accept( this );
str << "?";
//VisitListWithPostOp is like VisitList but calls op for each *except the last* iteration
VisitListWithPostOp( p->ifTrue, AppendText( str, ";" ) );
VisitListWithPostOp( p->ifFalse, AppendText( str, ";" ) );
str << ")";
}
};
So as one can see both the Visit methods for a Conditional in Visitor and ASTToOriginal are indeed very similar.
However trying to solve this by putting traversal into the nodes made things not just worse, but rather a complete mess.
I tried an approach with PreVisit and PostVisit methods which solved some problems, but just introduced more and more code into the Nodes.
It also started to look like I would have to keep track of a number of states inside the Visitor to be able to know when to add closing brackets etc.
class BinaryOp {
void Accept( Conditional* v ) {
v->Visit( this );
op->Accept( v )
VisitList( ifTrue, v );
VisitList( ifFalse, v );
};
class Vistor {
//now all methods are pure virtual
};
class ASTToOriginal {
void Visit( Conditional* p ) {
str << "if(";
//now what??? after returning here, BinaryOp will visit the op automatically so I can't insert the "?"
//If I make a PostVisit( BinaryOp* ), and call it it BinaryOp::Accept, I get the chance to insert the "?",
//but now I have to keep a state: my PostVisit method needs to know it's currently being called as part of a Conditional
//Things are even worse for the ifTrue/ifFalse statement arrays: each element needs a ";" appended, but not the last one,
//how am I ever going to do that in a clean way?
}
};
Question: is this approach just not suited for my case, or am I overlooking something essential? Is there a common design to cope with these problems? What if I also need traversal in a different direction?
There are two problems:
What children nodes are possible to visit?
What order should you visit them?
Arguably the actual children nodes should be known by node type; actually, it should be known by the grammar, and "reflected" from the grammar into the general visitor.
The order in which the nodes are visited depend completely on what you need to do. If you are doing prettyprinting, a left-to-right child order makes sense (if the children nodes are listed in the order of the grammar, which they might not be). If you are constructing symbol tables, you surely want to visit the declaration children before you visit the statement body child.
In addition, you need to worry about what information flows up or down the tree. A list-of-variable accesses would flow up the tree. A constructed symbol table flows up the tree from the declarations, and back down into statement body child. And this information flow forces the visit order; to have a symbol table to pass down into the statement body, you first have to have a symbol table constructed and passed up from the declarations child.
I think these issues are the ones that are giving you greif. You are trying to impose a single structure on your visitors, when in fact the visit order is completely task dependent, and there are lots of different tasks you might do with a tree, each with its own information flow and thus order dependency.
One of the ways this is solved is with the notion of an attribute(d) grammar (AG), in which one decorates the grammar rules with various types of attributes and how they are computed/used. You literally write a computation as an annotation to the grammar rule, for instance:
method = declarations statements ;
<<ResolveSymbols>>: { declarations.parentsymbols=method.symboltable;
statements.symboltable = declarations.symboltable;
}
The grammar rule tells you what node types you have to have. The atrribute computation tells you what value are being passed down the tree (reference to method.symboltable is to something coming form the parent), up the tree (reference to declarations.symbol table is to something computed by that child), or across the tree (statements.symboltable is passed down to the statements child, from the value computed by declarations.symboltable). The attribute computation defines the visitor. The executed computation is a called "attribute evaluation".
This notation for this particular attribute grammar is part of our DMS Software Reengineering Toolkit. Other AG tools use similar notations. Like all (AG) schemes, the particular rule is used to manufacture the purpose-specific ("ResolveSymbols") visitor for the specific node ("method"). With a set of such specifications for each node, you get a set of purpose-specific visitors that can be executed.
The value of an AG scheme is that you can write this easily and all the boilerplate gunk gets generated.
You can think about your problem in the abstract like this, and then simply generate the purpose-specific visitors by hand, as you have been doing.
For recursive generic traversing of trees, Visitor and Composite are usually used together, like in (first relevant google link) there. I first read about this idea there. There are also visitor combinators which are a nice idea.
And by the way...
this is where functional languages shine, with their Algebraic Data Types and pattern matching. If you can, switch to a functional language. Composite and Visitor are only ugly workarounds for lack of language support for respectively ADT and pattern matching.
IMO, I would have each concrete class (e.g. BinaryOp, Variable) extend the Visitor class. That way, all the logic necessary to create a BinaryOp object, would reside in the BinaryOp class. This approach is similar to the Walkabout pattern. It may make your task easier.

Is "node" an ADT? If so, what is its interface?

Nodes are useful for implementing ADTs, but is "node" itself an ADT? How does one implement "node"? Wikipedia uses a plain old struct with no methods in its (brief) article on nodes. I googled node to try and find an exhaustive article on them, but mostly I found articles discussing more complex data types implemented with nodes.
Just what is a node? Should a node have methods for linking to other nodes, or should that be left to whatever owns the nodes? Should a node even be its own standalone class? Or is it enough to include it as an inner struct or inner class? Are they too general to even have this discussion?
A node is an incredibly generic term. Essentially, a node is a vertex in a graph - or a point in a network.
In relation to data structures, a node usually means a single basic unit of data which is (usually) connected to other units, forming a larger data structure. A simple data structure which demonstrates this is a linked list. A linked list is merely a chain of nodes, where each node is linked (via a pointer) to the following node. The end node has a null pointer.
Nodes can form more complex structures, such as a graph, where any single node may be connected to any number of other nodes, or a tree where each node has two or more child nodes. Note that any data structure consisting of one or more connected nodes is a graph. (A linked list and a tree are both also graphs.)
In terms of mapping the concept of a "node" to Object Oriented concepts like classes, in C++ it is usually customary to have a Data Structure class (sometimes known as a Container), which will internally do all the work on individual nodes. For example, you might have a class called LinkedList. The LinkedList class then would have an internally defined (nested) class representing an individual Node, such as LinkedList::Node.
In some more cruder implementations you may also see a Node itself as the only way to access the data structure. You then have a set of functions which operate on nodes. However, this is more commonly seen in C programs. For example, you might have a struct LinkedListNode, which is then passed to functions like void LinkedListInsert(struct LinkedListNode* n, Object somethingToInsert);
In my opinion, the Object Oriented approach is superior, because it better hides details of implementation from the user.
Generally you want to leave node operations to whatever ADT owns them. For example a list should have the ability to traverse its own nodes. It doesn't need to the node to have that ability.
Think of the node as a simple bit of data that the ADT holds.
In the strictest terms, any assemblage of one or more primitive types into some kind of bundle, usually with member functions to operate on the data, is an Abstract Data Type.
The grey area largely comes from which language you operate under. For example, in Python, some coders consider the list to be a primitive type, and thus not an ADT. But in C++, the STL List is definitely an ADT. Many would consider the STL string to be an ADT, but in C# it's definitely a primitive.
To answer your question more directly: Any time you are defining a data structure, be it struct or class, with or without methods, it is necessarily an ADT because you are abstracting primitive data types into some kind of construct for which you have another purpose.
An ADT isn't a real type. That's why it's called an ADT. Is 'node' an ADT? Not really, IMO. It can be a part of one, such as a linked list ADT. Is 'this node I just created to contain thingys' an ADT? Absolutely not! It's, at best, an example of an implementation of an ADT.
There's really only one case in which ADT's can be shown expressed as code, and that's as templated classes. For example, std::list from the C++ STL is an actual ADT and not just an example of an instance of one. On the other hand, std::list<thingy> is an example of an instance of an ADT.
Some might say that a list that can contain anything that obeys some interface is also an ADT. I would mildly disagree with them. It's an example of an implementation of an ADT which can contain a wide variety of objects that all have to obey a specific interface.
A similar argument could be made about the requirements of the std::list's "Concepts". For instance that type T must be copyable. I would counter that by saying that these are simply requirements of the ADT itself while the previous version actually requires a specific identity. Concepts are higher level than interfaces.
Really, an ADT is quite similar to a "pattern" except that with ADT's we're talking about algorithms, big O, etc... With patterns we're talking about abstraction, reuse, etc... In other words, patterns are a way to build something that's implementations solve a particular type of problem and can be extended/reused. An ADT is a way to build an object that can be manipulated through algorithms but isn't exactly extensible.
Nodes are a detail of implementing the higher class. Nodes don't exist or operate on their own- they only exist because of the need for separate lifetimes and memory management than the initial, say, linked list, class. As such, they don't really define themselves as their own type, but happily exist with no encapsulation from the owning class, if their existence is effectively encapsulated from the user. Nodes typically also don't display polymorphism or other OO behaviours.
Generally speaking, if the node doesn't feature in the public or protected interface of the class, then don't bother, just make them structs.
In the context of ADT a node is the data you wish to store in the data structure, plus some plumbing metadata necessary for the data structure to maintain its integrity. No, a node is not an ADT. A good design of an ADT library will avoid inheritance here because there is really no need for it.
I suggest you read the code of std::map in your compiler's standard C++ library to see how its done properly. Granted, you will probably not see an ADT tree but a Red-Black tree, but the node struct should be the same. In particular, you will likely see a lightweight struct that remains private to the data structure and consisting of little other than data.
You're mixing in three mostly orthogonal concepts in your question: C++, nodes, ADTs.
I don't think it's useful to try to sort out what can be said in general about the intersection of those concepts.
However, things can be said about e.g. singly linked list nodes in C++.
#include <iostream>
template< class Payload >
struct Node
{
Node* next;
Payload value;
Node(): next( 0 ) {}
Node( Payload const& v ): next( 0 ), value( v ) {}
void linkInFrom( Node*& aNextPointer )
{
next = aNextPointer;
aNextPointer = this;
}
static Node* unlinked( Node*& aNextPointer)
{
Node* const result = aNextPointer;
aNextPointer = result->next;
return result;
}
};
int main()
{
using namespace std;
typedef Node<int> IntNode;
IntNode* pFirstNode = 0;
(new IntNode( 1 ))->linkInFrom( pFirstNode );
(new IntNode( 2 ))->linkInFrom( pFirstNode );
(new IntNode( 3 ))->linkInFrom( pFirstNode );
for( IntNode const* p = pFirstNode; p != 0; p = p->next )
{
cout << p->value << endl;
}
while( pFirstNode != 0 )
{
delete IntNode::unlinked( pFirstNode );
}
}
I first wrote these operations in Pascal, very early eighties.
It continually surprises me how little known they are. :-)
Cheers & hth.,

Issue while designing B+Tree template class in C++

I'm trying to write a generic C++ implementation of B+Tree. My problem comes from the fact that there are two kinds of nodes in a B+Tree; the internal nodes, which contain keys and pointers to child nodes, and leaf nodes, which contain keys and values, and pointers in an internal node can either point to other internal nodes, or leaf nodes.
I can't figure how to model such a relationship with templates (I don't want to use casts or virtual classes).
Hope there is a solution to my problem or a better way to implement B+Tree in C++.
The simplest way:
bool mIsInternalPointer;
union {
InternalNode<T>* mInternalNode;
LeafNode<T>* mLeafNode;
};
This can be somewhat simplified by using boost::variant :)