Specifying C++11 Grammar Action Functions in Shift/Reduce Parser Generator?

I'm working on a shift/reduce parser generator in C++11 and I am not sure how to specify the interface type of the input productions and reduction action functions such that they will hold the information I want to put in them.
I want to specify the grammar statically but using C++ types (not a separate build tool).
For each symbol (terminals and non-terminals) the user provides a string name and a type.
Then each production specifies a head symbol name and one or more body symbol names.
For each production an action function is provided by the user (the hard part) that returns the head nonterminal type and has parameters corresponding to the production body symbols (of their corresponding types).
The main problem is statically binding the parameter types and return type of these action functions to the corresponding symbol types.
So for example:
Suppose we have nonterminals X, A, B, C.
Their names/types might be:
"X" Foo
"A" string
"B" string
"C" int
And in the grammar there might be a production:
X -> A B C
And there will be an action function provided by the user for that production:
Foo f(string A, string B, int C)
If that production is reduced then the function f should be called with the values of the production body symbols. The value returned by f is then stored for when that symbol is used in a higher-up reduction.
So to specify the grammar to the parser generator I need to provide something like:
(I know the following is invalid)
struct Symbol
{
    string name;
    type T;
};

struct Production
{
    string head;
    vector<string> body;
    function<head.T(body[0].T, body[1].T, ..., body[n].T)> action;
};

struct Grammar
{
    vector<Symbol> symbols;
    vector<Production> productions;
};
And to specify the earlier example would be:
Grammar example =
{
    // symbols
    {
        { "X", Foo },
        { "A", string },
        { "B", string },
        { "C", int }
    },
    // productions
    {
        {
            "X",
            { "A", "B", "C" },
            [](string A, string B, int C) { ... return Foo(...); }
        }
    }
};
This won't work, of course; you can't mix type parameters with runtime parameters like that.
One solution would be to have some generic base:
struct SymbolBase
{
    ...
};

template<class SymbolType>
struct SymbolDerived : SymbolBase
{
    SymbolType value;
};
and then make all action functions of type:
typedef function<SymbolBase(vector<SymbolBase>)> ActionFunction;
and sort it out at runtime. But this makes usage more difficult, and all the casting is slow. I'd rather have the function signatures checked at compile-time and keep the mechanics hidden from the user.
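(To illustrate what I mean by hiding the mechanics: a wrapper along these lines would at least keep the user-facing lambdas typed, although it is only a fixed-arity sketch and the checks still happen at run time. erase_action, the constructor on SymbolDerived and the shared_ptr plumbing below are purely illustrative, not part of any real API.)

#include <cassert>
#include <functional>
#include <memory>
#include <vector>

struct SymbolBase {
    virtual ~SymbolBase() {}
};

// Same idea as SymbolDerived above, with a convenience constructor added.
template <class T>
struct SymbolDerived : SymbolBase {
    explicit SymbolDerived(T v) : value(v) {}
    T value;
};

typedef std::function<
    std::shared_ptr<SymbolBase>(const std::vector<std::shared_ptr<SymbolBase>>&)
> ActionFunction;

// Wrap a typed two-symbol action into the erased signature; the dynamic_casts
// live here once, instead of in every user-written action.  A real generator
// would generalise this over arity with variadic templates.
template <class R, class A, class B>
ActionFunction erase_action(std::function<R(A, B)> action)
{
    return [action](const std::vector<std::shared_ptr<SymbolBase>>& body)
        -> std::shared_ptr<SymbolBase>
    {
        assert(body.size() == 2);
        A a = dynamic_cast<SymbolDerived<A>&>(*body[0]).value;  // throws std::bad_cast on a mismatch
        B b = dynamic_cast<SymbolDerived<B>&>(*body[1]).value;
        return std::make_shared<SymbolDerived<R>>(action(a, b));
    };
}

Usage would be something like erase_action<Foo, string, int>(...) with the typed lambda passed in; the parser core then only ever sees ActionFunction. But it still does not statically tie the lambda's signature to the symbol names in the production, which is the part I'm stuck on.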
How can I restructure the Symbol, Production and Grammar types to carry the information I am trying to convey in legal C++11?
(Yes, I have looked at Boost Spirit and friends. It is a fine framework, but it is recursive descent, so the languages it can handle in a single pass are fewer than what a LALR parser can handle, and because it uses backtracking the reduction actions can get called multiple times, etc., etc.)

I've been playing around with precisely this problem. One possibility I've been looking at, which looks like it should work, is to use a stack of variant objects, perhaps boost::variant or boost::any. Since each reduction knows what it's expecting from the stack, the access will be type-safe; unfortunately, the type check will be at run time, but it should be very cheap. This has the advantage of catching bugs :) and it will also correctly destruct objects as they're popped from the stack.
I threw together some sample code as a PoC, available upon request. The basic style for writing a reduction rule is something like this:
parse.reduce<Expression(Expression, _, Expression)>
    ([](Expression left, Expression right) {
        return BinaryOperation(Operator::Times, left, right);
    });
which corresponds to the rule:
expression: expression TIMES expression
Here, BinaryOperation is the AST node-type, and must be convertible to Expression; the template argument Expression(Expression, _, Expression) is exactly the left-hand-side and right-hand-side of the production, expressed as types. (Because the second RHS type is _, the templates don't bother feeding the value to the reduction rule: with a proper parser generator, there would actually be no reason to even push punctuation tokens onto the stack in the first place.) I implemented both the tagged union Expression and the tagged type of the parser stack using boost::variant. In case you try this, it's worth knowing that using a variant as one of the option types of another variant doesn't really work. In the end, it was easiest to wrap the smaller union as a struct. You also really have to read the section about recursive types.
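To give a rough flavour of the stack handling underneath all this, here is a much-reduced sketch (not the PoC itself; Value, ParseStack and reduce2 are names made up just for illustration):

#include <boost/variant.hpp>
#include <string>
#include <vector>

// Every grammar symbol's semantic value lives in one variant type.
typedef boost::variant<int, double, std::string> Value;
typedef std::vector<Value> ParseStack;

// Pop the two topmost values, check their types at run time via boost::get
// (which throws boost::bad_get on a mismatch), run the action, push the result.
template <class R, class A, class B, class F>
void reduce2(ParseStack& stack, F action)
{
    B b = boost::get<B>(stack.back()); stack.pop_back();
    A a = boost::get<A>(stack.back()); stack.pop_back();
    stack.push_back(Value(action(a, b)));
}

int main()
{
    ParseStack s;
    s.push_back(Value(6));
    s.push_back(Value(7));
    reduce2<int, int, int>(s, [](int lhs, int rhs) { return lhs * rhs; });
    // s.back() now holds int(42)
}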

Related

Decoupling algorithm from data, when the algorithm needs knowledge of derived classes

Sorry for the complicated title, but it's a bit hard to explain in just one sentence.
So I'm writing a simple interpreted language to help with some stuff that I often do. I have a lexer set up, feeding into an abstract syntax tree generator.
The Abstract Syntax Tree spits out Expressions. (Which I'm passing around using unique_ptrs). There's several types of expressions that are derived from this base class, which include:
Numbers
Variables
Function calls / prototypes
Binary operations
etc. Each derived class contains the info it needs for that expression, i.e. variables contain a std::string of their identifier, binary operations contain unique_ptrs to the left and right hand side as well as a char of the operator.
Now this is working perfectly, and expressions are parsed just as they should be.
This is what an AST would look like for 'x=y*6^(z-4)+5'
Assignment (=)
|-- Var (x)
`-- BinOp (+)
    |-- BinOp (*)
    |   |-- Var (y)
    |   `-- BinOp (^)
    |       |-- Num (6)
    |       `-- BinOp (-)
    |           |-- Var (z)
    |           `-- Num (4)
    `-- Num (5)
The issue arises when trying to decouple the AST from the interpreter. I want to keep it decoupled in case I want to provide support for compilation in the future, or whatever. Plus the AST is already getting decently complex and I don't want to add to it. I only want the AST to have information about how to take tokens and convert them, in the right order, into an expression tree.
Now, the interpreter should be able to traverse this list of top down expressions, and recursively evaluate each subexpression, adding definitions to memory, evaluating constants, assigning definitions to their functions, etc. But, each evaluation must return a value so that I can recursively traverse the expression tree.
For example, a binary operation expression must recursively evaluate the left hand side and the right hand side, and then perform an addition of the two sides and return that.
Now, the issue is, the AST returns pointers to the base class, Expr – not the derived types. Calling getExpression returns the next expression regardless of its derived type, which allows me to easily recursively evaluate binary operations and so on. In order for the interpreter to get the information about these expressions (the number value, or identifier, for example), I would have to basically dynamically cast each expression and check if it works, and I'd have to do this repeatedly. Another way would be to do something like the Visitor pattern – the Expr calls the interpreter and passes this to it, which allows the interpreter to have multiple definitions for each derived type. But again, the interpreter must return a value!
This is why I can't use the visitor pattern – I have to return values, which would completely couple the AST to the interpreter.
I also can't use a strategy pattern because each strategy returns wildly different things. The interpreter strategy would be too different from the LLVM strategy, for example.
I'm at a complete loss of what to do here. One really clunky solution would be to literally have an enum of each expression type as a member of the Expr base class, and the interpreter could check the type and then make the appropriate typecast. But that's ugly. Really ugly.
What are my options here? Thanks!
The usual answer (as done with most parser generators) is to have both a token type value and associated data (called attributes in discussions of such things). The type value is generally a simple integer and says "number", "string", "binary op", etc. When deciding which production to use, you examine only the token types, and when you get a match to a production rule you then know what kinds of tokens feed into that rule.
If you want to implement this yourself, look up parsing algorithms (LALR and GLR are a couple of examples). Or you could switch to using a parser generator, and then you only have to worry about getting your grammar correct and implementing the productions properly, without having to concern yourself with the parsing engine itself.
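A minimal shape for the "type value plus attributes" idea might look like this (purely illustrative, not taken from any particular generator):

#include <string>

// The parser matches productions by looking only at `type`; the attribute
// fields carry the data a reduction action actually consumes.
enum TokenType { TOK_NUMBER, TOK_STRING, TOK_IDENTIFIER, TOK_BINARY_OP };

struct Token {
    TokenType type;
    double number;      // meaningful when type == TOK_NUMBER
    std::string text;   // meaningful for the string-ish token kinds
};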
Why can't you use the visitor pattern? Any return results simply become local state:
class EvalVisitor
{
    void visit(X x)
    {
        visit(x.y);
        int res1 = res();
        visit(x.z);
        int res2 = res();
        res(res1 + res2);
    }
    // ...
};
The above can be abstracted away so that the logic lies in proper eval functions:
class Visitor
{
public:
    virtual void visit(X) = 0;
    virtual void visit(Y) = 0;
    virtual void visit(Z) = 0;
};

class EvalVisitor : public Visitor
{
public:
    int eval(X);
    int eval(Y);
    int eval(Z);

    int result;

    virtual void visit(X x) { result = eval(x); }
    virtual void visit(Y y) { result = eval(y); }
    virtual void visit(Z z) { result = eval(z); }
};
int evalExpr(Expr& x)
{
    EvalVisitor v;
    x.accept(v);
    return v.result;
}
Then you can do:
Expr& expr = ...;
int result = evalExpr(expr);
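For completeness, a minimal self-contained version of the same scheme; the node types Number and Add here are illustrative stand-ins, not the asker's actual Expr hierarchy:

#include <iostream>
#include <memory>

class Number;
class Add;

class Visitor {
public:
    virtual ~Visitor() {}
    virtual void visit(Number&) = 0;
    virtual void visit(Add&) = 0;
};

class Expr {
public:
    virtual ~Expr() {}
    virtual void accept(Visitor& v) = 0;
};

class Number : public Expr {
public:
    explicit Number(double v) : value(v) {}
    double value;
    void accept(Visitor& v) { v.visit(*this); }
};

class Add : public Expr {
public:
    Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    std::unique_ptr<Expr> lhs, rhs;
    void accept(Visitor& v) { v.visit(*this); }
};

class EvalVisitor : public Visitor {
public:
    double result;
    void visit(Number& n) { result = n.value; }
    void visit(Add& a) {
        a.lhs->accept(*this); double l = result;   // recurse, stash the result
        a.rhs->accept(*this); double r = result;
        result = l + r;
    }
};

double eval(Expr& e) {
    EvalVisitor v;
    e.accept(v);
    return v.result;
}

int main() {
    // (1 + 2) + 3
    std::unique_ptr<Expr> e(new Add(
        std::unique_ptr<Expr>(new Add(std::unique_ptr<Expr>(new Number(1)),
                                      std::unique_ptr<Expr>(new Number(2)))),
        std::unique_ptr<Expr>(new Number(3))));
    std::cout << eval(*e) << "\n";   // prints 6
}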

Can't extend generic struct for specific type

Wanted to toy with adding some sugar in Swift3. Basically, I wanted to be able to do something like:
let randomAdjust = (-10...10).random
To do that, I decided I would need to extend ClosedRange. But then I found it would probably be even better for my case (I really just plan on doing Ints for now) to use CountableClosedRange. My latest of multiple attempts looked like:
extension CountableClosedRange where Bound == Int {
    var random: Int {
        return Int(arc4random_uniform(UInt32(self.count) + 1)) + self.lowerBound
    }
}
But the playground complains:
error: same-type requirement makes generic parameter 'Bound' non-generic
extension CountableClosedRange where Bound == Int {
I don't even know what it's telling me there.
The way this roadblock is commonly encountered is when attempting to extend Array. This is legal:
extension Array where Element : Comparable {
}
But this is illegal:
extension Array where Element == Int {
}
The compiler complains:
Same-type requirement makes generic parameter 'Element' non-generic
The problem is the use of == here in combination with Array's parameterized type Element, because Array is a generic struct.
One workaround with Array is to rise up the hierarchy of Array's inheritance to reach something that is not a generic struct:
extension Sequence where Iterator.Element == Int {
}
That's legal because Sequence and Iterator are generic protocols.
Another solution, though, is to rise up the hierarchy from the target type, namely Int. If we can find a protocol to which Int conforms, then we can use the : operator instead of ==. Well, there is one:
extension CountableClosedRange where Bound : Integer {
}
That's the real difference between our two attempts to implement random on a range. The reason your attempt hits a roadblock and mine doesn't is that you are using == whereas I am using :. I can do that because there's a protocol (Integer) to which Int conforms.
But, as you've been told, with luck all this trickery will soon be a thing of the past.
In Swift 4, what you are attempting is now completely supported. Hooray!
extension Stack where Element: Equatable {
    func isTop(_ item: Element) -> Bool {
        guard let topItem = items.last else {
            return false
        }
        return topItem == item
    }
}
Example from Swift docs: https://docs.swift.org/swift-book/LanguageGuide/Generics.html#ID553

boost function with optional parameters

I have a map containing boost::function values, as defined below:
std::map <std::string, boost::function<std::string (std::string, int)> > handlers;
Let us say I define the following function:
using namespace std;

string substring (string input, int index = 0) {
    if (index <= 0) {
        return input;
    }
    stringstream ss;
    for (int j = index; j < input.length(); j++) {
        ss << input[j];
    }
    return ss.str();
}
I would like to be able to store this in the handlers map, but WITH its optional parameter. Does boost have a way to do this? I have looked at boost::optional, but that doesn't seem to do what I want.
EDIT
To give a little more background, there are a few handlers that require extra arguments, such as a pointer to a dictionary (typedef std::map < std::string, std::string > dictionary) or something, because they make changes to that dictionary. However, the majority of the handlers do not touch the dictionary in question, but, in order to store them all in the same map, they all must take the same arguments (have the same template for boost::function). The goal is to make the functions that don't deal with the dictionary at all usable without having to either A) create a dictionary for the sole purpose of passing it and not using it or B) copy the code verbatim into another function that doesn't require that argument.
The code above is a simplified example of what I am doing.
The short answer: This is not possible in C++ without a lot of additional code.
The long answer:
Default values for function arguments in C++ are only used when they are needed in a context where the function's name appears. If you call a function through other means (like a function pointer, or boost::function/std::function), the information about there possibly being default arguments is not available to the compiler, so it can't fill them in for you.
As a background, this is how default arguments work in C++:
When you have the expression substring(MyString) (with std::string MyString = "something"), then the compiler looks for all functions called substring and finds string substring(string, int=0). This function takes two parameters, one of which can have a default value, which makes the function viable. To actually call the function, the compiler changes the source code so that it reads substring(MyString, 0) and proceeds to generate code based on that adaptation.
To be able to use default values with an indirect call, like through boost::function, you effectively have to emulate the default argument mechanism of the compiler.
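In practice the emulation is usually just a thin adapter per handler that re-supplies the default. A sketch of the idea (the unary_handlers map and substring_default below are my own names, not anything from the question):

#include <boost/function.hpp>
#include <map>
#include <string>

std::string substring(std::string input, int index = 0);   // the question's function, declared here

// A unary map alongside the binary one; the adapter bakes the default back in.
std::map<std::string, boost::function<std::string(std::string)> > unary_handlers;

std::string substring_default(std::string input)
{
    return substring(input);   // the default index = 0 applies here, because we call it by name
}

void register_handlers()
{
    unary_handlers["substring"] = &substring_default;
    // With Boost.Bind (boost/bind.hpp) you could write the same adapter inline:
    //   unary_handlers["substring"] = boost::bind(&substring, _1, 0);
}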

Concise initialization syntax for nested variants?

I'm working on a small C++ JSON library to help sharpen my rusty C++ skills, and I'm having trouble understanding some behavior with initialization lists.
The core of the library is a variant class (named "var") that stores any of the various JSON datatypes (null, boolean, number, string, object, array).
The goal is for var to work as closely as possible to a JavaScript variable, so there's lots of operator overloading going on. The primitive datatypes are easy to take care of...
var fee = "something";
var fie = 123.45;
var foe = false;
The problem is with objects (maps) and arrays (vectors).
To get something close to a JavaScript object and array literal syntax, I'm using initialization lists. It looks like this:
// in my headers
typedef var object[][2];
typedef var array[];
// in user code
var foo = (array){ 1, "b", true };
var bar = (object){ { "name", "Bob" }, { "age", 42 } };
This works out pretty nicely. The problem comes in with nested lists.
var foo = (array){ 1, "b", (array){ 3.1, 3.2 } };
For some reason my variant class interprets the nested "array" as a boolean, giving:
[1, "b", true]
Instead of:
[1, "b", [3.1, 3.2]]
If I explicitly cast the inner list to a var, it works:
var foo = (array){ 1, "b", (var)(array){ 3.1, 3.2 } };
Why do I have to explicitly cast the inner list to a var after I cast it to an array, and how can I get around this extra cast? As far as I can tell it should be implicitly converting the array to my var class anyway, since it's using the constructor for an array of vars:
template <size_t length>
var(const var (&start)[length]) {
    // internal stuff
    init();
    setFromArray(vector<var>(start, start + length));
}
It seems that without the explicit cast to var, the initialization list somehow gets cast to something else on its way from being cast from an array to a var. I'm trying to understand why this happens, and how to avoid it.
Here's a gist with the complete source. Let me know if I should add anything relevant to the question.
Update
Apparently (foo){1, "two"} does not actually cast an initialization list; it's a complete expression called a compound literal. It seems that it's only available in C, although g++ doesn't complain unless you give it -pedantic.
It looks like my options are:
Find another concise initialization syntax that is officially supported.
Use compound literals and hope they work in other compilers.
Drop support for C++ < 11 and use initializer_list.
Don't offer a concise initialization syntax.
Any help with the first option would be the sort of answer I'm looking for at this point.
Macros are another sort of last-ditch option, and I've written some that do the job, but I'd like to not have to use them.
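For reference, option 3 would presumably look something like the following rough sketch; this var is a cut-down stand-in for illustration, not the actual class from the gist:

#include <initializer_list>
#include <string>
#include <vector>

class var {
public:
    var(double) {}
    var(bool) {}
    var(const char*) {}
    var(std::initializer_list<var> items)          // arrays, including nested ones
        : elements(items.begin(), items.end()) {}
private:
    std::vector<var> elements;   // the real class would also hold objects, null, etc.
};

// Nested braces now convert without any casts:
var foo = { 1.0, "b", { 3.1, 3.2 } };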
You need to use the facilities already provided to you by Boost.
typedef boost::optional<boost::make_recursive_variant<
    float, int, bool, // ... etc.
    std::unordered_map<std::string, boost::optional<boost::recursive_variant_>>,
    std::vector<boost::recursive_variant_>
>::type> JSONType;
They can easily define recursive variant types.

Boost::Tuples vs Structs for return values

I'm trying to get my head around tuples (thanks #litb), and the common suggestion for their use is for functions returning > 1 value.
This is something that I'd normally use a struct for, and I can't understand the advantages of tuples in this case - it seems an error-prone approach for the terminally lazy.
Borrowing an example, I'd use this
struct divide_result {
    int quotient;
    int remainder;
};
Using a tuple, you'd have
typedef boost::tuple<int, int> divide_result;
But without reading the code of the function you're calling (or the comments, if you're dumb enough to trust them) you have no idea which int is quotient and vice-versa. It seems rather like...
struct divide_result {
    int results[2]; // 0 is quotient, 1 is remainder, I think
};
...which wouldn't fill me with confidence.
So, what are the advantages of tuples over structs that compensate for the ambiguity?
tuples
I think I agree with you that the issue of which position corresponds to which variable can introduce confusion. But I think there are two sides. One is the call side and the other is the callee side:
int remainder;
int quotient;
tie(quotient, remainder) = div(10, 3);
I think it's crystal clear what we got, but it can become confusing if you have to return more values at once. Once the caller's programmer has looked up the documentation of div, he will know which position is which, and can write effective code. As a rule of thumb, I would say not to return more than 4 values at once. For anything beyond that, prefer a struct.
output parameters
Output parameters can be used too, of course:
int remainder;
int quotient;
div(10, 3, &quotient, &remainder);
Now I think that illustrates how tuples are better than output parameters. We have mixed the input of div with its output, while not gaining any advantage. Worse, we leave the reader of that code in doubt about what the actual return value of div might be. There are wonderful examples where output parameters are useful. In my opinion, you should use them only when you've got no other way, because the return value is already taken and can't be changed to either a tuple or a struct. operator>> is a good example of where you use output parameters, because the return value is already reserved for the stream, so you can chain operator>> calls. If you're not dealing with operators, and the context is not crystal clear, I recommend using pointers, to signal at the call site that the object is actually used as an output parameter, in addition to comments where appropriate.
returning a struct
The third option is to use a struct:
div_result d = div(10, 3);
I think that definitely wins the award for clearness. But note that you still have to access the result through that struct; the result is not "laid bare" on the table, as it was for the output parameters and the tuple used with tie.
I think a major point these days is to make everything as generic as possible. So, say you have got a function that can print out tuples. You can just do
cout << div(10, 3);
And have your result displayed. I think that tuples, on the other hand, clearly win for their versatile nature. Doing that with div_result, you would need to overload operator<<, or output each member separately.
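(For the record, the streaming support comes from boost/tuple/tuple_io.hpp; a minimal example, with the function renamed to divide so it doesn't clash with the C library's div:)

#include <boost/tuple/tuple.hpp>
#include <boost/tuple/tuple_io.hpp>   // operator<< / operator>> for tuples
#include <iostream>

typedef boost::tuple<int, int> div_result;

div_result divide(int dividend, int divisor)
{
    return boost::make_tuple(dividend / divisor, dividend % divisor);
}

int main()
{
    std::cout << divide(10, 3) << "\n";   // prints "(3 1)" with the default delimiters
}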
Another option is to use a Boost Fusion map (code untested):
struct quotient;
struct remainder;

using boost::fusion::map;
using boost::fusion::pair;

typedef map<
    pair<quotient, int>,
    pair<remainder, int>
> div_result;
You can access the results relatively intuitively:
using boost::fusion::at_key;
res = div(x, y);
int q = at_key<quotient>(res);
int r = at_key<remainder>(res);
There are other advantages too, such as the ability to iterate over the fields of the map, etc etc. See the doco for more information.
With tuples, you can use tie, which is sometimes quite useful: std::tr1::tie (quotient, remainder) = do_division ();. This is not so easy with structs. Second, when using template code, it's sometimes easier to rely on pairs than to add yet another typedef for the struct type.
And if the types are different, then a pair/tuple is really no worse than a struct. Think for example pair<int, bool> readFromFile(), where the int is the number of bytes read and bool is whether the eof has been hit. Adding a struct in this case seems like overkill for me, especially as there is no ambiguity here.
Tuples are very useful in languages such as ML or Haskell.
In C++, their syntax makes them less elegant, but can be useful in the following situations:
you have a function that must return more than one argument, but the result is "local" to the caller and the callee; you don't want to define a structure just for this
you can use the tie function to do a very limited form of pattern matching "a la ML", which is more elegant than using a structure for the same purpose.
they come with predefined < operators, which can be a time saver.
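For instance, the predefined operator< (pulled in from boost/tuple/tuple_comparison.hpp) orders tuples lexicographically, so they can be used directly as std::map keys; a small illustrative example:

#include <boost/tuple/tuple.hpp>
#include <boost/tuple/tuple_comparison.hpp>   // operator<, operator== etc. for tuples
#include <iostream>
#include <map>
#include <string>

int main()
{
    // The lexicographic operator< means a tuple can be a map key
    // without any hand-written comparison functor.
    std::map<boost::tuple<int, int>, std::string> labels;
    labels[boost::make_tuple(0, 0)] = "origin";
    labels[boost::make_tuple(2, 3)] = "goal";

    std::cout << labels[boost::make_tuple(2, 3)] << "\n";   // prints "goal"
}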
I tend to use tuples in conjunction with typedefs to at least partially alleviate the 'nameless tuple' problem. For instance if I had a grid structure then:
//row is element 0 column is element 1
typedef boost::tuple<int,int> grid_index;
Then I use the named type as:
grid_index find(const grid& g, int value);
This is a somewhat contrived example but I think most of the time it hits a happy medium between readability, explicitness, and ease of use.
Or in your example:
//quotient is element 0 remainder is element 1
typedef boost::tuple<int,int> div_result;
div_result div(int dividend,int divisor);
One feature of tuples that you don't have with structs is in their initialization. Consider something like the following:
struct A
{
    int a;
    int b;
};
Unless you write a make_tuple equivalent or a constructor, to use this structure as an input parameter you first have to create a temporary object:
void foo (A const & a)
{
    // ...
}

void bar ()
{
    A dummy = { 1, 2 };
    foo (dummy);
}
Not too bad, however, take the case where maintenance adds a new member to our struct for whatever reason:
struct A
{
    int a;
    int b;
    int c;
};
The rules of aggregate initialization actually mean that our code will continue to compile without change. We therefore have to search for all usages of this struct and update them, without any help from the compiler.
Contrast this with a tuple:
typedef boost::tuple<int, int, int> Tuple;

enum {
    A
    , B
    , C
};

void foo (Tuple const & p) {
}

void bar ()
{
    foo (boost::make_tuple (1, 2)); // Compile error
}
The compiler cannot initialize "Tuple" with the result of make_tuple, and so generates an error, which prompts you to specify the correct value for the third parameter.
Finally, the other advantage of tuples is that they allow you to write code which iterates over each value. This is simply not possible using a struct.
void incrementValues (boost::tuples::null_type) {}

template <typename Tuple_>
void incrementValues (Tuple_ & tuple) {
    // ...
    ++tuple.get_head ();
    incrementValues (tuple.get_tail ());
}
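For example (usage sketch, assuming the two overloads above are in scope):

#include <boost/tuple/tuple.hpp>

int main()
{
    boost::tuple<int, int, int> t(1, 2, 3);
    incrementValues(t);   // t is now (2, 3, 4)
}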
They prevent your code from being littered with many struct definitions. It's easier for the person writing the code, and for others using it, when you just document what each element in the tuple is, rather than writing your own struct and making people look up the struct definition.
Tuples will be easier to write - no need to create a new struct for every function that returns something. Documentation about what goes where will go to the function documentation, which will be needed anyway. To use the function one will need to read the function documentation in any case and the tuple will be explained there.
I agree with you 100% Roddy.
To return multiple values from a method, you have several options other than tuples; which one is best depends on your case:
Creating a new struct. This is good when the multiple values you're returning are related, and it's appropriate to create a new abstraction. For example, I think "divide_result" is a good general abstraction, and passing this entity around makes your code much clearer than just passing a nameless tuple around. You could then create methods that operate on this new type, convert it to other numeric types, etc.
Using "Out" parameters. Pass several parameters by reference, and return multiple values by assigning to each out parameter. This is appropriate when your method returns several unrelated pieces of information. Creating a new struct in this case would be overkill, and with Out parameters you emphasize this point, plus each item gets the name it deserves.
Tuples are Evil.