Symbolic specification (grammar) parser in C++ (maybe using Spirit?)

I am trying to write a small and simple parser for symbolic specifications in C++. I first tried using tools such as Yacc/Bison (which I already know) and ANTLR, but these are much too unwieldy and complex for what I am trying to do.
My language has the following binary operators: = (specification), + (sum), . (product); and the following unary operators: Seq (sequence of an expression), Z(###) (a unit element, which can be numbered), and E(###) (a neutral element, which can also be numbered). It should also contain variables. Sum and product should have the standard binding priorities, with product binding tighter than sum.
Here is an example of a system of rules:
A = Seq B
B = Z(1) + Z(2).Z(2).C
C = E(1) + Z(4).Z(4).B
which I would like to parse into an associative map from each variable name (A) to its expression tree (Seq B).
After several preliminary attempts (using Python, for instance), I have identified what seems like an ideal tool for this: Spirit, from the Boost library, since I would like something more tightly integrated into the code. The following StackOverflow question gives most of the elements I should need:
Boolean expression (grammar) parser in c++
However, Spirit is pretty complicated, and I am not helped by the fact that I am not comfortable with the advanced C++ traits the library uses; or rather, I am not sure how to correct the errors given by my compiler. My latest attempt:
https://gist.github.com/anonymous/354bb84e54042a3b54f3
It will not compile. Even if it did, I am not sure I understand how I would manipulate the parsed structure. I am not sure I am on the right track.
Can somebody help me get back on track? Or suggest a reasonable alternative solution?
UPDATE 1: Here is a more precise description of what I want.
A list of rules
"VAR = EXPR"
Where EXPR is:
EXPR := EXPR + EXPR
| EXPR . EXPR
| Seq EXPR
| Z<digits*> (the number defaults to zero if omitted)
| E<digits*> (the number defaults to zero if omitted)
| VAR
VAR := alnum+ (excludes Z## and E##)
For instance:
PP = Z + L1 . R1 + L2 . R2 + L3 . R3 + Seq Z3 . Seq Z4
L1 = Z10.Z9
R1 = Z12.PP + E4
...
Would give me an associative table:
[ "PP" -> parsing tree that I can navigate for (Z + L1...)
"L1" -> parsing tree for ...
]
Where PP is:
PP = (((((Z + (L1 . R1)) + (L2 . R2)) + (L3 . R3))) + ((Seq Z3) . (Seq Z4)))
I should say that cv_and_he's debugging was already incredibly useful, but:
-- I am not sure my grammar is correct;
-- I don't know how to parse the list from a system of equations;
-- I have no clue how to exploit/navigate the parse tree once it has been parsed.
My end goal is to write recursive functions that traverse the tree and run some simulations accordingly.
UPDATE 2: I have managed to get most of the parsing done:
https://gist.github.com/anonymous/b6a63a1aa5caa6461dd6
I wonder if it is correct, or if there is a mistake I am missing. After that, what I cannot figure out is:
- how to best parse a list of such rules (instead of just one);
- how to traverse the parsed rules, and modify states (for instance, I would like to renumber the variables, and associate each variable with an integer that is always the same - lambda renumbering).
UPDATE 3: I guess part of my question now is: how can I make a "static visitor" not static, i.e., how can I give it state that is modifiable?
UPDATE 4: I found an answer to the last question here:
How can I use the boost visitor concept with a class containing state variables?
I wonder if it is a sound solution. And I am still left wondering what is the best way of parsing a list of rules.

Related

Haskell check if the regular expression r made up of the single symbol alphabet Σ = {a} defines language L(r) = a*

I have to write an algorithm programmatically using Haskell. The program takes a regular expression r over the unary alphabet Σ = {a} and checks whether r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA, then to a DFA, minimize the DFA at the very end, and then compare; but is there any other way to achieve my goal? I am asking because it is clearly stated that this is a unary alphabet, so I suppose I am meant to use this information somehow to make the exercise much easier.
This is what my regular expression data type looks like:
data Reg = Epsilon        -- epsilon regex
         | Literal Char   -- a
         | Or Reg Reg     -- (a|a)
         | Then Reg Reg   -- (aa)
         | Star Reg       -- (a)*
         deriving Eq
Yes, there is another way. Every DFA for regular languages on the single-letter alphabet is a "lollipop"1: an initial string of nodes that each point to each other (some of which are marked as final and some not) followed by a loop of nodes (again, some of which are marked as final and some not). So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True. The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcd's and stuff).
1 You should try to prove this before you begin implementing, so you can be sure you believe me.
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).

How can I demonstrate that this grammar is not ambiguous?

I know I need to show that there is no string with two different leftmost derivations, i.e. two different parse trees. But how can I do that? I know there is no simple general method, but since this exercise is in the Dragon compilers book, I am pretty sure there is a way of showing it (no need for a formal proof, just a justification of why).
The Gramar is:
S-> SS* | SS+ | a
What this grammar represents is another way of writing simple arithmetic (I do not remember the name of this technique; if anyone knows, please tell me): normal sum arithmetic has the form a+a, and this just represents another way of summing and multiplying. So aa+ also means a+a, aaa*+ is a*a+a, and so on.
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer).
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a symbol, you know precisely which production to use to reduce; there is only one which can apply, and there is no left-hand side you can shift into.
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to prove ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
I think the example string aa shows what you need: can it not be parsed as S => SS* => aa or S => SS+ => aa?
(Note: it cannot. SS* only derives strings ending in *, and SS+ only strings ending in +, so aa is not derivable at all; as explained above, this grammar is in fact unambiguous.)

What is ';;' for in OCaml?

This is my simple OCaml code to print out a merged list.
let rec merge cmp x y = match x, y with
  | [], l -> l
  | l, [] -> l
  | hx :: tx, hy :: ty ->
      if cmp hx hy
      then hx :: merge cmp tx (hy :: ty)
      else hy :: merge cmp (hx :: tx) ty
let rec print_list = function
  | [] -> ()
  | e :: l -> print_int e; print_string " "; print_list l
;;
print_list (merge ( <= ) [1;2;3] [4;5;6]) ;;
I copied the print_list function from Print a List in OCaml, but I had to add ';;' after the function's implementation in order to avoid this error message.
File "merge.ml", line 11, characters 47-57:
Error: This function has type int list -> unit
It is applied to too many arguments; maybe you forgot a `;'.
My question is: why is ';;' needed for print_list but not for merge?
The ;; is, in essence, a way of marking the end of an expression. It's not necessary in source code (though you can use it if you like). It is useful in the toplevel (the OCaml REPL) to cause the evaluation of what you've typed so far. Without such a symbol, the toplevel has no way to know whether you're going to type more of the expression later.
In your case, it marks the end of the expression representing the body of the function print_list. Without this marker, the following symbols look like they're part of the same expression, which leads to a type error.
For top-level expressions in OCaml files, I prefer to write something like the following:
let () = print_list (merge ( <= ) [1;2;3] [4;5;6])
If you code this way you don't need to use the ;; token in your source files.
This is an expansion of Jeffrey's answer.
As you know, when parsing a language, a program has to break the input into manageable lexical elements, and expects these so-called lexemes (or tokens) to follow certain syntactic rules that allow regrouping them into larger units of meaning.
In many languages, the largest syntactic element is the statement, which divides into instructions and definitions. In these same languages, the structure of the grammar requires a special lexeme to indicate the end of some of these units, usually either instructions or statements. Others instead use a lexeme to separate instructions (rather than end each of them), but it's basically the same idea.
In OCaml, the grammar follows patterns which, within the algorithm used to parse the language, permit eliding such an instruction terminator (or separator) in various circumstances. The keyword let, for instance, is always necessary to introduce a new definition, so it can be used to detect the end of the preceding statement at the outermost level of program statements (the top level).
However, you can easily see the problem this induces: the interactive version of OCaml would always need the beginning of a new statement to figure out where the previous one ends, and a user would never be able to provide statements one by one! The ;; special token was thus defined(*) to indicate the end of a top-level statement.
(*): I actually seem to recall that the token was required in OCaml's ancestors, but then made optional when OCaml was devised; however, don't quote me on that.

How does pattern matching work with lists?

I have just started learning Haskell and pattern matching. I just don't understand how it is implemented: do [] and (x:_) evaluate to different types, with the right function implementation for each pattern chosen through polymorphism? Or am I wrong, and is another technique used?
head' :: [a] -> a
head' [] = error "Can't call head on an empty list, dummy!"
head' (x:_) = x
Or let's consider this pattern-matching function:
tell :: (Show a) => [a] -> String
tell [] = "The list is empty"
tell (x:[]) = "The list has one element: " ++ show x
tell (x:y:[]) = "The list has two elements: " ++ show x ++ " and " ++ show y
tell (x:y:_) = "This list is long. The first two elements are: " ++ show x ++ " and " ++ show y
I think I am wrong, because lists with different numbers of elements can't have different types. Can you explain to me how Haskell knows which pattern is correct for some function implementation? I understand how it works logically, but not in depth. Please explain it to me.
There's a level of indirection between the syntax in Haskell and the implementation at runtime. The list syntax that you see in most source code is actually a bit of sugar for what is a fairly regular data type. The following two lines are equivalent:
data List a = Nil | Cons a (List a)
data [a] = [] | a : [a] -- pseudocode
So when you type in [a,b,c] this translates into the internal representation as (a : (b : (c : []))).
The pattern matching you see at top-level bindings is also a bit of syntactic sugar (sans some minor details) for case statements, which deconstruct the data types on the right-hand side of the case of the pattern match. The _ symbol is a builtin wildcard pattern which matches any expression (so long as the pattern is well-typed) but does not bind a variable in the RHS scope.
head :: [a] -> a
head x = case x of
  (a : _) -> a
I think I am wrong because lists with different numbers of elements can't have different types. Can you explain to me how Haskell knows which pattern is correct for some function implementation?
So to answer your question [] and (x:_) are both of the same type (namely [a] or List a in our example above). The list datatype is also recursive so all recursive combinations of Cons'ing values of type a have the same type, namely [a].
When the compiler typechecks the case statement it runs along each of the LHS columns and ensures that all the types are equivalent or can be made equivalent using a process called unification. The same is done on the right hand side of the cases.
Haskell "knows" which pattern is correct by trying each branch sequentially and unpacking the matched value to see if it has the same number of cons elements as the pattern and then proceeding to the branch which does, or failing with a pattern match error if none match. The same process is done for any pattern matching on data types, not just lists.
There are two separate sets of checks:
In compile time (before the program is run), the compiler simply checks that the types of the cases conform to the function's signature, and that the calls to the function conform to the function's signature.
In runtime, when we have access to the actual input, a similar but different set of checks is executed: these checks take the *actual* input (not its type) at hand and determine which of the cases matches it.
So, to answer your question: the decision is made in runtime, where we have not only the type but also the actual value.

How to split simple Lisp-like code into tokens in C++?

Basically, the language has 3 list types and 3 fixed-length types, one of which is string.
It is simple to detect the type of a token using regular expressions, but splitting the input into tokens is not that trivial.
Strings are notated with double-quotes, and a double-quote is escaped with a backslash.
EDIT:
Some example code
{
    print (sum (1 2 3 4))
    if [( 2 + 3 ) < 6] : {print ("Smaller")}
}
Lists:
() are argument lists that are only evaluated when necessary.
[] are special lists to express 2-operand operations in a prettier way.
{} are lists that are always evaluated. The first element is a function name, the second is a list of arguments, and this repeats.
anything : anything [ : anything [: ...]] translates to an argument list that has the elements joined by the :s. This is only for making loops and conditionals look better.
All functions take a single argument. Argument lists can be used for functions that need more. You can force an argument list to evaluate using different types of eval functions. (There would be an eval function for each list model.)
So, if you understand this, it works very similarly to Lisp; it only has different list types for prettifying the code.
EDIT:
#rici
[[2 + 3] < 6] is OK too. As I mentioned, argument lists are evaluated only when necessary. Since < is a function that requires an argument list of length 2, (2 + 3) must be evaluated somehow; otherwise [(2 + 3) < 6] would translate to < (2 + 3) : 6, which equals < (2 + 3 6), which is an invalid argument list for <. But I see your point: it is not trivial how automatic parsing should work in this case. The version I described above is that [...] evaluates its argument list with a function like eval_as_oplist (...). But I guess you are right, because this way you couldn't use an argument list in the regular way inside a [...], which is problematic even if you have no reason to do so, because it doesn't lead to better code. So [[. . .] . .] is better code, I agree.
Rather than inventing your own "Lisp-like, but simpler" language, you should consider using an existing Lisp (or Scheme) implementation and embedding it in your C++ application.
Although designing your own language and then writing your own parser and interpreter for it is surely good fun, you will have a hard time coming up with something better designed, more powerful, and implemented more efficiently and robustly than, say, Scheme and its numerous implementations.
Chibi Scheme: http://code.google.com/p/chibi-scheme/ is particularly well suited for embedding in C/C++ code, it's very small and fast.
I would suggest using Flex (possibly with Bison) or ANTLR, which has a C++ output target.
Since google is simpler than finding stuff on my own file server, here is someone else's example:
http://ragnermagalhaes.blogspot.com/2007/08/bison-lisp-grammar.html
This example has formatting problems (which can be resolved by viewing the HTML in a text editor) and only supports one type of list, but it should help you get started and certainly shows how to split the items into tokens.
I believe Boost.Spirit would be suitable for this task, provided you could construct a PEG-compatible grammar for the language you're proposing. It's not obvious from the examples whether this is the case.
More specifically, Spirit has a generalized AST called utree, and there is example code for parsing symbolic expressions (i.e. Lisp syntax) into utree.
You don't have to use utree in order to take advantage of Spirit's parsing and lexing capabilities, but you would have to have your own AST representation. Maybe that's what you want?
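Whatever tool ends up generating the full parser, the splitting step itself is small. Here is a hedged, hand-rolled sketch in C++ for the syntax described above (function name and token policy are mine); the only non-trivial part is the double-quoted string with backslash escapes, which must be consumed as one token.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits source into tokens: brackets and ':' are single-character tokens,
// "..." strings (with \" escapes) are kept whole, everything else is an
// atom delimited by whitespace/brackets. Classification of atom types can
// be done afterwards with regexes, as the question suggests.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> out;
    size_t i = 0, n = src.size();
    auto isBracket = [](char c) {
        return c=='('||c==')'||c=='['||c==']'||c=='{'||c=='}'||c==':';
    };
    while (i < n) {
        char c = src[i];
        if (std::isspace((unsigned char)c)) { ++i; continue; }
        if (isBracket(c)) { out.push_back(std::string(1, c)); ++i; continue; }
        if (c == '"') {                       // string literal
            size_t j = i + 1;
            while (j < n && src[j] != '"') {
                if (src[j] == '\\') ++j;      // skip the escaped character
                ++j;
            }
            out.push_back(src.substr(i, j + 1 - i));  // keep the quotes
            i = j + 1;
            continue;
        }
        size_t j = i;                          // plain atom: number, name, op
        while (j < n && !std::isspace((unsigned char)src[j])
                     && !isBracket(src[j]) && src[j] != '"')
            ++j;
        out.push_back(src.substr(i, j - i));
        i = j;
    }
    return out;
}
```

For example, tokenize("[(2 + 3) < 6]") yields the nine tokens "[", "(", "2", "+", "3", ")", "<", "6", "]", and an escaped quote inside a string stays inside its token.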