I am trying to arrange for ocamllex and ocamlyacc code to scan and parse a simple language. I have defined the abstract syntax for it but am having difficulty writing scanning rules for the more complex cases. Here's my code:
{
type exp = B of bool | Const of float | Iszero of exp | Diff of exp*exp |
If of exp * exp * exp
}
rule scanparse = parse
|"true"| "false" as boolean {B boolean}
|['0'-'9']+ "." ['0'-'9']* as num {Const num}
|"iszero" space+ ['a'-'z']+ {??}
|'-' space+ '(' space* ['a'-'z']+ space* ',' space* ['a'-'z']+ space* ')' {??}
But I am not able to access certain portions of the matched string. Since the expression type is recursive, nested functions aren't helping either. Please help.
To elaborate slightly on my comment above, it looks to me like you're trying to use ocamllex to do what ocamlyacc is for. I think you need to define very simple tokens in ocamllex (like booleans, numbers, and variable names), then use ocamlyacc to define how they go together to make things like Iszero, Diff, and If. ocamllex isn't powerful enough to parse the structures defined by your abstract syntax.
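To sketch what that division of labor might look like (the token names, the `Parser` module reference, and the file layout here are my own illustration based on the question's types, not working code from the question):

```ocaml
(* lexer.mll -- only flat tokens live here *)
{ open Parser }   (* the token type is declared in parser.mly *)
rule token = parse
| [' ' '\t' '\n']+               { token lexbuf }   (* skip whitespace *)
| "true"                         { TRUE }
| "false"                        { FALSE }
| "iszero"                       { ISZERO }
| ['0'-'9']+ '.' ['0'-'9']* as n { NUM (float_of_string n) }
| '-'                            { MINUS }
| '('                            { LPAREN }
| ')'                            { RPAREN }
| ','                            { COMMA }

(* parser.mly -- the recursive structure lives here *)
exp:
  TRUE                                { B true }
| FALSE                               { B false }
| NUM                                 { Const $1 }
| ISZERO exp                          { Iszero $2 }
| MINUS LPAREN exp COMMA exp RPAREN   { Diff ($3, $5) }
;
```

The key point is that `Iszero`, `Diff`, and `If` never appear in the lexer at all; the lexer only hands the parser a stream of flat tokens.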
Update
Here is an ocamlyacc tutorial that I found linked from OCaml.org, which is a pretty good endorsement: OCamlYacc tutorial. I looked through it and it looks good. (When I started using ocamlyacc, I already knew yacc so I was able to get going pretty quickly.)
Related
When parsing a grammar, should RegEx be used to match the parts that can be expressed as regular languages, or should the current parser design be used exclusively?
For example, the EBNF grammar for JSON can be expressed as:
object ::= '{' '}' | '{' members '}';
members ::= pair | pair ',' members;
pair ::= string ':' value;
array ::= '[' ']' | '[' elements ']';
elements ::= value | value ',' elements;
value ::= string | number | object | array | 'true' | 'false' | 'null';
So the grammar would need to be matched using some type of parser (such as a recursive descent or ad hoc parser), but the grammar for some of the values (such as numbers) can be expressed as a regular language, like this RegEx pattern for number:
-?\d+(\.\d+)?([eE][+-]?\d+)?
Given this example, assuming one is creating a recursive descent JSON parser... should the number be matched via the recursive descent technique or should the number be matched via RegEx since it can be matched easily using RegEx?
This is a very broad question, and the answer is largely a matter of opinion. That said, usually you will want a parser to be as fast as possible and to have the smallest possible memory footprint, especially if it needs to parse in real time (on demand).
A RegEx will surely do the job, but it is like shooting a fly with a nuclear weapon!
This is why many parsers are written in a low-level language like C, to take advantage of string pointers and avoid the overhead of high-level languages like Java, with immutable fields, a garbage collector, and so on.
That said, this heavily depends on your use case and cannot truly be answered in a generic way. You should weigh the developer's convenience of using a RegEx against the performance of the parser.
One additional consideration is that usually you will want the parser to indicate where a syntax error occurred, and what type of error it is. With a RegEx, the match will simply fail, and you will have a hard time finding out why it stopped in order to display a proper error message. With an old-school parser, you can stop parsing as soon as you encounter a syntax error, and you know precisely what did not match and where.
In your specific case of JSON parsing with RegEx used only for numbers: I suppose you are probably using a high-level language already, so what many implementations do is rely on the language's native number parsing. Just pick out the value (string, number, ...) using the delimiters, and let the programming language throw an exception if the number fails to parse.
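That "cut at the delimiters, then let the language parse it" idea can be sketched in a few lines. This is my own illustration (the function name `scan_number` and the accepted character set are assumptions, not from the question); note that `float_of_string` raising `Failure` on malformed input doubles as the syntax-error signal discussed above:

```ocaml
(* Characters that may appear inside a JSON-style number lexeme. *)
let is_num_char = function
  | '0' .. '9' | '-' | '+' | '.' | 'e' | 'E' -> true
  | _ -> false

(* Return the number starting at position [i] in [s], and the position
   just past it. No RegEx engine involved: we only find the delimiter. *)
let scan_number (s : string) (i : int) : float * int =
  let j = ref i in
  while !j < String.length s && is_num_char s.[!j] do incr j done;
  (float_of_string (String.sub s i (!j - i)), !j)

let () =
  let (v, next) = scan_number "-12.5e3, true]" 0 in
  assert (v = -12500.0);
  assert (next = 7)
```

A real parser would be stricter (this set accepts `1.2.3` as one lexeme and defers the error to `float_of_string`), but that is exactly the tradeoff the paragraph above describes.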
So I am trying to parse an expression using Tcl_ParseExpr:
// Any syntax errors?
Tcl_Interp *myInterpBuild;
Tcl_Parse parseInfo;
std::string expression = "(test1==1) ? 0.0) : (test2.hello+1.0)";
if (Tcl_ParseExpr(myInterpBuild, expression.c_str(), -1, &parseInfo) == TCL_ERROR)
{
    std::string failMsg = Tcl_GetStringResult(myInterpBuild);
    std::cout << failMsg;
}
Now, usually this works and no error is given. However, if the expression contains a . (dot symbol), then it only parses the part of the expression up to the dot.
For example, if I set expression to '(test1==1) ? 0.0) : (test2.hello+1.0)' then only 'test2' is parsed and 'hello' is thrown away.
The output of the above is:
invalid bareword "test2"
It appears only to evaluate the expression up to and not including the dot character.
Does anyone know why this is happening and what I have to do to fix this?
The . character is not an operator in Tcl's expression language as things stand right now. It can be used within a floating point literal, of course, but it simply isn't a legal part of the grammar rules as an operator. Thus, Tcl's parser stops when it encounters it and throws an error: it would do exactly the same if you fed it into the Tcl expr command. What's more, Tcl's expression language doesn't currently support barewords except as function names (and there's a few keywords that look like barewords too, such as true and false).
Changing that would require rewriting the expression parser and (probably) assigning a meaning to that operator in terms of Tcl's internal bytecode. Not exactly a trivial thing (there's quite a few places in the code to change) but possible if you have a good idea for what to do. If you do, please talk to the Tcl Core Team and we'll see what we can do; you might find us quite receptive to a good suggestion!
Or you can use your own parser, of course. Absolutely nothing stopping that.
I know I need to show that there is no string for which leftmost derivations can lead to two different parse trees. But how can I do it? I know there is no simple general method, but since this exercise is in the Dragon compilers book, I am pretty sure there is a way of showing it (no need for a formal proof, just a justification of why).
The grammar is:
S -> SS* | SS+ | a
What this grammar represents is another notation for simple arithmetic (I do not remember the name of this technique; if anyone knows, please tell me): normal sum arithmetic has the form a+a, and this just represents another way of writing sums and products. So aa+ also means a+a, aaa*+ is a*a+a, and so on.
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer).
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a symbol, you know precisely which production to use to reduce; there is only one which can apply, and there is no left-hand side you can shift into.
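That forced-choice behavior is easy to make concrete. Here is a sketch of the parse as a one-pass stack machine in OCaml (the `tree` type and function names are mine, added just to make the unique parse visible): at each input symbol there is exactly one applicable action, which is the determinism argument above.

```ocaml
(* S -> S S '*' | S S '+' | 'a', parsed LR(0)-style with an explicit stack. *)
type tree = A | Mul of tree * tree | Add of tree * tree

let parse (s : string) : tree =
  let stack = ref [] in
  String.iter
    (fun c ->
      match c, !stack with
      | 'a', st -> stack := A :: st                    (* shift, reduce S -> a *)
      | '*', r :: l :: st -> stack := Mul (l, r) :: st (* reduce S -> S S '*' *)
      | '+', r :: l :: st -> stack := Add (l, r) :: st (* reduce S -> S S '+' *)
      | _ -> failwith "syntax error")
    s;
  match !stack with [ t ] -> t | _ -> failwith "syntax error"

let () =
  assert (parse "aa+" = Add (A, A));
  assert (parse "aaa*+" = Add (A, Mul (A, A)))
```

Since no step ever offers a choice, every sentence has at most one parse tree, which is exactly what unambiguity demands.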
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to prove ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa
I have implemented the usual lexer/parser/pretty-printer combination for reading in and printing a type in my code. I find there is redundancy between the lexer and the pretty-printer when it comes to plain-string regular expressions, usually used for symbols, punctuation, or separators.
For example, I now have
rule token = parse
| "|-" { TURNSTILE }
in my lexer.mll file, and a function like:
let pp fmt (l,r) =
  Format.fprintf fmt "@[%a |-@ %a@]" Form.pp l Form.pp r
for pretty-printing. If I decide to change the string for TURNSTILE, I have to edit two places in the code, which I find less than ideal.
Apparently, ocamllex supports a way to define named regular expressions and then refer to them within the .mll file. So lexer.mll could be written as
let symb_turnstile = "|-"
rule token = parse
| symb_turnstile { TURNSTILE }
But this will not let me externally access symb_turnstile, say from my pretty-printing functions. In fact, after running ocamllex, there are no occurrences of symb_turnstile in lexer.ml. I cannot even refer to these identifiers in the OCaml epilogue of lexer.mll.
Is there any way of achieving this?
In the end, I went for the following style which I stole from the sources of ocamllex itself (so I am guessing it's standard practice). A map from strings to tokens (here an association list) is defined in the preamble of lexer.mll
let symbols =
[
...
(Symb.turnstile, TURNSTILE);
...
]
where Symb is a module defining turnstile as a string. Then, the lexing part of lexer.mll is purposely overly general:
rule token = parse
...
| punctuation
{
try
List.assoc (Lexing.lexeme lexbuf) symbols
with Not_found -> lex_error lexbuf
}
...
where punctuation is a regular expression matching a sequence of symbols.
The pretty-printer can now be written like this.
let pp fmt (l,r) =
  Format.fprintf fmt "@[%a %s@ %a@]" Form.pp l Symb.turnstile Form.pp r
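Here is a minimal self-contained sketch of the whole scheme, with `Symb` collapsed to a one-string module and a toy token type so it runs on its own (the token names beyond TURNSTILE are my own filler):

```ocaml
(* The string lives in exactly one place... *)
module Symb = struct
  let turnstile = "|-"
end

type token = TURNSTILE | COMMA

(* ...and both the lexer's token table and the pretty-printer read it. *)
let symbols = [ (Symb.turnstile, TURNSTILE); (",", COMMA) ]

let pp_pair pp_elt fmt (l, r) =
  Format.fprintf fmt "@[%a %s@ %a@]" pp_elt l Symb.turnstile pp_elt r

let () =
  assert (List.assoc "|-" symbols = TURNSTILE);
  assert (Format.asprintf "%a" (pp_pair Format.pp_print_string) ("A", "B")
          = "A |- B")
```

Changing `Symb.turnstile` now updates the lexer table and the printer in one edit, which was the original goal.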
Although the two tokens both look like strings notationally, they're really very different. I don't think there's a convenient type under which they could be shared for use by ocamllex and Printf.printf. This is possibly the reason that ocamllex doesn't support such external definitions. You could probably get the effect you want with a macro facility (textual inclusion).
I want to be able to specify mathematical rules in an external location to my application and have my application process them at run time and perform actions (again described externally). What is the best way to do this?
For example, I might want to execute function MyFunction1() when the following evaluates to true:
(a < b) & MyFunction2() & (myWord == "test").
Thanks in advance for your help.
(If it is of any relevance, I wish to use C++, C or C++/CLI)
I'd consider not reinventing the wheel --- use an embedded scripting engine. This means you'd be using a standard form for describing the actions and logic. There are several great options out there that will probably be fine for your needs.
Good options include:
Javascript through Google V8. (I don't love this from an embedding point of view, but Javascript is easy to work with, and many people already know it.)
Lua. It's fast and portable. The syntax is maybe not as nice as Javascript's, but embedding is easy.
Python. Clean syntax, lots of libraries. Not much fun to embed, though.
I'd consider using SWIG to help generate the bindings ... I know it works for Python and Lua, not sure about V8.
I would look at the command design pattern to handle calling external mathematical predicates, and the Factory design pattern to run externally defined code.
If your mathematical expression language is that simple, then you could define a grammar for it, e.g.:
expr = bin-op-expr | unary-expr | rel-expr | func-expr | var-expr | "(" expr ")"
bin-op = "&" | "|"
bin-op-expr = expr bin-op expr
unary-expr = "!" expr
rel-op = "<" | ">" | "==" | "!=" | "<=" | ">="
rel-expr = expr rel-op expr
func-args = "(" ")"
func-expr = func-name func-args
var-expr = name
and then translate that into a grammar for a parser. E.g. you could use Boost.Spirit which provides a DSL to allow you to express a grammar within your C++ code.
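For illustration, here is a hand-rolled recursive-descent sketch for a cut-down version of that grammar: only `&`, `<`, `==`, parentheses around comparisons, names, and nullary function calls. `parse_eval`, the environment, and the function table are my own names, and I'm using OCaml to keep it short; the same shape carries over to C++ by hand or with Boost.Spirit.

```ocaml
(* Evaluate expressions like "(a < b) & myfunc()" against an integer
   environment [env] and a table [funcs] of nullary boolean functions. *)
let parse_eval env funcs s =
  let pos = ref 0 in
  let peek () = if !pos < String.length s then Some s.[!pos] else None in
  let skip_ws () = while peek () = Some ' ' do incr pos done in
  let ident () =
    skip_ws ();
    let start = !pos in
    while (match peek () with
           | Some ('a'..'z' | 'A'..'Z' | '0'..'9') -> true
           | _ -> false)
    do incr pos done;
    String.sub s start (!pos - start)
  in
  (* expr := atom { '&' atom }, parsed with right nesting *)
  let rec expr () =
    let l = atom () in
    skip_ws ();
    if peek () = Some '&' then (incr pos; let r = expr () in l && r) else l
  (* atom := '(' name rel-op name ')' | name '(' ')' *)
  and atom () =
    skip_ws ();
    if peek () = Some '(' then begin
      incr pos;
      let a = value () in
      skip_ws ();
      let op =
        if peek () = Some '<' then (incr pos; ( < ))
        else (pos := !pos + 2; ( = ))   (* assume the token is "==" *)
      in
      let b = value () in
      skip_ws ();
      incr pos;                          (* consume ')' *)
      op a b
    end
    else begin
      let f = ident () in
      pos := !pos + 2;                   (* consume "()" *)
      (List.assoc f funcs) ()
    end
  and value () = List.assoc (ident ()) env in
  expr ()

let () =
  let env = [ ("a", 1); ("b", 2) ] in
  let funcs = [ ("myfunc", fun () -> true) ] in
  assert (parse_eval env funcs "(a < b) & myfunc()");
  assert (not (parse_eval env funcs "(a == b)"))
```

Error reporting (the "where did it fail" point made in another answer here) falls out naturally: at any failure you have `pos` in hand.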
If that calculation happens in an inner loop and you want high performance, you cannot go with scripting languages. Based on how deployable and how platform independent you would like it to be:
1) You could express the equations in C++ and let g++ compile them for you at run time, then link against the resulting shared object. But this method is very platform dependent at every step: the necessary system calls, the compiler to use, the flags, loading a shared object (or a DLL)... It would be super fast in the end, though, especially if you compile the innermost loop together with the equation; the equation would be inlined and all.
2) You could use Java in the same way. You can get a good Java compiler written in Java (from Eclipse, I think) and embed it easily. With this solution the result would be somewhat slower, I would expect by a factor of 2 for most purposes (depending on how much template magic you used in the C++ version). But this solution is extremely portable: once you get it working, there's no reason it shouldn't work anywhere, and you don't need anything external to your program. Another downside is having to write your equations in Java syntax, which is ugly for complex math. The first solution is much better in that respect, since operator overloading greatly helps with math equations.
3) I don't know much about C#, but there could be a solution similar to (2). If there is, I know that C# has operator overloading, so your equations would be more pleasant to write and look at.