OCaml + Menhir: How to parse OCaml-like tuple patterns?

I'm quite new to Menhir.
I'm wondering how to parse OCaml-like tuple patterns in my own language, which is quite similar to OCaml.
For example, in the expression let a,b,c = ...,
a, b, c should be parsed as Tuple (Var "a", Var "b", Var "c").
But with the following parser definition, the example above is parsed as Tuple (Tuple (Var "a", Var "b"), Var "c").
How can I fix the definition so that it parses patterns the way OCaml does?
I have checked OCaml's parser.mly, but I'm not sure how to implement this.
I think my definition is similar to OCaml's...
What magic do they use?
%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%left COMMA
%token <string> LIDENT
%token UNDERBAR
%nonassoc below_COMMA
%start <Token.token> toplevel
%%
toplevel:
| p = pattern EOF { p }
pattern:
| p = simple_pattern { p }
| psec = pattern_tuple %prec below_COMMA
{ Ppat_tuple (List.rev psec) }
simple_pattern:
| UNDERBAR { Ppat_any }
| LPAREN RPAREN { Ppat_unit }
| v = lident { Ppat_var v }
| LPAREN p = pattern RPAREN { p }
pattern_tuple:
| seq = pattern_tuple; COMMA; p = pattern { p :: seq }
| p1 = pattern; COMMA; p2 = pattern { [p2; p1] }
lident:
| l = LIDENT { Pident l }
The result is the following:
[~/ocaml/error] menhir --interpret --interpret-show-cst ./parser.mly
File "./parser.mly", line 27, characters 2-42:
Warning: production pattern_tuple -> pattern_tuple COMMA pattern is never reduced.
Warning: in total, 1 productions are never reduced.
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
  [pattern:
    [pattern_tuple:
      [pattern:
        [pattern_tuple:
          [pattern: [simple_pattern: [lident: LIDENT]]]
          COMMA
          [pattern: [simple_pattern: [lident: LIDENT]]]
        ]
      ]
      COMMA
      [pattern: [simple_pattern: [lident: LIDENT]]]
    ]
  ]
  EOF
]

Your grammar contains a typical shift/reduce conflict, and the precedence declaration you used to resolve it picks the wrong action. Any book that covers parsing with Yacc explains shift/reduce conflicts and how they are resolved.
Let's see it using your rules. Suppose we have the following input and the parser's lookahead is the second ,:
( p1 , p2 , ...
          ↑
          Yacc is looking at this second COMMA token
There are two possibilities:
Reduce: take p1 , p2 as a pattern. This uses the second rule of pattern.
Shift: consume the COMMA token and move the lookahead cursor to the right, trying to make the comma-separated list longer.
You can see the conflict easily by removing %prec below_COMMA from the pattern rule:
$ menhir z.mly # the %prec thing is removed
...
File "z.mly", line 4, characters 0-9:
Warning: the precedence level assigned to below_COMMA is never useful.
Warning: one state has shift/reduce conflicts.
Warning: one shift/reduce conflict was arbitrarily resolved.
Many Yacc documents say that in this case Yacc prefers to shift, and this default usually matches human intention, including in your case. So one solution is simply to drop the %prec below_COMMA annotation and ignore the warning.
If you do not like having shift/reduce conflict warnings (that's the spirit!), you can explicitly state which action should be chosen in this case using precedences, just as OCaml's parser.mly does. (BTW, OCaml's parser.mly is a jewel box of shift/reduce resolutions. If you are not familiar with them, you should study one or two.)
At a shift/reduce conflict, Yacc chooses the action with the higher precedence. The precedence of the shift is that of the token at the lookahead cursor, which is COMMA in this case. The precedence of the reduce is that of the rule, which can be declared with a %prec TOKEN suffix on the rule. Without %prec, a rule takes the precedence of its rightmost token; the rule pattern: pattern_tuple contains no token, so its precedence is undefined, and this is why the shift/reduce conflict is reported if you remove %prec below_COMMA.
So now the question is: which has the higher precedence, COMMA or below_COMMA? This is declared in the preamble of the mly file. (This is why I asked the questioner to show that part.)
...
%left COMMA
...
%nonassoc below_COMMA
I skip what %left and %nonassoc mean, since any Yacc book explains them. Here the pseudo-token below_COMMA is listed below COMMA in the file, which means below_COMMA has a higher precedence than COMMA. Therefore, the above example chooses Reduce, and yields ( (p1, p2), ... against the intention.
The correct precedence declaration is the opposite. To let Shift happen, below_COMMA must be listed above COMMA so that it has a lower precedence:
...
%nonassoc below_COMMA
%left COMMA
...
See OCaml's parser.mly. It does exactly this. Listing the "below" token above the real one sounds totally crazy, but it is not Menhir's fault. It is Yacc's unfortunate tradition. Blame it. OCaml's parser.mly already has a comment about it:
The precedences must be listed from low to high.
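The resolution rule described in this answer can be modelled in a few lines of OCaml. This is only a sketch of the decision procedure, not Menhir's actual implementation; the names decl, level, and resolve are made up for illustration, and tokens are represented as plain strings.

```ocaml
(* A toy model of how Yacc-style precedence settles a shift/reduce conflict.
   All names here are hypothetical; this is not Menhir's implementation. *)

type assoc = Left | Right | Nonassoc

(* One precedence declaration line, e.g. [%left COMMA]. *)
type decl = assoc * string list

type action = Shift | Reduce | Error | Unresolved

(* Declarations are listed from LOW to HIGH precedence, so the level of a
   token is simply the index of the line that mentions it. *)
let level (decls : decl list) (tok : string) : (int * assoc) option =
  let rec go i = function
    | [] -> None
    | (a, toks) :: rest ->
      if List.mem tok toks then Some (i, a) else go (i + 1) rest
  in
  go 0 decls

(* [rule_prec] is the token named by [%prec] (or the rule's rightmost token);
   [lookahead] is the token that could be shifted. *)
let resolve decls ~rule_prec ~lookahead : action =
  match level decls rule_prec, level decls lookahead with
  | Some (r, _), Some (t, a) ->
    if r > t then Reduce
    else if r < t then Shift
    else (match a with Left -> Reduce | Right -> Shift | Nonassoc -> Error)
  | _ -> Unresolved  (* a missing precedence leaves the conflict in place *)

(* The questioner's order: below_COMMA is listed after COMMA, so it is higher. *)
let question_order = [ (Left, ["COMMA"]); (Nonassoc, ["below_COMMA"]) ]

(* The corrected order: below_COMMA is listed first, so it is lower. *)
let fixed_order = [ (Nonassoc, ["below_COMMA"]); (Left, ["COMMA"]) ]

let () =
  assert (resolve question_order ~rule_prec:"below_COMMA" ~lookahead:"COMMA" = Reduce);
  assert (resolve fixed_order ~rule_prec:"below_COMMA" ~lookahead:"COMMA" = Shift)
```

With the questioner's order the model picks Reduce, which corresponds to the nested tuple; with the corrected order it picks Shift, which lets the comma-separated list grow.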

This is how I would rewrite it:
%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%token <string> LIDENT
%token UNDERBAR
%start toplevel
%%
toplevel:
| p = pattern EOF { p }
pattern:
| p = simple_pattern; tail = pattern_tuple_tail
{ match tail with
| [] -> p
| _ -> Ppat_tuple (p :: tail)
}
pattern_tuple_tail:
| COMMA; p = simple_pattern; seq = pattern_tuple_tail { p :: seq }
| { [] }
simple_pattern:
| UNDERBAR { Ppat_any }
| LPAREN RPAREN { Ppat_unit }
| v = lident { Ppat_var v }
| LPAREN p = pattern RPAREN { p }
lident:
| l = LIDENT { Pident l }
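A major change is that a tuple element is no longer a pattern but a simple_pattern. A pattern itself is a comma-separated sequence of one or more elements. As a consequence, a nested tuple has to be written in parentheses, and no precedence declarations are needed.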
$ menhir --interpret --interpret-show-cst parser.mly
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
  [pattern:
    [simple_pattern: [lident: LIDENT]]
    [pattern_tuple_tail:
      COMMA
      [simple_pattern: [lident: LIDENT]]
      [pattern_tuple_tail:
        COMMA
        [simple_pattern: [lident: LIDENT]]
        [pattern_tuple_tail:]
      ]
    ]
  ]
  EOF
]
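To make the shape of this grammar concrete, here is a hand-written recursive-descent sketch of the same three rules in plain OCaml. It is an illustration only: the token and pattern types are simplified stand-ins (for example, Ppat_var carries a string directly instead of Pident), and Syntax_error and the function names are assumptions, not part of the original code.

```ocaml
(* Hand-written equivalent of the grammar above. Type and function names
   are hypothetical and only mirror the constructors used in the question. *)

type token = LPAREN | RPAREN | COMMA | UNDERBAR | LIDENT of string | EOF

type pattern =
  | Ppat_any
  | Ppat_unit
  | Ppat_var of string
  | Ppat_tuple of pattern list

exception Syntax_error

(* simple_pattern: UNDERBAR | LPAREN RPAREN | LIDENT | LPAREN pattern RPAREN *)
let rec simple_pattern = function
  | UNDERBAR :: rest -> (Ppat_any, rest)
  | LPAREN :: RPAREN :: rest -> (Ppat_unit, rest)
  | LIDENT v :: rest -> (Ppat_var v, rest)
  | LPAREN :: rest ->
    (match pattern rest with
     | (p, RPAREN :: rest') -> (p, rest')
     | _ -> raise Syntax_error)
  | _ -> raise Syntax_error

(* pattern_tuple_tail: COMMA simple_pattern pattern_tuple_tail | empty *)
and pattern_tuple_tail = function
  | COMMA :: rest ->
    let p, rest = simple_pattern rest in
    let tail, rest = pattern_tuple_tail rest in
    (p :: tail, rest)
  | rest -> ([], rest)

(* pattern: simple_pattern pattern_tuple_tail *)
and pattern toks =
  let p, rest = simple_pattern toks in
  match pattern_tuple_tail rest with
  | [], rest -> (p, rest)
  | tail, rest -> (Ppat_tuple (p :: tail), rest)

(* toplevel: pattern EOF *)
let toplevel toks =
  match pattern toks with
  | p, [ EOF ] -> p
  | _ -> raise Syntax_error
```

Calling toplevel on the tokens for a, b, c gives a single Ppat_tuple with three elements, and a nested tuple only arises when the input contains parentheses.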

Related

OCaml function application precedence and associativity

I need to give high precedence and left associativity to function application in my OCaml parser. I have a bunch of different tokens that it matches, such as
%token LET REC EQ IN FUN ARROW
%token IF THEN ELSE
%token PLUS MINUS MUL DIV LT LE NE AND OR
%token LPAREN RPAREN
and I gave all of these precedence and associativity using %left, %right, and so on.
However, since the exp that I'm matching on isn't a token, I was wondering how I would do that in this case:
exp:
| exp exp { App($1,$2)}
I have all my cases under a single exp; I didn't split it into exp1, exp2, and so on. I want to know whether it's possible to give exp exp the highest precedence and make it left-associative.
I posted this on another forum for my class and got:
You can associate a dummy token with the function application rules as follows:
rule: .... %prec DUMMY_FUN_APP
And then specify associativity using %left and the dummy token.
But I'm not really sure what this means, so if someone could elaborate on this or give me another solution, that would be great.
You don't say what parser generator you're using. If you're using ocamlyacc, you might look at the actual grammar of OCaml for ideas. You can find the grammar here: https://github.com/ocaml/ocaml/blob/trunk/parsing/parser.mly
In parser.mly, tokens are listed in precedence order from low to high. Some of the tokens are dummy tokens that are listed only to establish a precedence. These tokens are then referenced from within the syntax rules using %prec token_name.
Here are the last few lines of the token list:
%nonassoc below_SHARP
%nonassoc SHARP /* simple_expr/toplevel_directive */
%nonassoc below_DOT
%nonassoc DOT
/* Finally, the first tokens of simple_expr are above everything else. */
%nonassoc BACKQUOTE BANG BEGIN CHAR FALSE FLOAT INT INT32 INT64
LBRACE LBRACELESS LBRACKET LBRACKETBAR LIDENT LPAREN
NEW NATIVEINT PREFIXOP STRING TRUE UIDENT
Note that the dummy token below_SHARP has very high precedence.
Here are the relevant rules for function application:
expr:
| simple_expr simple_labeled_expr_list
{ mkexp(Pexp_apply($1, List.rev $2)) }
simple_labeled_expr_list:
labeled_simple_expr
{ [$1] }
| simple_labeled_expr_list labeled_simple_expr
{ $2 :: $1 }
labeled_simple_expr:
simple_expr %prec below_SHARP
{ ("", $1) }
For what it's worth, I've always found yacc associativity and precedence declarations to be extremely tricky to understand except in simple cases.
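The idea behind those rules (an application is a simple expression followed by a list of simple expressions) can also be seen in a hand-written parser. The following OCaml sketch is an assumption-laden illustration, not OCaml's parser: the token and exp types, Syntax_error, and the function names are invented, and only identifiers, integers, +, and parentheses are supported.

```ocaml
(* Sketch of "application binds tightest and associates to the left",
   written by hand. All names are hypothetical. *)

type token = IDENT of string | INT of int | PLUS | LPAREN | RPAREN

type exp =
  | Var of string
  | Int of int
  | Add of exp * exp
  | App of exp * exp

exception Syntax_error

(* simple_exp: IDENT | INT | LPAREN exp RPAREN *)
let rec simple_exp = function
  | IDENT x :: rest -> (Var x, rest)
  | INT n :: rest -> (Int n, rest)
  | LPAREN :: rest ->
    (match exp rest with
     | (e, RPAREN :: rest') -> (e, rest')
     | _ -> raise Syntax_error)
  | _ -> raise Syntax_error

(* app_exp: simple_exp followed by zero or more simple_exp, folded to the
   left, so f x y becomes App (App (f, x), y). *)
and app_exp toks =
  let head, rest = simple_exp toks in
  let rec args acc = function
    | (IDENT _ | INT _ | LPAREN) :: _ as toks ->
      let arg, rest = simple_exp toks in
      args (App (acc, arg)) rest
    | rest -> (acc, rest)
  in
  args head rest

(* exp: app_exp (PLUS app_exp)*, so addition sits one level below
   application and therefore binds less tightly. *)
and exp toks =
  let lhs, rest = app_exp toks in
  let rec more acc = function
    | PLUS :: rest ->
      let rhs, rest = app_exp rest in
      more (Add (acc, rhs)) rest
    | rest -> (acc, rest)
  in
  more lhs rest
```

Because arguments are restricted to simple expressions, f x y + z groups as (f x y) + z without any precedence declaration; this is the same design choice that simple_expr makes in parser.mly.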

How to define unrecognized rules in Ocamlyacc

I'm working on a company project where I have to create a compiler for a language using Ocamlyacc and Ocamllex. I want to know whether it is possible to define a rule in my Ocamlyacc parser that tells me when no rule of my grammar matches the syntax of an input.
I should stress that I am a beginner with Ocamllex/Ocamlyacc.
Thank you very much for your help.
If no rule in your grammar matches the input, then the Parsing.Parse_error exception is raised. Usually, this is what you want.
There is also a special token called error that allows you to resynchronize your parser state. You can use it in your rules as if it were a real token produced by the lexer, cf. the eof token.
Also, I would suggest using menhir instead of the more venerable ocamlyacc. It is easier to use and debug, and it also comes with a standard library of predefined parameterized rules (such as list and separated_list).
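As a sketch of how the Parsing.Parse_error exception mentioned above might be turned into a message, the wrapper below takes any ocamlyacc-style entry point and lexer function as arguments. The name parse_or_report and the message format are assumptions; Parsing.Parse_error and the Lexing position fields come from the OCaml standard library.

```ocaml
(* A generic wrapper around an ocamlyacc-style entry point. The function
   name [parse_or_report] is hypothetical; [Parsing.Parse_error] and the
   [Lexing] position functions are from the standard library. *)

let parse_or_report
    (entry : (Lexing.lexbuf -> 'tok) -> Lexing.lexbuf -> 'ast)
    (lexer : Lexing.lexbuf -> 'tok)
    (lexbuf : Lexing.lexbuf) : ('ast, string) result =
  try Ok (entry lexer lexbuf)
  with Parsing.Parse_error ->
    (* Position of the start of the last lexeme the lexer produced. *)
    let pos = Lexing.lexeme_start_p lexbuf in
    Error
      (Printf.sprintf "syntax error at line %d, column %d"
         pos.Lexing.pos_lnum
         (pos.Lexing.pos_cnum - pos.Lexing.pos_bol))
```

With a generated parser and lexer, a call would look something like parse_or_report Parser.main Lexer.token (Lexing.from_channel stdin), where Parser.main and Lexer.token stand for whatever entry points your own modules export. Note that line numbers are only meaningful if the lexer updates them (for example with Lexing.new_line).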
When you write a compiler for a language, the first step is to run your lexer and check whether your program is valid from a lexical point of view.
See the example below:
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
It's a lexer that recognizes some arithmetic expressions.
If your lexer accepts the input, you then give the sequence of lexemes to the parser, which tries to determine whether an AST can be built with the specified grammar. See:
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
This is a little program to parse arithmetic expressions. A program can be rejected at this step because no rule of the grammar can be applied to produce an AST. There is no way to define "unrecognized" rules; instead, you write a grammar that defines which programs are accepted, and everything else is rejected.
let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parser.main Lexer.token lexbuf in
      print_int result; print_newline(); flush stdout
    done
  with Lexer.Eof ->
    exit 0
If you compile the lexer, the parser, and the last program, you get:
1 + 2 is accepted because there are no lexical errors and an AST corresponding to this expression can be built.
1 ++ 2 is rejected: there are no lexical errors, but no rule can build such an AST.
You can find more documentation here: http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html

OCaml parser for :: case

I'm having trouble with associativity. For some reason my = operator has higher precedence than my :: operator.
So, for instance, if I have
"1::[] = []"
as the input string, I get
1 = []::[]
as my expression instead of
[1] = []
If my string is "1::2::[] = []"
I thought it would be parsed as exp1 EQ exp2, and that exp1 and exp2 would then be parsed in turn. But it is parsed as exp1 COLONCOLON exp2 instead.
.
.
.
%nonassoc LET FUN IF
%left OR
%left AND
%left EQ NE LT LE
%right SEMI COLONCOLON
%left PLUS MINUS
%left MUL DIV
%left APP
.
.
.
exp4:
| exp4 EQ exp9 { Bin ($1,Eq,$3) }
| exp4 NE exp9 { Bin ($1,Ne,$3) }
| exp4 LT exp9 { Bin ($1,Lt,$3) }
| exp4 LE exp9 { Bin ($1,Le,$3) }
| exp9 { $1 }
exp9:
| exp COLONCOLON exp9 { Bin ($1,Cons,$3) }
| inner { $1 }
.
.
.
It looks like you might have multiple expression rules (exp, exp1, exp2, ... exp9), in which case the precedence of operations is determined by the interrelation of those rules (which rule expands to which other rule), and the %left/%right declarations are largely irrelevant.
The yacc precedence rules are only used to resolve shift/reduce conflicts, and if your grammar doesn't have shift/reduce conflicts (having resolved the ambiguity by using multiple rules), the precedence levels will have no effect.
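To illustrate precedence that comes from the layering of rules rather than from %left/%right, here is a small hand-written sketch in OCaml with one function per level. It is not the questioner's grammar: the types and names are invented, and it assumes that the left operand of :: comes from the tighter level (inner here), which differs from the exp COLONCOLON exp9 rule shown in the question.

```ocaml
(* Sketch: precedence expressed by layering functions (one per grammar
   level) rather than by %left/%right. All names are hypothetical. *)

type token = INT of int | NIL | COLONCOLON | EQ

type exp =
  | Int of int
  | Nil
  | Cons of exp * exp
  | Eq of exp * exp

exception Syntax_error

(* inner: INT | NIL *)
let inner = function
  | INT n :: rest -> (Int n, rest)
  | NIL :: rest -> (Nil, rest)
  | _ -> raise Syntax_error

(* cons_exp: inner COLONCOLON cons_exp | inner   (right-associative) *)
let rec cons_exp toks =
  match inner toks with
  | lhs, COLONCOLON :: rest ->
    let rhs, rest = cons_exp rest in
    (Cons (lhs, rhs), rest)
  | result -> result

(* eq_exp: eq_exp EQ cons_exp | cons_exp   (left-associative, lower level) *)
let eq_exp toks =
  let lhs, rest = cons_exp toks in
  let rec more acc = function
    | EQ :: rest ->
      let rhs, rest = cons_exp rest in
      more (Eq (acc, rhs)) rest
    | rest -> (acc, rest)
  in
  more lhs rest
```

Since eq_exp only ever calls cons_exp for its operands, :: always groups before =, whatever the precedence declarations say.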
Rules aren't just applied like functions, so you can't refactor your grammar into a set of rules, at least with ocamlyacc. You can try menhir, which allows such refactoring by inlining rules (the %inline directive).
To enable menhir, you need to install it and, if you're using ocamlbuild, pass it the -use-menhir option.

how to match only part of the expression to string in ocamllex

I have a simple ocamllex program where the rules section looks somewhat like this:
let digits= ['0'-'9']
let variables= 'X'|'Z'
rule addinlist = parse
|['\n'] {addinlist lexbuf;}
| "Inc" '(' variables+ '(' digits+ ')' ')' as ine { !inputstringarray.(!inputstringarrayi) <-ine;
inputstringarrayi := !inputstringarrayi +1;
addinlist lexbuf}
|_ as c
{ printf "Unrecognized character: %c\n" c;
addinlist lexbuf
}
| eof { () }
My question is: suppose I want to match Inc(X(7)) so that I can convert it to my abstract syntax, which is "Inc of var of int". I want my lexer to give me separate strings while reading Inc(X(7)): "Inc" as one string (say inb), followed by "X" as another string (say inc), followed by "7" as a third string (say ind), so that I can work with the strings inb, inc, and ind instead of being stuck with the whole string ine that my program currently gives me. How do I go about this? I hope my question is clear.

Bison dangling else

I have the following rule in my grammar:
block: TLBRACE statements TRBRACE
| TLBRACE TRBRACE
;
statements: statement
| statements statement
;
statement: TIF TLPAREN expression TRPAREN TTHEN statement
| TIF TLPAREN expression TRPAREN TTHEN statement TELSE statement
| TWHILE TLPAREN expression TRPAREN statement
| TDO statement TLPAREN expression TRPAREN
| TFOR TLPAREN forinits TSEMICOLON expression TSEMICOLON expressions TRPAREN statement
| block
| declaration TSEMICOLON
| expression TSEMICOLON
;
I'm aware of the dangling else problem. That's why I specified "%left TELSE" at the top of the grammar file. However, even though I instructed Bison to give precedence to the TELSE token, a shift/reduce conflict is still generated. I also tried removing the "%left TELSE" declaration (just to see whether it makes any difference) and nothing changes. I always get the same shift/reduce conflict.
The output I get with the --verbose flag to Bison is:
State 117
32 statement: "if" "(" expression ")" "then" statement . ["identifier", "string value", "double", "int", "lint", "message", "string", "double value", "int value", "(", "{", "}", "-", "do", "else", "for", "if", "while", "binary not"]
33 | "if" "(" expression ")" "then" statement . "else" statement
"else" shift and going to state 122
"else" [reducing with rule 32 (statement)]
$default reducing with rule 32 (statement)
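I got rid of the conflict by declaring %left TTHEN. Precedence resolves a shift/reduce conflict by comparing the precedence of the rule to be reduced (by default, that of its rightmost token, here TTHEN) with the precedence of the lookahead token (TELSE), so both need a declared precedence; declaring only TELSE is not enough. For an else to attach to the nearest if, TTHEN must be declared before TELSE so that it has the lower precedence.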
The best way to solve this problem is to use %nonassoc.
%nonassoc THEN
%nonassoc TELSE
%%
statement: TIF TLPAREN expression TRPAREN TTHEN statement %prec THEN
| TIF TLPAREN expression TRPAREN TTHEN statement TELSE statement
%%
Here, the parser pushes the tokens onto the stack, and once the if expression has been pushed and the next token is TELSE, there is a conflict.
The parser must either reduce the if expression according to the if_statement rule (the first alternative) or shift the next TELSE token onto the stack. You have to set the priorities yourself: in this case, give the TELSE token a higher priority than the if_statement rule by using %nonassoc and %prec. Because the %nonassoc declarations list TELSE after THEN, TELSE has the higher priority, and %prec THEN gives the if_statement rule a lower priority than shifting TELSE for the if_else_statement rule (the second alternative), so the conflict is resolved in favor of the shift.
This is a frequently asked question. Have a look at Reforming the grammar to remove shift reduce conflict in if-then-else which explains how the conflict resolution works.
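The effect of resolving the conflict in favor of shifting else can be seen in a tiny hand-written parser. This OCaml sketch is only an illustration under simplifying assumptions: conditions and statements are single atoms, parentheses are omitted, and all type and function names are invented.

```ocaml
(* Sketch: a hand-written parser that always "shifts" ELSE, so an else
   attaches to the nearest if. All names are hypothetical. *)

type token = IF | THEN | ELSE | ATOM of string

type stmt =
  | Atom of string
  | If of string * stmt * stmt option  (* condition, then-branch, else-branch *)

exception Syntax_error

let rec statement = function
  | ATOM a :: rest -> (Atom a, rest)
  | IF :: ATOM cond :: THEN :: rest ->
    let then_branch, rest = statement rest in
    (match rest with
     | ELSE :: rest ->
       (* Prefer consuming ELSE over finishing the shorter if-then form. *)
       let else_branch, rest = statement rest in
       (If (cond, then_branch, Some else_branch), rest)
     | _ -> (If (cond, then_branch, None), rest))
  | _ -> raise Syntax_error
```

For the input if a then if b then c else d, the inner call to statement sees ELSE first and takes it, so the else-branch belongs to the inner if and the outer if has none.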