I've written a complete parser in bison (and, of course, a complete lexer in flex), and only yesterday I noticed a problem in the parser, specifically in the IF structure.
Here are my rules in my parser: http://pastebin.com/TneESwUx
Here the single IF is not recognized. If I replace "%prec IFX" with "END", adding a new token "END" ("end" return END; in flex), it works. But I don't want a new "end" keyword, which is why I don't use this solution.
Please help me.
The correct way to handle this kind of rule is not precedence; it is refactoring to use an optional else-part so that the parser can use token lookahead to decide how to parse. I would design it something like this:
stmt      : IF '(' expression ')' stmts else_part
          | /* other statement productions here */
          ;
else_part : /* optional */
          | ELSE stmts
          ;
stmts     : stmt
          | '{' stmt_list '}'
          | '{' '}'
          ;
stmt_list : stmt
          | stmt_list ';' stmt
          ;
(This method of special-casing stmts instead of allowing stmt to include a block may not be optimal in terms of productions, and may introduce oddities in your language, but without more details it's hard to say for certain. bison can produce a report showing you how the parser it generated works; you may want to study it. Also beware of unexpected shift/reduce conflicts and especially of any reduce/reduce conflicts.)
Note that shift/reduce conflicts are entirely normal in this kind of grammar; the point of an LALR(1) parser is to use these conflicts as a feature, looking ahead by a single token to resolve the conflict. They are reported specifically so that you can more easily detect the ones you don't want, that you introduce by incorrectly factoring your grammar.
Your IfExpression also needs to be refactored to match; the trick is that else_part should produce a conditional expression of some kind to $$, and in the production for IF you test $6 (corresponding to else_part) and invoke the appropriate IfExpression constructor.
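The logic of that action can be sketched outside bison. This is only a minimal Python illustration, not the poster's code: `If`, `IfElse`, and `make_if` are hypothetical names standing in for the two IfExpression constructors, and `None` stands in for the value an empty else_part would leave in $6.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class If:
    """Hypothetical node for an IF without an else branch."""
    cond: Any
    then_branch: Any


@dataclass
class IfElse:
    """Hypothetical node for an IF with an else branch."""
    cond: Any
    then_branch: Any
    else_branch: Any


def make_if(cond: Any, then_branch: Any, else_part: Optional[Any]):
    """Mimic the action for: IF '(' expression ')' stmts else_part."""
    # else_part plays the role of $6: None when the optional part was empty.
    if else_part is None:
        return If(cond, then_branch)
    return IfElse(cond, then_branch, else_part)
```

In a real bison action the same test would be written against $6, with the empty else_part alternative setting $$ to a null value.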
Your grammar is ambiguous, so you have to live with the shift/reduce conflict. The END token eliminates the ambiguity by ensuring that an IF statement is always properly closed, like a pair of parentheses.
Parentheses make a good analogy here. Suppose you had a grammar like this:
maybe_closed_parens : '(' stuff
| '(' stuff ')'
;
stuff itself generates some grammar symbols, and one of them is maybe_closed_parens.
So if you have input like ((((((whatever, it is correct. The parentheses do not have to be closed. But what if you add )? Which ( is it closing?
This is very much like not being able to tell which IF matches an ELSE.
If you add END to the syntax of the IF (whether or not there is an ELSE), then that is like having a closing parenthesis. IF and END are like ( and ).
Of course, you are stylistically right not to want the word END in your language, because you already have curly braced blocks, which are basically an alternate spelling for Pascal's BEGIN and END. Your } is already an END keyword.
So what you can do is impose the rule that an IF accepts only compound statements (i.e. fully braced):
if_statement : IF condition compound_statement
| IF condition compound_statement ELSE compound_statement
Now it is impossible to have an ambiguity like if x if y w else z because braces must be present: if x { if y { w } else { z } } or if x { if y { w } } else { z }.
I seem to recall that Perl is an example of a language which has made this choice. It's not a bad idea because not only does it eliminate your ambiguity, more importantly it eliminates ambiguities from programs.
I see that you don't have a compound_statement phrase structure rule in your grammar because your statement generates a phrase enclosed in { and } directly. You will have to hack that in if you take this approach.
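To make the braces-only rule concrete, here is a small recursive descent sketch in Python. It is an illustration under simplifying assumptions, not a drop-in for your bison grammar: tokens are whitespace-separated words, a condition is a single word, and the function names are made up for the example.

```python
def parse_block(tokens, pos):
    """Parse '{' statement* '}' and return (statements, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError(f"expected '{{' at token {pos}")
    pos += 1
    body = []
    while pos < len(tokens) and tokens[pos] != "}":
        stmt, pos = parse_statement(tokens, pos)
        body.append(stmt)
    if pos >= len(tokens):
        raise SyntaxError("missing '}'")
    return body, pos + 1


def parse_statement(tokens, pos):
    """Parse an if statement, or treat any other word as a simple statement."""
    if tokens[pos] != "if":
        return tokens[pos], pos + 1
    if pos + 1 >= len(tokens):
        raise SyntaxError("missing condition")
    cond = tokens[pos + 1]  # assumption: a condition is a single word
    then_body, pos = parse_block(tokens, pos + 2)
    else_body = None
    if pos < len(tokens) and tokens[pos] == "else":
        # The closing brace just consumed says which IF this ELSE belongs to.
        else_body, pos = parse_block(tokens, pos + 1)
    return ("if", cond, then_body, else_body), pos


def parse(src):
    tokens = src.split()
    tree, pos = parse_statement(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return tree
```

With this, `if x { if y { w } else { z } }` and `if x { if y { w } } else { z }` produce different trees, and the unbraced `if x if y w else z` is rejected outright.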
I am trying to write an antlr4 grammar for a customized language which among its lexer rules originally contained the following:
PLUS : '+' ;
MINUS : '-' ;
NUMBER: ('+'|'-')? [0-9]+ ;
COMMENT : '/*' (COMMENT|~'*'|('*' ~'/'))* '*/' -> skip ;
WS : (' ' | '\t' | '\n') -> skip ;
The parser grammar contains, among other things, an arithmetic expression evaluator, and what I found is that using these lexer rules failed to parse the input '2-2' correctly, which should come out as NUMBER MINUS NUMBER, and instead just returned two NUMBER tokens. I therefore broke out the unary + and - applications into their own parser rule, as follows:
literal_number : NUMBER
| '-' NUMBER
| '+' NUMBER ;
And defined NUMBER simply as:
NUMBER: [0-9]+ ;
However, with this arrangement, the literal_number parser rule is being activated even if there is whitespace or comments between the plus or minus token and the number itself. This should not be valid in parser contexts where I am expecting to see only an integer constant (which is actually anywhere other than when parsing arithmetic expressions). I have another parser rule elsewhere in my parser grammar that handles unary negation already, so I do not need to replicate that in the literal_number parser rule anyway. All I want is for the literal_number parser rule to refer only to places in the text where a real integer constant has been found.
How can I do this? I have already looked at questions on stackoverflow pertaining to rules that are sensitive to whitespace, but I have not been able to figure out how to apply any of those solutions to my problem.
I'm not sure that this matters for my question, but my target language is C++, although I expect I may still be able to generalize from a Java-specific example if one is offered.
EDIT:
The response that I've seen so far highlights an issue with my original comment which may have been ambiguous. In my defense I had not wanted to complicate my original question with information that I did not immediately see as relevant, but in light of the response I've seen so far, I can now clearly see that it is. I can only offer my apologies for this initial oversight.
In addition to the literal_number rule, I also have the following rule for expressions, which, in particular, has a rule allowing for negation.
expression : ID # look up value
| literal_number # number
| MINUS expression # negate
| expression (STAR|SLASH) expression # multiply
| expression (PLUS|MINUS) expression # add
;
So to that end, the expression 2-2 should evaluate as literal_number (2) MINUS literal_number (2), 2--2 should evaluate as literal_number (2) MINUS literal_number (-2), while 2-- 2 should evaluate as literal_number (2) MINUS MINUS literal_number (2).
So basically, as I said originally, I only want the literal_number rule to be used at all when the NUMBER is by itself or MINUS and the NUMBER are side by side with no ignored tokens between them, but I cannot just make ('+'|'-') [0-9] a lexical rule for NUMBER without causing the problem I had in the first place.
When parsing a grammar, should RegEx be used to match grammars that can be expressed as regular languages or should the current parser design be used exclusively?
For example, the EBNF grammar for JSON can be expressed as:
object ::= '{' '}' | '{' members '}';
members ::= pair | pair ',' members;
pair ::= string ':' value;
array ::= '[' ']' | '[' elements ']';
elements ::= value | value ',' elements;
value ::= string | number | object | array | 'true' | 'false' | 'null';
So the grammar would need to be matched using some type of parser (such as a recursive descent parser or an ad hoc parser), but the grammar for some of the values (such as the number) can be expressed as a regular language, like this RegEx pattern for number:
-?\d+(\.\d+)?([eE][+-]?\d+)?
Given this example, assuming one is creating a recursive descent JSON parser... should the number be matched via the recursive descent technique or should the number be matched via RegEx since it can be matched easily using RegEx?
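For concreteness, this is roughly what the RegEx option would look like inside a recursive descent parser. It is only a Python sketch: `parse_number` is a hypothetical helper name, and the pattern is the one above used as-is (note that it accepts leading zeros such as 007, which the JSON grammar does not allow).

```python
import re

# The pattern from the question, compiled once and reused.
NUMBER = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")


def parse_number(text, pos):
    """Return (value, next_pos) for the number starting at text[pos]."""
    # match() anchors at pos, so the parser keeps control of the cursor.
    m = NUMBER.match(text, pos)
    if m is None:
        raise ValueError(f"expected a number at offset {pos}")
    return float(m.group()), m.end()
```

The rest of the parser (objects, arrays, values) would stay recursive descent and call this helper only when it sees a digit or a minus sign.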
This is a very broad and opinion-based question. That said, to my knowledge, you will usually want a parser to be as fast as possible and to have the smallest possible memory footprint, especially if it needs to parse in real time (on demand).
A RegEx will surely do the job, but it is like shooting a fly with a nuclear weapon!
This is why many parsers are written in a low-level language like C, to take advantage of string pointers and avoid the overhead of high-level languages like Java (immutable fields, garbage collection, ...).
Meanwhile, this heavily depends on your use case and cannot be truly answered in a generic way. You should consider the tradeoff between the developer's convenience to use RegEx versus the performance of the parser.
One additional consideration is that you will usually want the parser to indicate where you have a syntax error, and which type of error it is. With a RegEx, the input will simply not match, and you will have a hard time finding out why it stopped in order to display a proper error message. With an old-school parser, you can stop parsing as soon as you encounter a syntax error, and you know precisely what did not match and where.
In your specific case for JSON parsing and using RegEx only for numbers, I suppose you are probably using a high-level language already, so what many implementations do is to rely on the language's native parsing for numbers. So just pick the value (string, number,...) using the delimiters and let the programming language throw an exception for number parsing.
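A rough Python sketch of that approach, assuming the delimiter set below is enough for your inputs (`scan_number` and `DELIMITERS` are names made up for the example):

```python
# Characters that can legally follow a JSON value (assumed set for this sketch).
DELIMITERS = ",]} \t\r\n"


def scan_number(text, pos):
    """Slice up to the next delimiter and let float() validate the lexeme."""
    end = pos
    while end < len(text) and text[end] not in DELIMITERS:
        end += 1
    lexeme = text[pos:end]
    try:
        return float(lexeme), end
    except ValueError:
        # Re-raise with the offset so the caller can report where it failed.
        raise ValueError(f"invalid number {lexeme!r} at offset {pos}") from None
```

Be aware that the language's native conversion may be more permissive than JSON: Python's `float()` also accepts strings like "inf", "nan", and "1_000", so a strict parser would need an extra check.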
I believe everyone who reads it is familiar with the dangling else ambiguity. Therefore I will skip the explanation.
I found in a compilers book (the Dragon book) an unambiguous grammar that represents IF and ELSE. Here it is.
stmt->matched_stmt | open_stmt
matched_stmt->if exp then matched_stmt else matched_stmt | other
open_stmt->if exp then stmt | if exp then matched_stmt else open_stmt
The problem is:
open_stmt->if exp then stmt | if exp then matched_stmt else open_stmt
In order to turn the grammar I am working on into an LL(1) grammar, I need to eliminate the common left prefix, and in this case the common left prefix is:
if exp then
When I try to left-factor it, it becomes ambiguous again. This is what I tried:
stmt->matched_stmt | open_stmt
matched_stmt->if exp then matched_stmt else matched_stmt | other
open_stmt->if exp then O'
O'->stmt | matched_stmt else open_stmt
This is ambiguous. Can anyone help me make it unambiguous and with no common left prefix? Thank you very much.
I don't believe it is possible to write an LL(1) grammar which will handle dangling else, although it is trivial to write a recursive descent parser which does so.
In the recursive descent parser, when you've done parsing the first statement after the expression, if you see an else you continue with the production. Otherwise, you've finished parsing the if statement. In other words, you simply parse the else clauses greedily.
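A minimal Python sketch of that greedy strategy, under simplifying assumptions (whitespace-separated tokens, single-word conditions, and any word other than `if` counting as a simple statement; `parse_stmt` is just an illustrative name):

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        # Any other word is treated as a simple statement.
        return tokens[pos], pos + 1
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    cond = tokens[pos + 1]  # assumption: a condition is a single word
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # Greedy step: an 'else' seen here is taken by this, the innermost, if.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch, None), pos
```

For `if a then if b then s1 else s2` the inner call consumes the else before the outer call ever looks for one, so the else attaches to `if b`.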
But that algorithm cannot be expressed as a CFG, and I've always assumed that it is impossible to write an unambiguous LL(1) CFG which handles dangling elses, because when you reach the beginning of S1 in
if E then S1 ...
you still don't know which production it is part of. In fact, you don't know until you reach the end of S1, which is certainly far too late to make an LL(k) decision, no matter how big k is.
However, that explanation has a lot of hand-waving, and I've never found it completely satisfying. So I was gratified to pick up my battered copy of the Dragon book (1986 edition) and read, on page 192 ("LL(1) grammars" in section 4.4, "Top-down grammars") that grammar 4.13 (the if-then-optional-else grammar) "has no LL(1) grammar at all".
The following paragraph ends with sound advice: "if an LR parser generator… is available, one can get all of the benefits of predictive parsing and operator precedence automatically." My marginal note (from about 1986, I guess) reads "So why did I just study this whole chapter????"; today, I'd be inclined to be more generous with the authors of the Dragon book but not to the point of suggesting that anyone actually use a parser generator which is not at least as powerful as an LALR(1) parser generator.
I am trying to write ocamllex and ocamlyacc code to scan and parse a simple language. I have defined the abstract syntax for it, but I am having difficulty scanning the more complex rules. Here's my code:
{
type exp = B of bool | Const of float | Iszero of exp | Diff of exp*exp |
If of exp * exp * exp
}
rule scanparse = parse
|"true"| "false" as boolean {B boolean}
|['0'-'9']+ "." ['0'-'9']* as num {Const num}
|"iszero" space+ ['a'-'z']+ {??}
|'-' space+ '(' space* ['a'-'z']+ space* ',' space* ['a'-'z']+ space* ')' {??}
But I am not able to access certain portions of the matched string. Since the expression declaration is recursive, nested functions aren't helping either(?). Please help.
To elaborate slightly on my comment above, it looks to me like you're trying to use ocamllex to do what ocamlyacc is for. I think you need to define very simple tokens in ocamllex (like booleans, numbers, and variable names), then use ocamlyacc to define how they go together to make things like Iszero, Diff, and If. ocamllex isn't powerful enough to parse the structures defined by your abstract syntax.
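To show the division of labour without committing to ocamlyacc syntax, here is a language-neutral sketch in Python. The concrete syntax is an assumption read off your lexer patterns (`iszero e` and `-(e, e)`), tuples stand in for your exp constructors, and all function names are made up. If is left out for brevity; it would be one more branch in the parser.

```python
import re

# Lexer stage: only flat tokens (numbers, words, punctuation).
TOKEN = re.compile(r"\s*(?:(\d+\.\d*)|([a-z]+)|([-(),]))")


def tokenize(src):
    """Return a flat list of (kind, text) tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"bad character at offset {pos}")
        num, word, punct = m.groups()
        if num is not None:
            tokens.append(("NUM", num))
        elif word is not None:
            tokens.append(("WORD", word))
        else:
            tokens.append(("PUNCT", punct))
        pos = m.end()
    return tokens


def expect(tokens, pos, text):
    """Check that tokens[pos] is the given text and step past it."""
    if pos >= len(tokens) or tokens[pos][1] != text:
        raise SyntaxError(f"expected {text!r}")
    return pos + 1


def parse_exp(tokens, pos=0):
    """Parser stage: recursion builds the nested expression tree."""
    kind, text = tokens[pos]
    if kind == "NUM":
        return ("Const", float(text)), pos + 1
    if kind == "WORD" and text in ("true", "false"):
        return ("B", text == "true"), pos + 1
    if kind == "WORD" and text == "iszero":
        arg, pos = parse_exp(tokens, pos + 1)
        return ("Iszero", arg), pos
    if kind == "PUNCT" and text == "-":
        pos = expect(tokens, pos + 1, "(")
        left, pos = parse_exp(tokens, pos)
        pos = expect(tokens, pos, ",")
        right, pos = parse_exp(tokens, pos)
        pos = expect(tokens, pos, ")")
        return ("Diff", left, right), pos
    raise SyntaxError(f"unexpected token {text!r}")
```

The lexer never needs to look inside a matched string; nesting such as `iszero -(3.5, 1.)` is handled entirely by the recursive calls in the parser, which is the job ocamlyacc would do for you.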
Update
Here is an ocamlyacc tutorial that I found linked from OCaml.org, which is a pretty good endorsement: OCamlYacc tutorial. I looked through it and it looks good. (When I started using ocamlyacc, I already knew yacc so I was able to get going pretty quickly.)
How can I force a shift/reduce conflict to be resolved by the GLR method?
Suppose I want the parser to resolve, by itself, the conflict between the right shift operator and two closing angle brackets of template arguments. I make the lexer pass two consecutive ">" symbols as separate tokens, without merging them into a single ">>" token. Then I add these rules to the grammar:
operator_name:
"operator" ">"
| "operator" ">" ">"
;
I want this to be a shift/reduce conflict. If I have the token declaration for ">" with left associativity, this will not be a conflict. So I have to remove the token precedence/associativity declaration, but this results in many other conflicts that I don't want to solve manually by specifying the contextual precedence for each conflicting rule. So, is there a way to force the shift/reduce conflict while keeping the token declared?
I believe that using context-dependent precedence on the rules for operator_name will work.
The C++ grammar as specified by the updated standard actually modifies the grammar to accept the >> token as closing two open template declarations. I'd recommend following it to get standard behaviour. For example, you must be careful that "x > > y" is not parsed as "x >> y", and you must also ensure that "foo<bar<2 >> 1>>" is invalid, while "foo<bar<(2 >> 1)>>" is valid.
I worked in Yacc (similar to Bison), with a similar scenario.
Standard grammars are sometimes called "parsing directed by syntax".
This case is sometimes called something like "parsing directed by semantics".
Example:
...
// shift operator example
if ((x >> 2) == 0)
...
// consecutive template closing tag example
List<String, List<String>> MyList =
...
Let's remember that our mind works like a compiler. The human mind can compile this, but the previous grammars can't. Let's see how a human mind would compile this code.
As you already know, the "x" before the two consecutive ">" tokens indicates an expression or lvalue. The mind thinks: "two consecutive greater-than symbols after an expression should become a single shift operator token".
And for the "String" token: "two consecutive greater-than symbols after a type identifier should become two consecutive template closing tag tokens".
I think this case cannot be handled by the usual operator precedence, shift or reduce, or grammar rules alone, but by using ("hacking") some functions provided by the parser itself.
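As a rough illustration of that idea (not Bison or Yacc code), here is a Python sketch of a token filter that decides what each ">" means from context. It rests on loud assumptions: the set of template names is known in advance (a real C++ front end would need symbol table information), parentheses are the only construct that suspends an open template argument list, and `split_angle_tokens` is a made-up name. Two ">" are merged into ">>" only when they are adjacent in the source and no template argument list is open.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|==|\S")


def split_angle_tokens(src, template_names):
    """Tokenize src, merging adjacent '>' '>' into '>>' outside templates."""
    out, saved = [], []
    depth = 0                   # template argument lists currently open
    prev, prev_end = None, -1   # last mergeable token and its end offset
    for m in TOKEN.finditer(src):
        text = m.group()
        if text == "(":
            # Inside parentheses '>' is an operator again.
            saved.append(depth)
            depth = 0
        elif text == ")" and saved:
            depth = saved.pop()
        elif text == ">" and depth > 0:
            # After a type: this '>' closes a template argument list.
            depth -= 1
            out.append(text)
            prev, prev_end = None, m.end()  # a closer never merges
            continue
        elif text == ">" and prev == ">" and m.start() == prev_end:
            # After an expression: two touching '>' form a shift operator.
            out[-1] = ">>"
            prev, prev_end = ">>", m.end()
            continue
        elif text == "<" and prev in template_names:
            depth += 1
        out.append(text)
        prev, prev_end = text, m.end()
    return out
```

On the two examples above, `x >> 2` yields a single ">>" token while `List<String, List<String>>` keeps two separate closing tokens; `x > > y` is left as two ">" tokens because they are not adjacent.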
I don't see an error in your example grammar rule. The "operator" symbol avoids confusing the two cases you mention. The parts that should concern you are the grammar rules where the shift operator is used and where the consecutive template closing tags are used.
operator_expr_example:
    lvalue "<<" lvalue |
    lvalue ">>" lvalue |
    lvalue "&&" lvalue
;
template_params:
    identifier |
    template_declaration_example |
    array_declaration |
    other_type_declaration
;
template_declaration_example:
    identifier "<" template_params ">"
;
Cheers.