is antlr parser greedy? - regex

I don't understand why this antlr4 grammar
grammar antmath1;
expr
: '(' expr ')' # parensExpr
| op=('+'|'-') expr # unaryExpr
| left=expr op=('*'|'/') right=expr # infixExpr
| left=expr op=('+'|'-') right=expr # infixExpr
| value=NUM # numberExpr
;
NUM : [0-9]+;
WS : [ \t\r\n] -> channel(HIDDEN);
works properly:
antlr tree produced by -(5+9)+1000; result=986
but why this one:
grammar antmath;
expr
: '(' expr ')' # parensExpr
| left=expr op=('*'|'/') right=expr # infixExpr
| left=expr op=('+'|'-') right=expr # infixExpr
| op=('+'|'-') expr # unaryExpr
| value=NUM # numberExpr
;
NUM : [0-9]+;
WS : [ \t\r\n] -> channel(HIDDEN);
fails:
antlr tree produced by the same expression; result=-1014
I expect the first grammar1 (which outputs correct result) to produce the same result as grammar2 (wrong output). The reasoning behind this: the only rule that admits '-' as first token is #unaryExpr so parser generated by any of the grammars would try to match that rule first. Then, provided the parser is greedy (for any of the two grammars), I would expect it to take the "(5+9)+1000" as a whole and match it with expr which it does because it is a valid expr.
where's the fault in my reasoning?

the grammars would try to match that rule first
It does. However, you've made unary minus have lower precedence than binary plus.
That means that the expression is being interpreted as -((5+9)+1000) instead of (-(5+9))+1000.

Related

Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line myInt = 10 + 5; line produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parenthesis.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rules for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become a ADD_SUB token because T__0 will always match it first. That is simply how the lexer operates: try to match as much characters as possible for every lexer rule, and when 2 (or more) match the same amount of characters, let the one defined first "win".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';

How to remove ambiguity in EBNF Instaparse grammar

How can i prevent that the "," literal in the structure rule is parsed as a operator in the following EBNF grammar for Instaparse?
Grammar:
structure = atom <"("> term ("," term)* <")">
term = atom | number | structure | variable | "(" term ")" | term operator term
operator = "," | ";" | "\\=" | "=="
Using the comma as a separator and as an operator like you do makes comma context sensitive which Ebnf on its own can't deal with.

Changing how Antlr4 prints a tree

I am using Antlr4 in IntelliJ to make a small compiler for arithmetic expressions.
I want to print the tree and use this code snippet to do so.
JFrame frame = new JFrame("Tree");
JPanel panel = new JPanel();
TreeViewer viewr = new TreeViewer(Arrays.asList(
parser.getRuleNames()),tree);
viewr.setScale(2);//scale a little
panel.add(viewr);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(400,400);
frame.setVisible(true);
This makes a tree which looks like this, for the input 3*5\n
Is there a way to adjust this so it reads from top to bottom
Statement
Expression /n
INT * INT
3 5
instead?
My grammar is defined as:
grammar Expression;
statement: expression ENDSTATEMENT # printExpr
| ID '=' expression ENDSTATEMENT # assign
| ENDSTATEMENT # blank
;
expression: expression MULTDIV expression # MulDiv
| expression ADDSUB expression # AddSub
| INT # int
| FLOAT # float
| ID # id
| '(' expression ')' # parens
;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
MULTDIV : ('*' | '/'); //match multiply or divide
ADDSUB : ('+' | '-'); //match add or subtract
FLOAT: INT '.' INT; //match a floating point number
ENDSTATEMENT:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WHITESPACE : [ \t]+ -> skip ; // ignore whitespace
No, not without changing the source of the tree viewer yourself.

How to define unrecognized rules in Ocamlyacc

I'm working on company projet, where i have to create a compilator for a language using Ocamlyacc and Ocamllex. I want to know if is it possible to define a rule in my Ocamlyacc Parser that can tell me that no rules of my grammar matching the syntax of an input.
I have to insist that i'am a beginner in Ocamllex/Ocamlyacc
Thank you a lot for your help.
If no rule in your grammar matches the input, then Parsing.Parse_error exception is raised. Usually, this is what you want.
There is also a special token called error that allows you to resynchronize your parser state. You can use it in your rules, as it was a real token produced by a lexer, cf., eof token.
Also, I would suggest to use menhir instead of more venerable ocamlyacc. It is easier to use and debug, and it also comes with a good library of predefined grammars.
When you write a compiler for a language, the first step is to run your lexer and to check if your program is good from a lexical point of view.
See the below example :
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
It's a lexer to recognize some arithmetic expressions.
If your lexer accepts the input then you give the sequence of lexemes to the parser which try to find if a AST can be build with the specified grammar. See :
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
This is a little program to parse arithmetic expression. A program can be rejected at this step because there is no rule of the grammar to apply in order to have an AST at the end. There is no way to define unrecognized rules but you need to write a grammar which define how a program can be accepted or rejected.
let _ =
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Parser.main Lexer.token lexbuf in
print_int result; print_newline(); flush stdout
done
with Lexer.Eof ->
exit 0
If your compile the lexer, the parser and the last program, you have :
1 + 2 is accepted because there is no error lexical errors and an AST can be build corresponding to this expression.
1 ++ 2 is rejected : no lexical errors but there is no rule to build a such AST.
You can found more documentation here : http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html

How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,
I recently started to use ANTLR4 to translate regular expression from XSD / xml to cvc4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
The lexer does not "listen" to the parser: no matter if the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as much characters as possible, and in case of a tie, it chooses the rule defined first.
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;