How can i prevent that the "," literal in the structure rule is parsed as a operator in the following EBNF grammar for Instaparse?
Grammar:
structure = atom <"("> term ("," term)* <")">
term = atom | number | structure | variable | "(" term ")" | term operator term
operator = "," | ";" | "\\=" | "=="
Using the comma as a separator and as an operator like you do makes comma context sensitive which Ebnf on its own can't deal with.
Related
I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line myInt = 10 + 5; line produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parenthesis.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rules for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become a ADD_SUB token because T__0 will always match it first. That is simply how the lexer operates: try to match as much characters as possible for every lexer rule, and when 2 (or more) match the same amount of characters, let the one defined first "win".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';
I don't understand why this antlr4 grammar
grammar antmath1;
expr
: '(' expr ')' # parensExpr
| op=('+'|'-') expr # unaryExpr
| left=expr op=('*'|'/') right=expr # infixExpr
| left=expr op=('+'|'-') right=expr # infixExpr
| value=NUM # numberExpr
;
NUM : [0-9]+;
WS : [ \t\r\n] -> channel(HIDDEN);
works properly:
antlr tree produced by -(5+9)+1000; result=986
but why this one:
grammar antmath;
expr
: '(' expr ')' # parensExpr
| left=expr op=('*'|'/') right=expr # infixExpr
| left=expr op=('+'|'-') right=expr # infixExpr
| op=('+'|'-') expr # unaryExpr
| value=NUM # numberExpr
;
NUM : [0-9]+;
WS : [ \t\r\n] -> channel(HIDDEN);
fails:
antlr tree produced by the same expression; result=-1014
I expect the first grammar1 (which outputs correct result) to produce the same result as grammar2 (wrong output). The reasoning behind this: the only rule that admits '-' as first token is #unaryExpr so parser generated by any of the grammars would try to match that rule first. Then, provided the parser is greedy (for any of the two grammars), I would expect it to take the "(5+9)+1000" as a whole and match it with expr which it does because it is a valid expr.
where's the fault in my reasoning?
the grammars would try to match that rule first
It does. However, you've made unary minus have lower precedence than binary plus.
That means that the expression is being interpreted as -((5+9)+1000) instead of (-(5+9))+1000.
I've tried to write a basic syntax checker using bisonc++
The rules are:
expression -> OPEN_BRACKET expression CLOSE_BRACKET
expression -> expression operator expression
operator -> PLUS
operator -> MINUS
If I try to run the compiled code, I get an error at this line:
(a+b)-(c+d)
The first rule is applied, the leftmost and the rightmost brackets are the OPEN_BRACKET and the CLOSE_BRACKET. The remaining expression is: a+b)-(c+d
How is it possible to prevent this behaviour? Is it possible to count the open and closed brackets?
Edit
The expression grammar:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
| operator
{
//
}
| VARIABLE
{
//
}
;
operator:
expression PLUS expression
{
//
}
| expression MINUS expression
{
//
}
;
Edit2
The lexer
CHAR [a-z]
WS [ \t\n]
%%
{CHAR}+ return Parser::VARIABLE;
"+" return Parser::PLUS;
"-" return Parser::MINUS;
"(" return Parser::OPEN_BRACKET;
")" return Parser::CLOSE_BRACKET;
This is not a normal expression grammar. Try the normal one.
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor // unary minus
| primary '^' factor // exponentiation, right-associative
;
primary
: identifier
| literal
| '(' expression ')'
;
Note also the above method of indenting and aligning, and that you only have to return yytext[0] from the lexer for single special characters: you don't need special token names, and it's more readable without them:
CHAR [a-zA-Z]
DIGIT [0-9]
WHITESPACE [ \t\r\n]
%%
{CHAR}+ { return Parser::VARIABLE; }
{DIGIT}+ { return Parser::LITERAL; }
{WHITESPACE}+ ;
. { return yytext[0]; }
Your operator rule does not look good.
Try experiment with:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
|
expression operator expression
{
//
}
|
VARIABLE
{
//
}
;
operator:
PLUS
{
//
}
|
MINUS
{
//
}
;
As your pseudo code actually suggests...
What is the semantics of a Python 2.7 line containing ONLY identifier. I.e. simply
a
or
something
?
If you know the exact place in the Reference, I'd be very pleased.
Tnx.
An identifier by itself is a valid expression. An expression by itself on a line is a valid statement.
The full semantic chain is a little more involved. In order to have nice operator precedence, we classify things like "a and b" as technically both an and_test and an or_test. As a result, a simple identifier technically qualifies as over a dozen grammar items
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
('=' (yield_expr|testlist_star_expr))*)
testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']
test: or_test ['if' or_test 'else' test] | lambdef
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
power: atom trailer* ['**' factor]
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
a stmt can be composed of a single simple_stmt, which can be composed of a simgle small_stmt, which can be composed of a single expr_stmt, and so on, down through testlist_star_expr, test, or_test, and_test, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom, and finally NAME.
It's a simple expression statement: https://docs.python.org/2/reference/simple_stmts.html
I'm trying to make a context-free-grammar to represent simple regular expressions. The symbols that I want is [0-9][a-z][A-Z], and operators is "|", "()" and "." for concatenation, and for sequences for now I only want "*" later I will add "+","?", etc. I tried this grammar in javacc:
void RE(): {}
{
FINAL(0) ( "." FINAL(0) | "|" FINAL(0))*
}
void FINAL(int sign): { Token t; }
{
t = <SYMBOL> {
if ( sign == 1 )
jjtThis.val = t.image + "*";
else
jjtThis.val = t.image;
}
| FINAL(1) "*"
| "(" RE() ")"
}
The problem is in FINAL function the line | FINAL(1) "*" that gives me a error Left recursion detected: "FINAL... --> FINAL.... Putting "*" on the left of FINAL(1) resolve the problem but this is not what I want..
I already tried to read the article from wikipedia to remove left recursion but I really don't know how to do it, can someone help? :s
The following takes care of the left recursion
RE --> FACTOR ("." FINAL | "|" FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
However, that won't give . precedence over | . For that you can do the following
RE --> TERM ("|" TERM)*
TERM --> FINAL ("." FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
The general rule is
A --> A b | c | d | ...
can be transformed to
A --> B b*
B --> c | d | ...
where B is a new nonnterminal.