ANTLR4 problems with negative sign and operator - regex

Hello, we have this ANTLR4 tree parser:
grammar calc;
calculator: (d)*;
c
: c '*' c
| c '/' c
| c '+' c
| c '-' c
| '(' c ')'
| '-'?
| ID
;
d: ID '=' c;
NBR: [0-9]+;
ID: [a-zA-Z][a-zA-Z0-9]*;
WS: [ \t\r\n]+ -> skip;
The problem is that when I use a -, ANTLR4 doesn't recognize whether it is a sign or an operator for special inputs like (-2-4)*4. For inputs like this, ANTLR4 doesn't understand that the - before the 2 belongs to the constant 2 and is not an operator.

Just do something like this:
c
: '-' c
| c ('*' | '/') c
| c ('+' | '-') c
| '(' c ')'
| ID
| NBR
;
That way all these will be OK:
-1
- 2
-3-4
5+-6
-(7*8)
(-2-4)*4
For example, (-3-10)*10 is parsed like this: [parse tree image]
EDIT
This is what happens when I parse 9+38*(19+489*243/1)*1+3: [parse tree image]
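The precedence implied by that ordering of alternatives can be checked by hand with a small evaluator. This is a minimal Python sketch, not ANTLR output: the names `tokenize` and `evaluate` are made up for illustration, and it assumes integer operands only (no `ID`). It mirrors the rule above: unary minus binds tightest, then `*` and `/`, then `+` and `-`.

```python
import re

# A number or a single-character operator, optionally preceded by whitespace.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens, skipping whitespace."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(text):
    """Evaluate an arithmetic expression where '-' may be a sign or an operator."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def additive():          # lowest precedence: binary + and -
        value = multiplicative()
        while peek() in ("+", "-"):
            if take() == "+":
                value += multiplicative()
            else:
                value -= multiplicative()
        return value

    def multiplicative():    # binary * and /
        value = unary()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= unary()
            else:
                value /= unary()
        return value

    def unary():             # highest precedence: the sign
        if peek() == "-":
            take()
            return -unary()
        return primary()

    def primary():
        token = take()
        if token == "(":
            value = additive()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

    result = additive()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

for source in ["-1", "- 2", "-3-4", "5+-6", "-(7*8)", "(-2-4)*4"]:
    print(source, "=", evaluate(source))
```

Because the sign is handled in its own function rather than in the tokenizer, the first - in (-2-4)*4 is read as a sign and the second as subtraction.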

| '-'?
should be:
| '-'? NBR
You need to specify that it's an NBR that may (or may not) be preceded by a -.

Related

Antlr4 tree parser I have problems with numbers like (9-34)*9

I have created a tree with ANTLR4. The tree performs the calculation steps for +, -, * and /. The tree works perfectly for positive numbers; my problem is that for negative numbers, such as when I enter (9-40)*10, it doesn't work anymore.
My code is:
calculator:
(d)*| ;
c:
c '*' c | c '/' c | c '+' c | c '-' c | '(' c ')' | b | a ;
d:
a '=' c ;
b : '-'?[0-9]+;
a: [a-zA-Z][a-zA-Z0-9]*;
WS : [ \t\r\n]+ -> skip ;
(Please be careful to verify your source before you post. The grammar, as posted, is not valid ANTLR, probably because you changed rule names to a, b, etc. and forgot to keep Lexer rule names uppercase.)
If I assume that to be the case, here's a modified grammar to address your problem.
grammar calc;
calculator: (d)*;
c
: c '*' c
| c '/' c
| c '+' c
| c '-' c
| '(' c ')'
| '-'? NBR // <-- The fix
| ID
;
d: ID '=' c;
NBR: [0-9]+; // <-- removed the '-'?
ID: [a-zA-Z][a-zA-Z0-9]*;
WS: [ \t\r\n]+ -> skip;
The source of the problem is that, by putting the - in the Lexer rule, ANTLR will match "-40" as a single NBR token, since that is a longer match than the "-" operator token on its own. The solution is to handle the negation as an alternative in your parser rule for c (aka expression).
It's always a good idea to run ANTLR's TestRig (aka grun) with the -tokens option to get a dump of your token stream, to ensure that your Lexer rules are giving the parser the token stream it needs.
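The longest-match behaviour can be reproduced without ANTLR. The following Python sketch is an illustration only: `tokenize` and the rule tables are hypothetical names, and the regular expressions stand in for the Lexer rules of the grammar above. It prints the token stream for the same input with the sign inside NBR and with the sign left to the parser.

```python
import re

def tokenize(text, rules):
    """Maximal-munch lexer: at each position the longest match wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

COMMON = [("LPAREN", r"\("), ("RPAREN", r"\)"), ("STAR", r"\*"),
          ("MINUS", r"-"), ("WS", r"[ \t\r\n]+")]
SIGNED_NBR = COMMON + [("NBR", r"-?[0-9]+")]   # sign inside the lexer rule
UNSIGNED_NBR = COMMON + [("NBR", r"[0-9]+")]   # sign left to the parser

print(tokenize("(9-40)*10", SIGNED_NBR))
print(tokenize("(9-40)*10", UNSIGNED_NBR))
```

With the signed rule the stream contains NBR "9" followed directly by NBR "-40", so the parser never sees a subtraction operator between them. With the unsigned rule the - arrives as its own token.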

Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line, myInt = 10 + 5;, produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parentheses.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rule for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become an ADD_SUB token because T__0 will always match it first. That is simply how the lexer operates: try to match as many characters as possible for every lexer rule, and when 2 (or more) rules match the same number of characters, let the one defined first "win".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';
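The "first rule wins on a tie" behaviour can be simulated in a few lines. This is a rough Python sketch under stated assumptions, not ANTLR itself: `token_types` and the rule tables are invented for the example, NUMBER is a stand-in for whatever numeric rule the full grammar has, and T__0 is placed first to imitate the implicit rule ANTLR generates for the literal '+'.

```python
import re

def token_types(text, rules):
    """Return the token type names a longest-match, first-rule-wins lexer emits."""
    types, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > pos:
                candidates.append((name, match.end()))
        if not candidates:
            raise SyntaxError(f"no rule matches at position {pos}")
        # max() returns the first maximal item, so on a tie the earlier rule wins.
        name, end = max(candidates, key=lambda item: item[1])
        if name != "WS":
            types.append(name)
        pos = end
    return types

SHARED = [("NUMBER", r"[0-9]+"), ("WS", r"[ \t\r\n]+")]
# Implicit literal rule first, then the overlapping ADD_SUB rule.
WITH_LITERAL = [("T__0", r"\+"), ("ADD_SUB", r"[+-]")] + SHARED
# One non-overlapping rule per operator, as suggested above.
SEPARATE = [("ADD", r"\+"), ("SUB", r"-")] + SHARED

print(token_types("10 + 5 - 3", WITH_LITERAL))
print(token_types("10 + 5 - 3", SEPARATE))
```

In the first table + comes out as T__0 and only - is ever an ADD_SUB, which is why a parser rule written in terms of ADD_SUB rejects 10 + 5. In the second table every operator has exactly one possible token type.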

How to remove ambiguity in EBNF Instaparse grammar

How can I prevent the "," literal in the structure rule from being parsed as an operator in the following EBNF grammar for Instaparse?
Grammar:
structure = atom <"("> term ("," term)* <")">
term = atom | number | structure | variable | "(" term ")" | term operator term
operator = "," | ";" | "\\=" | "=="
Using the comma both as a separator and as an operator, like you do, makes the comma context-sensitive, which EBNF on its own can't deal with.

Semantics of identifier line in Python

What are the semantics of a Python 2.7 line containing ONLY an identifier, i.e. simply
a
or
something
?
If you know the exact place in the Reference, I'd be very pleased.
Thanks.
An identifier by itself is a valid expression. An expression by itself on a line is a valid statement.
The full semantic chain is a little more involved. In order to have nice operator precedence, we classify things like "a and b" as technically both an and_test and an or_test. As a result, a simple identifier technically qualifies as over a dozen grammar items:
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
('=' (yield_expr|testlist_star_expr))*)
testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']
test: or_test ['if' or_test 'else' test] | lambdef
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
power: atom trailer* ['**' factor]
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
A stmt can be composed of a single simple_stmt, which can be composed of a single small_stmt, which can be composed of a single expr_stmt, and so on, down through testlist_star_expr, test, or_test, and_test, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom, and finally NAME.
It's a simple expression statement: https://docs.python.org/2/reference/simple_stmts.html
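One way to see this for yourself is to ask Python's own parser. The sketch below uses the standard `ast` module and was written for Python 3; as far as I know the same Expr/Name structure applies in 2.7, but treat that as an assumption. The helper `run_bare_name` is a made-up name for illustration.

```python
import ast

# A line holding only an identifier parses as an expression statement
# (ast.Expr) whose value is a name lookup (ast.Name).
tree = ast.parse("a\n")
statement = tree.body[0]
print(type(statement).__name__, type(statement.value).__name__)  # Expr Name

def run_bare_name(source, namespace):
    """Execute the line and report whether the name lookup succeeded."""
    try:
        exec(source, namespace)
        return "ok"
    except NameError:
        return "NameError"

print(run_bare_name("a", {"a": 1}))  # name is bound: evaluated, result discarded
print(run_bare_name("a", {}))        # name is unbound: the lookup fails
```

So the line is not a no-op in every case: the name is looked up and its value thrown away, and an unbound name raises NameError.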

Context-free-grammar to represent regular expressions

I'm trying to write a context-free grammar to represent simple regular expressions. The symbols I want are [0-9][a-z][A-Z], and the operators are "|", "()" and "." for concatenation. For repetition I only want "*" for now; later I will add "+", "?", etc. I tried this grammar in JavaCC:
void RE(): {}
{
FINAL(0) ( "." FINAL(0) | "|" FINAL(0))*
}
void FINAL(int sign): { Token t; }
{
t = <SYMBOL> {
if ( sign == 1 )
jjtThis.val = t.image + "*";
else
jjtThis.val = t.image;
}
| FINAL(1) "*"
| "(" RE() ")"
}
The problem is in the FINAL function: the line | FINAL(1) "*" gives me an error, Left recursion detected: "FINAL... --> FINAL...". Putting "*" to the left of FINAL(1) resolves the problem, but this is not what I want.
I already tried reading the Wikipedia article on removing left recursion, but I really don't know how to do it. Can someone help? :s
The following takes care of the left recursion
RE --> FINAL ("." FINAL | "|" FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
However, that won't give . precedence over | . For that you can do the following
RE --> TERM ("|" TERM)*
TERM --> FINAL ("." FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
The general rule is
A --> A b | c | d | ...
can be transformed to
A --> B b*
B --> c | d | ...
where B is a new nonterminal.
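To illustrate how the transformed grammar maps onto code, here is a small recursive-descent parser in Python rather than JavaCC. It is a sketch under assumptions: `parse_regex` and the tuple-based tree shape are invented for the example, and a symbol is taken to be a single ASCII letter or digit. Each nonterminal (RE, TERM, FINAL, PRIMARY) becomes one function, and each `( ... )*` becomes a loop, so no function calls itself before consuming input.

```python
import string

SYMBOLS = string.ascii_letters + string.digits

def parse_regex(text):
    """Parse a simple regular expression into nested tuples."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_re():        # RE --> TERM ("|" TERM)*
        node = parse_term()
        while peek() == "|":
            advance()
            node = ("|", node, parse_term())
        return node

    def parse_term():      # TERM --> FINAL ("." FINAL)*
        node = parse_final()
        while peek() == ".":
            advance()
            node = (".", node, parse_final())
        return node

    def parse_final():     # FINAL --> PRIMARY ("*")*
        node = parse_primary()
        while peek() == "*":
            advance()
            node = ("*", node)
        return node

    def parse_primary():   # PRIMARY --> <SYMBOL> | "(" RE ")"
        char = peek()
        if char == "(":
            advance()
            node = parse_re()
            if peek() != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            advance()
            return node
        if char is not None and char in SYMBOLS:
            advance()
            return char
        raise SyntaxError(f"unexpected input at position {pos}")

    tree = parse_re()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree

print(parse_regex("a.b|c"))
print(parse_regex("a|b.c*"))
print(parse_regex("(a|b)*"))
```

The nesting of the tuples shows the precedence: `*` binds tighter than `.`, which binds tighter than `|`.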