ANTLR4 tree parser: problems with numbers like (9-34)*9

I have created a tree parser with ANTLR4. It performs the calculation steps for +, -, * and /. It works perfectly for positive numbers; my problem is with negative numbers: when I enter something like (9-40)*10, it doesn't work anymore.
My code is:
calculator:
(d)*| ;
c:
c '*' c | c '/' c | c '+' c | c '-' c | '(' c ')' | b | a ;
d:
a '=' c ;
b : '-'?[0-9]+;
a: [a-zA-Z][a-zA-Z0-9]*;
WS : [ \t\r\n]+ -> skip ;

(Please verify your source before you post. The grammar, as posted, is not valid ANTLR, probably because you renamed the rules to a, b, etc. and forgot to keep the lexer rule names uppercase.)
Assuming that is the case, here's a modified grammar that addresses your problem.
grammar calc;
calculator: (d)*;
c
: c '*' c
| c '/' c
| c '+' c
| c '-' c
| '(' c ')'
| '-'? NBR // <-- The fix
| ID
;
d: ID '=' c;
NBR: [0-9]+; // <-- removed the '-'?
ID: [a-zA-Z][a-zA-Z0-9]*;
WS: [ \t\r\n]+ -> skip;
The source of the problem is that, by putting the - in the lexer rule, ANTLR will match "-40" as a single NBR token, since that is a longer match than the one-character '-' operator token. The parser then sees two NBR tokens in a row with no operator between them. The solution is to handle the negation as an alternative in your parser rule for c (aka expression).
It's always a good idea to run the ANTLR TestRig (aka grun) with the -tokens option to get a dump of your token stream, to ensure that your lexer rules are giving the parser the token stream it needs.
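To see the effect without running ANTLR, the longest-match rule can be imitated in a few lines of Python. This is only a rough sketch: the `tokenize` helper, the rule tables and the catch-all `OP` rule are hypothetical stand-ins for illustration, not ANTLR's implementation.

```python
import re

def tokenize(text, rules):
    """Imitate a maximal-munch lexer: at each position the longest match wins,
    and on a tie the rule listed first wins. Whitespace is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_lexeme = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if m and len(m.group()) > len(best_lexeme):
                best_name, best_lexeme = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens

OPERATORS = ("OP", r"[-+*/()=]")              # hypothetical name for the literal tokens
broken = [OPERATORS, ("NBR", r"-?[0-9]+")]    # sign inside the lexer rule
fixed = [OPERATORS, ("NBR", r"[0-9]+")]       # sign left to the parser

print(tokenize("(9-40)*10", broken))
print(tokenize("(9-40)*10", fixed))
```

With the `broken` rules the `-` disappears into the second number token, so no subtraction operator ever reaches the parser; with the `fixed` rules the `-` arrives as its own token.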

Related

Antlr4 problems with negative sign and operator

Hello we have this antlr4 Tree Parser:
grammar calc;
calculator: (d)*;
c
: c '*' c
| c '/' c
| c '+' c
| c '-' c
| '(' c ')'
| '-'?
| ID
;
d: ID '=' c;
NBR: [0-9]+;
ID: [a-zA-Z][a-zA-Z0-9]*;
WS: [ \t\r\n]+ -> skip;
The problem is that when I use a -, ANTLR4 doesn't recognize whether it is a sign or an operator for special inputs like (-2-4)*4. For inputs like this, ANTLR4 doesn't understand that the - before the 2 belongs to the constant 2 and is not an operator.
Just do something like this:
c
: '-' c
| c ('*' | '/') c
| c ('+' | '-') c
| '(' c ')'
| ID
| NBR
;
That way all these will be OK:
-1
- 2
-3-4
5+-6
-(7*8)
(-2-4)*4
For example, (-3-10)*10 is parsed like this:
[parse tree image not reproduced]
EDIT
This is what happens when I parse 9+38*(19+489*243/1)*1+3:
[parse tree image not reproduced]
| '-'?
should be:
| '-'? NBR
You need to specify that it's a NBR that may (or may not) be preceded by a -
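The idea of treating the minus sign as its own parser alternative (`'-' c`) rather than as part of the number token can be sketched as a small hand-written evaluator. This is an illustration under assumptions, not what ANTLR generates: the `evaluate` function and its helpers are hypothetical, the ID alternative is left out, and the left-recursive alternatives are written as loops because a hand-written recursive descent parser cannot recurse to the left.

```python
import re

def evaluate(text):
    """Evaluate +, -, *, / with parentheses and unary minus over integers."""
    # The sign is never part of a number token; whitespace is simply dropped.
    tokens = re.findall(r"[0-9]+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():                      # c ('+' | '-') c
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():                      # c ('*' | '/') c
        value = unary()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= unary()
            else:
                value /= unary()     # true division, yields a float
        return value

    def unary():                     # '-' c
        if peek() == "-":
            advance()
            return -unary()
        return atom()

    def atom():                      # '(' c ')' | NBR
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        return int(tok)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

for source in ["-1", "- 2", "-3-4", "5+-6", "-(7*8)", "(-2-4)*4"]:
    print(source, "=", evaluate(source))
```

Because the sign is consumed in `unary`, the first `-` in `(-2-4)*4` negates the 2 while the second one is picked up by the loop in `expr` as a subtraction, giving -24.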

Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line, myInt = 10 + 5;, produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parentheses.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rule for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become an ADD_SUB token, because T__0 will always match it first. That is simply how the lexer operates: it tries to match as many characters as possible for every lexer rule, and when 2 (or more) rules match the same number of characters, the one defined first "wins".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';
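The tie-breaking behaviour described above can be imitated with a short Python sketch. It is an approximation under assumptions: `first_token` is a hypothetical helper, the rule tables are ordered lists of regexes standing in for lexer rules, and only the implicit `T__0` token for '+' from the example is modelled.

```python
import re

def first_token(text, rules):
    """Return (rule_name, lexeme) for the start of text.
    The longest match wins; on a tie, the rule listed first wins."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # A later rule replaces the current best only if it is strictly longer.
        if m and m.group() and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# The implicit token for the literal '+' sits before the explicit rules.
with_literal = [
    ("T__0", r"\+"),
    ("ADD_SUB", r"[+-]"),
    ("UNARY_PRE_OR_POST", r"\+\+|--"),
]
# Without the literal in a parser rule, no implicit token is created.
without_literal = [
    ("ADD_SUB", r"[+-]"),
    ("UNARY_PRE_OR_POST", r"\+\+|--"),
]

print(first_token("+ 5", with_literal))     # '+' is claimed by T__0
print(first_token("+ 5", without_literal))  # '+' is claimed by ADD_SUB
print(first_token("++x", with_literal))     # longer match beats rule order
```

In this model `++` is still recognized as UNARY_PRE_OR_POST even when `T__0` is present, because a two-character match beats any one-character match. That would be consistent with the increment and decrement lines parsing cleanly while the `+` lines fail.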

How to code a nextToken() function for a recursive descent LL(1) parser

I'm writing a recursive descent LL(1) parser in C++, but I have a problem because I don't know exactly how to get the next token. I know I have to use regular expressions to recognize a terminal, but I don't know how to get the longest next token.
For example, take this lexer specification and this grammar (without left recursion, already left-factored, and without cycles):
//LEXICAL IN FLEX
TIME [0-9]+
DIRECTION UR|DR|DL|UL|U|D|L|R
ACTION A|J|M
%%
{TIME} {printf("TIME"); return (TIME);}
{DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
{ACTION} {printf("ACTION"); return (ACTION);}
"~" {printf("RELEASED"); return (RELEASED);}
"+" {printf("PLUS_OP"); return (PLUS_OP);}
"*" {printf("COMB_OP"); return (COMB_OP);}
//GRAMMAR IN BISON
command : list_move PLUS_OP list_action
| list_move COMB_OP list_action
| list_move list_action
| list_move
| list_action
;
list_move: move list_move_prm
;
list_move_prm: move
| move list_move_prm
| ";"
;
list_action: ACTION list_action_prm
;
list_action_prm: PLUS_OP ACTION list_action_prm
| COMB_OP ACTION list_action_prm
| ACTION list_action_prm
| ";" //epsilon
;
move: TIME RELEASED DIRECTION
| RELEASED DIRECTION
| DIRECTION
;
I have a string that contains "D DR R + A". It should be accepted as valid, but I have problems reading "DR" because "D" is a token too; I don't know how to get "DR" instead of "D".
There are a number of ways of hand-writing a tokenizer:
You can use a recursive descent LL(1) parser "all the way down" -- rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
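As a rough sketch of the first approach, the left-factored character-level rules for TIME and DIRECTION translate into a scanner that needs only one character of lookahead. It is written in Python for brevity. The `tokenize` function and the `SINGLE` table are hypothetical names, whitespace is assumed to be allowed between tokens but not inside them, and the ";" used in the grammar is left out because the lexer specification does not define it.

```python
DIGITS = "0123456789"
SINGLE = {"~": "RELEASED", "+": "PLUS_OP", "*": "COMB_OP"}

def tokenize(text):
    """Hand-written scanner for the TIME/DIRECTION/ACTION lexicon."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                 # whitespace only separates tokens
            i += 1
            continue
        start = i
        if ch in DIGITS:                 # TIME: DIGIT more_digits
            while i < len(text) and text[i] in DIGITS:
                i += 1
            kind = "TIME"
        elif ch in "UD":                 # DIRECTION: 'U' dir_suffix | 'D' dir_suffix
            i += 1
            if i < len(text) and text[i] in "LR":   # dir_suffix: epsilon | 'L' | 'R'
                i += 1
            kind = "DIRECTION"
        elif ch in "LR":                 # DIRECTION: 'L' | 'R'
            i += 1
            kind = "DIRECTION"
        elif ch in "AJM":                # ACTION
            i += 1
            kind = "ACTION"
        elif ch in SINGLE:               # one-character operators
            i += 1
            kind = SINGLE[ch]
        else:
            raise ValueError(f"unexpected character {ch!r} at position {start}")
        tokens.append((kind, text[start:i]))
    return tokens

print(tokenize("D DR R + A"))
```

After reading `D` the scanner looks at the next character: if it is `L` or `R`, that character is consumed as part of the same DIRECTION token, otherwise the token ends there. That is how `DR` is preferred over `D` without any backtracking.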
“I don't know exactly how to get the next token”
Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard white-space, then read a character (or more if necessary), analyze it, and return the associated token. The following functions will help you achieve your goal:
ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).
"I don't know how to get "DR" instead "D""
Both "D" and "DR" are words. Just read them as you would read a word: is >> word. You will also need a keyword to token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If not found, throw an exception.
A starting point:
#include <iostream>
#include <iomanip>
#include <map>
#include <string>
enum token_t
{
    END,
    PLUS,
    NUMBER,
    D,
    DR,
    R,
    A,
    // ...
};
// ...
using keyword_to_token_t = std::map < std::string, token_t >;
keyword_to_token_t kwtt =
{
    {"A", A},
    {"D", D},
    {"R", R},
    {"DR", DR}
    // ...
};
// ...
std::string s; // text of the last keyword read
int n;         // value of the last number read
// ...
token_t get_token( std::istream& is )
{
    char c;
    std::ws( is ); // discard white-space
    if ( !is.get( c ) ) // read a character
        return END; // failed to read or eof
    // analyze the character
    switch ( c )
    {
    case '+': // simple token
        return PLUS;
    case '0': case '1': // rest of digits
        is.putback( c ); // it starts with a digit: it must be a number, so put it back
        is >> n; // and let the library do the hard work
        return NUMBER;
    //...
    default: // keyword
        is.putback( c );
        is >> s; // reads the whole whitespace-delimited word, so "DR" is not split into "D" and "R"
        if ( kwtt.find( s ) == kwtt.end() )
            throw "keyword not found";
        return kwtt[ s ];
    }
}
int main()
{
    try
    {
        while ( get_token( std::cin ) ) // END has the value 0, so it ends the loop
            ;
        std::cout << "valid tokens";
    }
    catch ( const char* e )
    {
        std::cout << e;
    }
}

How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,
I recently started to use ANTLR4 to translate regular expressions from XSD/XML to CVC4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
The lexer does not "listen" to the parser: no matter whether the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
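The single-character-token idea can be sketched as a small hand-written recursive descent parser in Python, one method per parser rule. This is only an illustration under assumptions: the `Parser` class, the `parse` helper and the tuple-based tree shape are all hypothetical, and because there is no separate lexer here, a comma outside braces is simply treated as a normal character.

```python
SPECIAL = set(".\\?*+{}()|[]")    # characters excluded from NormalChar
DIGITS = "0123456789"

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def reg_exp(self):            # regExp : branch ( '|' branch )*
        branches = [self.branch()]
        while self.peek() == "|":
            self.pos += 1
            branches.append(self.branch())
        return ("regExp", branches)

    def branch(self):             # branch : piece*
        pieces = []
        while self.peek() is not None and self.peek() not in "|)":
            pieces.append(self.piece())
        return pieces

    def piece(self):              # piece : atom quantifier?
        atom = self.atom()
        ch = self.peek()
        if ch is not None and ch in "?*+":
            self.pos += 1
            return (atom, ch)
        if ch == "{":
            return (atom, self.quantity())
        return (atom, None)

    def quantity(self):           # '{' (quantRange | quantMin | quantExact) '}'
        self.expect("{")
        low = high = self.quant_exact()
        if self.peek() == ",":
            self.pos += 1
            high = self.quant_exact() if self.peek() != "}" else None
        self.expect("}")
        return (low, high)

    def quant_exact(self):        # quantExact : Digit+
        start = self.pos
        while self.peek() is not None and self.peek() in DIGITS:
            self.pos += 1
        if start == self.pos:
            raise SyntaxError(f"expected a digit at position {start}")
        return int(self.text[start:self.pos])

    def atom(self):               # atom : normalChar | '(' regExp ')'
        ch = self.peek()
        if ch == "(":
            self.pos += 1
            inner = self.reg_exp()
            self.expect(")")
            return inner
        if ch is None or ch in SPECIAL:
            raise SyntaxError(f"unexpected {ch!r} at position {self.pos}")
        self.pos += 1             # a digit counts as a normal character here
        return ch

def parse(text):
    parser = Parser(text)
    tree = parser.reg_exp()
    if parser.pos != len(text):
        raise SyntaxError(f"unexpected {text[parser.pos]!r} at position {parser.pos}")
    return tree

print(parse("12{3,4}"))
print(parse("a(bd){6,7}c{14,15}"))
```

Whether a digit is a plain character or part of a quantity is decided by which method reads it: `atom` accepts it as a normal character, while `quant_exact` is only reached after a `{`. That context is what a standalone lexer cannot see.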

how to match only part of the expression to string in ocamllex

I have a simple ocamllex program where the rules section looks somewhat like this-
let digits= ['0'-'9']
let variables= 'X'|'Z'
rule addinlist = parse
|['\n'] {addinlist lexbuf;}
| "Inc" '(' variables+ '(' digits+ ')' ')' as ine { !inputstringarray.(!inputstringarrayi) <-ine;
inputstringarrayi := !inputstringarrayi +1;
addinlist lexbuf}
|_ as c
{ printf "Unrecognized character: %c\n" c;
addinlist lexbuf
}
| eof { () }
My question: suppose I want to match Inc(X(7)) so that I can convert it to my abstract syntax, which is "Inc of var of int". I want my lexer to give me separate strings while reading Inc(X(7)): "Inc" as one string (say inb), followed by "X" as another string (say inc), followed by "7" as another string (say ind), so that I can work with inb, inc and ind instead of being stuck with the whole string ine that my program currently gives me. How do I go about this? I hope my question is clear.