How to parse grammar of XSD Regex with ANTLR4? - regex

Dear Antlr4 community,
I recently started using ANTLR4 to translate regular expressions from XSD/XML to CVC4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?

"I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error."
The lexer does not "listen" to the parser: no matter whether the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
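The tie-breaking policy described above can be imitated in a few lines of plain Java. This is only a sketch, not ANTLR's implementation: the class MaximalMunchDemo, its rule table, and the "Literal" entry (standing in for the implicit tokens ANTLR creates for quoted literals such as '{' and ',') are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a generated lexer: it only mimics the
// "longest match first, earliest rule on a tie" policy.
final class MaximalMunchDemo {
    // Rules in definition order. "Literal" models the implicit tokens for
    // quoted literals in parser rules (an assumption of this sketch).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Literal", Pattern.compile("[{},()|]"));
        RULES.put("QuantExact", Pattern.compile("[0-9]+"));
        RULES.put("NormalChar", Pattern.compile("[^.\\\\?*+{}()|\\[\\]]"));
        RULES.put("Quantifiers", Pattern.compile("[?*+]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule defined first is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at " + pos);
            }
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12{3,4}"));
    }
}
```

For the input 12{3,4} the first token comes out as QuantExact:12, regardless of what a parser would have liked to see at that position, which is the situation behind the reported error.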
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
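To make the "single-character tokens, glued together by the parser" idea concrete, here is a hand-written recursive-descent sketch in plain Java. It is not what ANTLR generates; the class SingleCharRegexParser and its bracketed output format are hypothetical and only meant to show how digits become ordinary atoms outside braces and quantities inside them.

```java
// Hand-written sketch (not ANTLR-generated): every character is its own
// token, and the parser assembles digit runs into quantities.
final class SingleCharRegexParser {
    private final String in;
    private int pos = 0;

    private SingleCharRegexParser(String in) { this.in = in; }

    // Returns a flat rendering in which each piece is wrapped in <...>.
    static String parse(String input) {
        SingleCharRegexParser p = new SingleCharRegexParser(input);
        String tree = p.regExp();
        if (p.pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    private boolean at(char c) { return pos < in.length() && in.charAt(pos) == c; }

    private void expect(char c) {
        if (!at(c)) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        pos++;
    }

    // regExp : branch ( '|' branch )* ;
    private String regExp() {
        StringBuilder sb = new StringBuilder(branch());
        while (at('|')) { pos++; sb.append('|').append(branch()); }
        return sb.toString();
    }

    // branch : piece* ;
    private String branch() {
        StringBuilder sb = new StringBuilder();
        while (pos < in.length() && !at('|') && !at(')')) sb.append(piece());
        return sb.toString();
    }

    // piece : atom quantifier? ;
    private String piece() {
        String atom = atom();
        if (pos < in.length() && "?*+".indexOf(in.charAt(pos)) >= 0) {
            return "<" + atom + in.charAt(pos++) + ">";
        }
        if (at('{')) {
            pos++;
            String q = quantity();
            expect('}');
            return "<" + atom + "{" + q + "}>";
        }
        return "<" + atom + ">";
    }

    // atom : normalChar | '(' regExp ')' ; a digit is a normalChar here.
    private String atom() {
        if (at('(')) {
            pos++;
            String inner = regExp();
            expect(')');
            return "(" + inner + ")";
        }
        if (pos >= in.length() || ".\\?*+{}()|[]".indexOf(in.charAt(pos)) >= 0) {
            throw new IllegalArgumentException("expected an atom at " + pos);
        }
        return String.valueOf(in.charAt(pos++));
    }

    // quantity : quantRange | quantMin | quantExact ;
    private String quantity() {
        String min = digits();
        if (!at(',')) return min;
        pos++;
        return at('}') ? min + "," : min + "," + digits();
    }

    // quantExact : Digit+ ; the parser, not the lexer, joins the digits.
    private String digits() {
        int start = pos;
        while (pos < in.length() && in.charAt(pos) >= '0' && in.charAt(pos) <= '9') pos++;
        if (start == pos) throw new IllegalArgumentException("expected a digit at " + pos);
        return in.substring(start, pos);
    }
}
```

With this sketch, SingleCharRegexParser.parse("12{3,4}") yields <1><2{3,4}>: the leading 1 and 2 are plain atoms, while 3 and 4 are read as quantities because of where they appear.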

Related

Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line, myInt = 10 + 5;, produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parentheses.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rule for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become an ADD_SUB token because T__0 will always match it first. That is simply how the lexer operates: try to match as many characters as possible for every lexer rule, and when 2 (or more) rules match the same number of characters, let the one defined first "win".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';

Reluctant matching in ANTLR 4.4

Just as reluctant quantifiers work in regular expressions, I'm trying to parse two different tokens from my input, i.e. one for operand1 and one for the operator. My operator token should be matched reluctantly, instead of operand1 greedily consuming the input.
Example,
Input:
Active Indicator in ("A", "D", "S")
(To simplify I have removed the code relevant for operand2)
Expected operand1:
Active Indicator
Expected operator:
in
Actual output for operand1:
Active Indicator in
and none for the operator rule.
Below is my grammar code:
grammar Test;
condition: leftOperand WHITESPACE* operator;
leftOperand: ALPHA_NUMERIC_WS ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC_WS: WORD ( WORD| DIGIT | WHITESPACE )* ( WORD | DIGIT)+ ;
WHITESPACE : (' ' | '\t')+;
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
One solution to this would be to not produce one token for several words but one token per word instead.
Your grammar would then look like this:
grammar Test;
condition: leftOperand operator;
leftOperand: ALPHA_NUMERIC+ ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
fragment WORD: (LOWERCASE | UPPERCASE )+ ; // fragment, so it cannot win a tie against ALPHA_NUMERIC
ALPHA_NUMERIC: WORD ( WORD| DIGIT)* ;
WHITESPACE : (' ' | '\t')+ -> skip; // ignoring WS completely
fragment DIGIT: '0'..'9' ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
This way the lexer can no longer match the whole input as a single token, as ALPHA_NUMERIC_WS did: any whitespace forces the lexer to leave the ALPHA_NUMERIC rule. The input that follows then gets a chance to be matched by the other lexer rules (on a tie in length, in the order they are defined in the grammar).
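The effect of "one token per word, whitespace skipped" can be illustrated with a small plain-Java sketch. It is not ANTLR code: the class WordTokenDemo and its methods are hypothetical, and for brevity it recognises only the single-word operators (in, =, !=), not NOT_IN.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics "one token per word, whitespace skipped".
final class WordTokenDemo {
    // Splits on spaces and tabs; the separators themselves are discarded.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("[ \t]+")) {
            if (!word.isEmpty()) tokens.add(word);
        }
        return tokens;
    }

    // Returns {leftOperand, operator}: the words before the first operator
    // token form the operand, much like leftOperand: ALPHA_NUMERIC+ .
    static String[] splitCondition(String input) {
        List<String> operand = new ArrayList<>();
        for (String token : tokenize(input)) {
            if (token.equals("in") || token.equals("=") || token.equals("!=")) {
                return new String[] { String.join(" ", operand), token };
            }
            operand.add(token);
        }
        throw new IllegalArgumentException("no operator found");
    }
}
```

For the input Active Indicator in, splitCondition returns the operand "Active Indicator" and the operator "in", because the word in is a token of its own instead of being swallowed by a multi-word token.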

Parse arbitrary delimiter character using Antlr4

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that equals the first character (which is a REGEX_DELIMITER, of course).
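The logic those two predicates express can be written out as an ordinary loop. The following plain-Java sketch is not ANTLR code and only approximates the rule: the class DelimitedRegexScanner and its scan method are hypothetical, and the delimiter set is limited to /, ~ and # as an assumption.

```java
// Plain-Java sketch of the predicate logic; not ANTLR code.
final class DelimitedRegexScanner {
    // Returns the index just past the closing delimiter, or -1 if the text
    // at 'start' is not a complete delimited regex.
    static int scan(String text, int start) {
        if (start >= text.length()) return -1;
        char delimiter = text.charAt(start);
        if ("/~#".indexOf(delimiter) < 0) return -1; // assumed delimiter set
        int pos = start + 1;
        int atoms = 0;
        // Keep consuming atoms while the opening delimiter is not ahead.
        while (pos < text.length() && text.charAt(pos) != delimiter) {
            if (text.charAt(pos) == '\\') {
                if (pos + 1 >= text.length()) return -1; // dangling backslash
                pos += 2; // escaped character, possibly the delimiter itself
            } else {
                pos++;
            }
            atoms++;
        }
        // Require at least one atom and an actual closing delimiter.
        boolean closed = pos < text.length() && atoms > 0;
        return closed ? pos + 1 : -1;
    }
}
```

For example, scan("/foo/ rest", 0) returns 5, while scan("/bar", 0) returns -1 because no closing / follows, which corresponds to the /bar\ input above falling through to ANY tokens.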
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

ANTLR4 RegEx lexer modes

I am working on a RegEx parser for RegEx inside XSD.
My previous problem was described here: ANTLR4 parsing RegEx
I have split the Lexer and Parser since then.
Now I have a problem parsing parentheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like I should use modes on brackets to change the behaviour of the ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what I should do.
How do I implement the modes?
Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
}
}
}
And if you run this Main class, the following will be printed to your console:
START_GROUP (
START_CHAR_CLASS [
CHAR_CLASS_ATOM (
CHAR_CLASS_ATOM )
CHAR_CLASS_ATOM \]
END_CHAR_CLASS ]
END_GROUP )
As you can see, the ( and ) are tokenized differently outside the character class than they are inside of it.
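What pushMode and popMode do to the token stream can be imitated with an explicit stack. The sketch below is plain Java rather than a generated lexer; the class ModeStackDemo is hypothetical, and it simplifies CHAR_CLASS_ATOM by not excluding line breaks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-rolled illustration of a mode stack; a generated ANTLR lexer
// manages its modes internally.
final class ModeStackDemo {
    enum Mode { DEFAULT, CHAR_CLASS }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '[') {
                    tokens.add("START_CHAR_CLASS [");
                    modes.push(Mode.CHAR_CLASS); // like -> pushMode(CharClass)
                } else if (c == '(') {
                    tokens.add("START_GROUP (");
                } else if (c == ')') {
                    tokens.add("END_GROUP )");
                } else if (c == ']') {
                    throw new IllegalArgumentException("unmatched ']' at " + pos);
                } else {
                    tokens.add("PLAIN_ATOM " + c);
                }
                pos++;
            } else if (c == ']') {
                tokens.add("END_CHAR_CLASS ]");
                modes.pop(); // like -> popMode
                pos++;
            } else if (c == '\\') {
                if (pos + 1 >= input.length()) {
                    throw new IllegalArgumentException("dangling backslash at " + pos);
                }
                // An escape and the character after it form one atom.
                tokens.add("CHAR_CLASS_ATOM " + input.substring(pos, pos + 2));
                pos += 2;
            } else {
                // Inside the class, ( and ) are ordinary atoms.
                tokens.add("CHAR_CLASS_ATOM " + c);
                pos++;
            }
        }
        return tokens;
    }
}
```

Calling ModeStackDemo.tokenize("([()\\]])") produces seven tokens in the same order as the listing above: the outer parentheses become START_GROUP and END_GROUP, while the inner ones are CHAR_CLASS_ATOM tokens.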
You're going to have to handle this in the parser, not the lexer. When the lexer sees a '(', it returns an LBrace token. The lexer has no context about where a token is seen; it simply carves up the input into tokens. You will have to define parser rules, and when processing the parse tree you can then determine whether the LBrace was inside brackets or not.

How can I implement grouping preference in antlr?

I want to write a lexer and parser which could take expressions like
(4+y)*8
4+5*x
(3)+(z*(4+w))*6
And then parse them considering the priority of multiplication over addition. In particular, I can't figure out how I could avoid
4+5*x
being grouped as
MULTIPLICATION(ADDITION(4,5),x) instead of ADDITION(4,MULTIPLICATION(5,x))
My lexer looks like that:
PLUS : '+';
TIMES : '*';
NUMBER : [0-9]+'.'?[0-9]*;
VARIABLE : [a-zA-Z]+;
OPENING : '(';
CLOSING : ')';
WHITESPACE : [ \t\r\n]+ -> skip ;
The correct grouping will happen automatically if you define your lower-priority operations closer to your "root expression" rule than the higher-priority ones:
expr
: e=multDivExpr
( PLUS e=multDivExpr
| MINUS e=multDivExpr
)*
;
multDivExpr
: e=atom
( TIMES e=atom
| DIV e=atom
| REM e=atom
)*
;
atom
: NUMBER
| VARIABLE
| OPENING e=expr CLOSING
;
A simple way to understand what's going on is to think that the recursive descent parser generated by ANTLR will use multDivExpr non-terminals as "building blocks" for the "additive" expr non-terminal, therefore applying the grouping to multiplication and division before considering addition and subtraction.
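The "building blocks" intuition can be checked with a hand-written recursive-descent parser that mirrors the three rules. This is only a sketch in plain Java, not the parser ANTLR generates; the class PrecedenceDemo is hypothetical, and it handles only + and * with integer or alphabetic operands.

```java
// Hand-written recursive-descent sketch mirroring expr / multDivExpr / atom.
final class PrecedenceDemo {
    private final String in;
    private int pos = 0;

    // Whitespace is dropped up front, like the skipped WHITESPACE token.
    private PrecedenceDemo(String in) { this.in = in.replaceAll("\\s+", ""); }

    static String parse(String input) {
        PrecedenceDemo p = new PrecedenceDemo(input);
        String tree = p.expr();
        if (p.pos != p.in.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    private boolean accept(char c) {
        if (pos < in.length() && in.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    // expr : multDivExpr ( PLUS multDivExpr )* ;
    private String expr() {
        String left = multDivExpr();
        while (accept('+')) left = "ADDITION(" + left + "," + multDivExpr() + ")";
        return left;
    }

    // multDivExpr : atom ( TIMES atom )* ;
    private String multDivExpr() {
        String left = atom();
        while (accept('*')) left = "MULTIPLICATION(" + left + "," + atom() + ")";
        return left;
    }

    // atom : NUMBER | VARIABLE | OPENING expr CLOSING ;
    private String atom() {
        if (accept('(')) {
            String inner = expr();
            if (!accept(')')) throw new IllegalArgumentException("expected ')' at " + pos);
            return inner;
        }
        int start = pos;
        while (pos < in.length() && Character.isLetterOrDigit(in.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("expected an operand at " + pos);
        return in.substring(start, pos);
    }
}
```

Here PrecedenceDemo.parse("4+5*x") returns ADDITION(4,MULTIPLICATION(5,x)): each operand of + is a complete multDivExpr, so the multiplication is grouped before the addition is assembled.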