Parse arbitrary delimiter character using Antlr4 - regex

I try to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar as in Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better when I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?

You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~##]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter matched by the first chararcter (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
#lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~##]
;
fragment SPACES
: [ \t]+
;

Related

Reluctant matching in ANTLR 4.4

Just as the reluctant quantifiers work in Regular expressions I'm trying to parse two different tokens from my input i.e, for operand1 and operator. And my operator token should be reluctantly matched instead of greedily matching input tokens for operand1.
Example,
Input:
Active Indicator in ("A", "D", "S")
(To simplify I have removed the code relevant for operand2)
Expected operand1:
Active Indicator
Expected operator:
in
Actual output for operand1:
Active indicator in
and none for the operator rule.
Below is my grammar code:
grammar Test;
condition: leftOperand WHITESPACE* operator;
leftOperand: ALPHA_NUMERIC_WS ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC_WS: WORD ( WORD| DIGIT | WHITESPACE )* ( WORD | DIGIT)+ ;
WHITESPACE : (' ' | '\t')+;
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
One solution to this would be to not produce one token for several words but one token per word instead.
Your grammar would then look like this:
grammar Test;
condition: leftOperand operator;
leftOperand: ALPHA_NUMERIC+ ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC: WORD ( WORD| DIGIT)* ;
WHITESPACE : (' ' | '\t')+ -> skip; // ignoring WS completely
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
Like this the lexer will not match the whole input as ALPHA_NUMERIC_WS once the corresponding lexer rule has been entered because any occuring WS forces the lexer to leave the ALPHA_NUMERIC rule. Therefore any following input will be given a chance to be matched by other lexer-rules (in the order they are defined in the grammar).

Changing how Antlr4 prints a tree

I am using Antlr4 in IntelliJ to make a small compiler for arithmetic expressions.
I want to print the tree and use this code snippet to do so.
JFrame frame = new JFrame("Tree");
JPanel panel = new JPanel();
TreeViewer viewr = new TreeViewer(Arrays.asList(
parser.getRuleNames()),tree);
viewr.setScale(2);//scale a little
panel.add(viewr);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(400,400);
frame.setVisible(true);
This makes a tree which looks like this, for the input 3*5\n
Is there a way to adjust this so it reads from top to bottom
Statement
Expression /n
INT * INT
3 5
instead?
My grammar is defined as:
grammar Expression;
statement: expression ENDSTATEMENT # printExpr
| ID '=' expression ENDSTATEMENT # assign
| ENDSTATEMENT # blank
;
expression: expression MULTDIV expression # MulDiv
| expression ADDSUB expression # AddSub
| INT # int
| FLOAT # float
| ID # id
| '(' expression ')' # parens
;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
MULTDIV : ('*' | '/'); //match multiply or divide
ADDSUB : ('+' | '-'); //match add or subtract
FLOAT: INT '.' INT; //match a floating point number
ENDSTATEMENT:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WHITESPACE : [ \t]+ -> skip ; // ignore whitespace
No, not without changing the source of the tree viewer yourself.

How to match a string with an opening brace { in C++ regex

I have about writing regexes in C++. I have 2 regexes which work fine in java. But these throws an error namely
one of * + was not preceded by a valid regular expression C++
These regexes are as follows:
regex r1("^[\s]*{[\s]*\n"); //Space followed by '{' then followed by spaces and '\n'
regex r2("^[\s]*{[\s]*\/\/.*\n") // Space followed by '{' then by '//' and '\n'
Can someone help me how to fix this error or re-write these regex in C++?
See basic_regex reference:
By default, regex patterns follow the ECMAScript syntax.
ECMAScript syntax reference states:
characters:
\character
description: character
matches: the character character as it is, without interpreting its special meaning within a regex expression.
Any character can be escaped except those which form any of the special character sequences above.
Needed for: ^ $ \ . * + ? ( ) [ ] { } |
So, you need to escape { to get the code working:
std::string s("\r\n { \r\nSome text here");
regex r1(R"(^\s*\{\s*\n)");
regex r2(R"(^\s*\{\s*//.*\n)");
std::string newtext = std::regex_replace( s, r1, "" );
std::cout << newtext << std::endl;
See IDEONE demo
Also, note how the R"(pattern_here_with_single_escaping_backslashes)" raw string literal syntax simplifies a regex declaration.

ANTLR4 RegEx lexer modes

I am working on a Regx parser for RegEx inside XSD.
My previous problem was descrived here: ANTLR4 parsing RegEx
I have split the Lexer and Parser since than.
Now I have a problem parsing parantheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like i should use modes on brackets to change the behaviour of ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what i should do.
How do I implement the modes?
Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
}
}
}
And if you run this Main class, the follwoing will be printed to your console:
START_GROUP (
START_CHAR_CLASS [
CHAR_CLASS_ATOM (
CHAR_CLASS_ATOM )
CHAR_CLASS_ATOM \]
END_CHAR_CLASS ]
END_GROUP )
As you can see, the ( and ) are tokenized differently outside the character class as they are inside of it.
You're going to have to handle this in the parser, not the lexer. When lexer sees a '(', it will return token LBrace. For lexer, there is no context as to where token is seen. It simply carves up the input into tokens. You will have to define parse rules and when processing parse tree, you can then determine was the LBrace inside brackets or not.

How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,
I recently started to use ANTLR4 to translate regular expression from XSD / xml to cvc4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
The lexer does not "listen" to the parser: no matter if the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as much characters as possible, and in case of a tie, it chooses the rule defined first.
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;