ANTLR4 RegEx lexer modes

I am working on a RegEx parser for RegEx inside XSD.
My previous problem was described here: ANTLR4 parsing RegEx
I have split the lexer and parser since then.
Now I have a problem parsing parentheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like I should use modes on brackets to change the behaviour of the ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what I should do.
How do I implement the modes?

Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
}
}
}
And if you run this Main class, the following will be printed to your console:
START_GROUP (
START_CHAR_CLASS [
CHAR_CLASS_ATOM (
CHAR_CLASS_ATOM )
CHAR_CLASS_ATOM \]
END_CHAR_CLASS ]
END_GROUP )
As you can see, the ( and ) are tokenized differently outside the character class than they are inside it.
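To make the mode stack more tangible, here is a minimal hand-written sketch in plain Java (no ANTLR runtime) that imitates what pushMode and popMode do in the grammar above. The class name ModeStackDemo and the Mode enum are hypothetical, and the sketch simplifies things: a stray ']' outside a character class becomes a PLAIN_ATOM here, whereas the grammar would report a token recognition error, and line breaks inside a character class are not rejected.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ModeStackDemo {
    // Hypothetical mode names mirroring the default mode and "mode CharClass".
    enum Mode { DEFAULT, CHAR_CLASS }

    // Returns one "TOKEN_NAME text" string per token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                switch (c) {
                    case '[':
                        tokens.add("START_CHAR_CLASS " + c);
                        modes.push(Mode.CHAR_CLASS); // like -> pushMode(CharClass)
                        break;
                    case '(':
                        tokens.add("START_GROUP " + c);
                        break;
                    case ')':
                        tokens.add("END_GROUP " + c);
                        break;
                    default:
                        // Simplification: a stray ']' also ends up here.
                        tokens.add("PLAIN_ATOM " + c);
                        break;
                }
                i++;
            } else if (c == ']') {
                tokens.add("END_CHAR_CLASS " + c);
                modes.pop(); // like -> popMode
                i++;
            } else if (c == '\\' && i + 1 < input.length()) {
                // A backslash escapes the next character, as in: '\\' .
                tokens.add("CHAR_CLASS_ATOM " + input.substring(i, i + 2));
                i += 2;
            } else {
                tokens.add("CHAR_CLASS_ATOM " + c);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("([()\\]])").forEach(System.out::println);
    }
}
```

The only state the lexer carries between tokens is the stack of modes, which is why the same character can produce different token types depending on what was seen before it.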

You're going to have to handle this in the parser, not the lexer. When the lexer sees a '(', it returns the token LBrace. The lexer has no context about where a token appears; it simply carves the input up into tokens. You will have to define parser rules and, when processing the parse tree, determine whether the LBrace was inside brackets or not.

Related

How to resolve parsing error in ANTLR CPP14 Grammar

I am using the below ANTLR grammar for parsing my code.
https://github.com/antlr/grammars-v4/tree/master/cpp
But I am getting a parsing error while using the below code:
TEST_F(TestClass, false_positive__N)
{
static constexpr char text[] =
R"~~~(; ModuleID = 'a.cpp'
source_filename = "a.cpp"
define private i32 #"__ir_hidden#100007_"(i32 %arg1) {
ret i32 %arg1
}
define i32 #main(i32 %arg1) {
%1 = call i32 #"__ir_hidden#100007_"(i32 %arg1)
ret i32 %1
}
)~~~";
NameMock ns(text);
ASSERT_EQ(std::string(text), ns.getSeed());
}
Error Details:
line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 15:4 token recognition error at: '";\n'
What modification is needed in parser/lexer to parse the input correctly? Any help on this is highly appreciated. Thanks in advance.
Whenever a certain input does not get parsed properly, I start by displaying all the tokens the input is generating. If you do that, you'll probably see why things are going wrong. Another way would be to remove most of the source, and gradually add more lines of code to it: at a certain point the parser will fail, and you have a starting point for solving it.
So if you dump the tokens your input is creating, you'd get these tokens:
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
UserDefinedLiteral `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 #"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 #main(i32 %arg1) {\n %1 = call i32 #"__ir_hidden`
Directive `#100007_"(i32 %arg1)`
...
You can see that the input R"~~~( ... )~~~" is not tokenised as a StringLiteral. Note that a StringLiteral will never be created because at the top of the lexer grammar there is this rule:
Literal:
IntegerLiteral
| CharacterLiteral
| FloatingLiteral
| StringLiteral
| BooleanLiteral
| PointerLiteral
| UserDefinedLiteral;
causing none of the IntegerLiteral..UserDefinedLiteral to be created: all of them will become Literal tokens. It is far better to move this Literal rule to the parser instead. I must admit that while scrolling through the lexer grammar, it is a bit of a mess, and fixing the R"~~~( ... )~~~" will only delay another lingering problem popping its ugly head :). I am pretty sure this grammar has never been properly tested, and is full of bugs.
If you look at the lexer definition of a StringLiteral:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' .*? '(' .*? ')' .*? '"'
;
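it is clear why '"' .*? '(' .*? ')' .*? '"' will not match your entire string literal: nothing in the rule checks that the closing ')' is followed by the same delimiter (~~~) that opened the literal, so the non-greedy .*? parts stop at the first ')' and then at the first quote after it, which here lies in the middle of the raw string.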
What you need is a rule looking like this:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
;
but that will cause the ( . )* to consume too much: it will grab every character and will then backtrack to the last quote in your character stream (not what you want).
What you really want is this:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
;
The "break out of this loop when we see ')~~~"'" part can be done with a semantic predicate like this:
lexer grammar CPP14Lexer;
@members {
private boolean closeDelimiterAhead(String matched) {
// Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
String delimiter = ")" + matched.substring(matched.indexOf('"') + 1, matched.indexOf('(')) + "\"";
StringBuilder ahead = new StringBuilder();
// Collect as many characters ahead as there are `delimiter`-chars
for (int n = 1; n <= delimiter.length(); n++) {
if (_input.LA(n) == CPP14Lexer.EOF) {
throw new RuntimeException("Missing delimiter: " + delimiter);
}
ahead.append((char) _input.LA(n));
}
return delimiter.equals(ahead.toString());
}
}
...
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
;
...
If you now dump the tokens, you will see this:
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
Literal `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 #"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 #main(i32 %arg1) {\n %1 = call i32 #"__ir_hidden#100007_"(i32 %arg1)\n ret i32 %1\n}\n)~~~"`
Semi `;`
...
And there it is: R"~~~( ... )~~~" properly tokenised as a single token (albeit as a Literal token instead of a StringLiteral...). It will throw an exception when input is like R"~~~( ... )~~" or R"~~~( ... )~~~~", and it will successfully tokenise input like R"~~~( )~~" )~~~~" )~~~"
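The termination rule the predicate enforces can also be checked outside ANTLR. The following is a rough sketch in plain Java, not the lexer code itself: instead of a character-by-character lookahead it uses an ordinary string search for ')' + delimiter + '"'. The class name RawStringScan and the method endOfRawString are made up for illustration.

```java
public class RawStringScan {
    /**
     * Returns the index just past the raw string literal that starts at
     * `start`, which must point at the 'R' of R"delim( ... )delim".
     * Throws if the literal is malformed or never closed.
     */
    static int endOfRawString(String src, int start) {
        if (!src.startsWith("R\"", start)) {
            throw new IllegalArgumentException("No raw string at index " + start);
        }
        int quote = start + 1;
        int open = src.indexOf('(', quote);
        if (open < 0) {
            throw new IllegalStateException("Missing '(' after delimiter");
        }
        // Everything between the opening quote and the first '(' is the delimiter.
        String close = ")" + src.substring(quote + 1, open) + "\"";
        int end = src.indexOf(close, open + 1);
        if (end < 0) {
            throw new IllegalStateException("Missing delimiter: " + close);
        }
        return end + close.length();
    }

    public static void main(String[] args) {
        String ok = "R\"~~~( )~~\" )~~~~\" )~~~\"";
        // Prints true: only the final )~~~" closes the literal.
        System.out.println(endOfRawString(ok, 0) == ok.length());
    }
}
```

With this sketch, R"~~~( ... )~~" and R"~~~( ... )~~~~" both end in an exception, because neither contains the exact sequence )~~~" after the opening parenthesis.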
Quickly looking into the parser grammar, I see tokens like StringLiteral being referenced, but such a token will never be produced by the lexer (as I mentioned earlier).
Proceed with caution with this grammar. I would not advise using it (blindly) for anything other than some sort of educational purpose. Do not use in production!
The changes below in the lexer helped me resolve the raw string parsing issue:
Stringliteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' // Match Opening Double Quote
( /* Handle Empty D_CHAR_SEQ without Predicates
This should also work
'(' .*? ')'
*/
'(' ( ~')' | ')'+ ~'"' )* (')'+)
| D_CHAR_SEQ
/* // Limit D_CHAR_SEQ to 16 characters
{ ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
*/
'('
/* From Spec :
Any member of the source character set, except
a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
( which may be empty ) followed by a double quote ".
- The following loop consumes characters until it matches the
terminating sequence of characters for the RAW STRING
- The options are mutually exclusive, so Only one will
ever execute in each loop pass
- Each Option will execute at least once. The first option needs to
match the ')' character even if the D_CHAR_SEQ is empty. The second
option needs to match the closing \" to fall out of the loop. Each
option will only consume at most 1 character
*/
( // Consume everything but the Double Quote
~'"'
| // If text Does Not End with closing Delimiter, consume the Double Quote
'"'
{
!getText().endsWith(
")"
+ getText().substring( getText().indexOf( "\"" ) + 1
, getText().indexOf( "(" )
)
+ '\"'
)
}?
)*
)
'"' // Match Closing Double Quote
/*
// Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
// Send D_CHAR_SEQ <TAB> ... to Parser
{
setText( getText().substring( getText().indexOf("\"") + 1
, getText().indexOf("(")
)
+ "\t"
+ getText().substring( getText().indexOf("(") + 1
, getText().lastIndexOf(")")
)
);
}
*/
;
fragment D_CHAR_SEQ // Should be limited to 16 characters
: D_CHAR+
;
fragment D_CHAR
/* Any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: '\u0021'..'\u0023'
| '\u0025'..'\u0027'
| '\u002a'..'\u003f'
| '\u0041'..'\u005b'
| '\u005d'..'\u005f'
| '\u0061'..'\u007e'
;

How to write Antlr4 grammar rule to match file path?

What is the best method to write antlr4 grammar to match file paths like
"C:\Users\Alex\IdeaProjects\Compiler_Project\antlrTest\src\SQL.g4"
Or relative path like
"Compiler_Project//samples//test.txt"
My guess is you are trying to parse some sort of scripting language, like bash or zsh.
I agree that Antlr might not be the best choice to merely parse a file path, but that wasn't your question, was it?
Here is a grammar excerpt from a larger grammar that parses windows batch files.
It's worth noting again that Antlr might not be the best choice for parsing Windows batch commands either in that each command can have peculiar syntax that doesn't readily apply to all the commands in a batch file.
That doesn't mean you can't do it though! Here, I use the 'island grammar' feature which requires separate lexer.g4 and grammar.g4 files but allows you to treat each command as its own little grammar.
Token reuse is a little awkward but not horrible.
BatchLexer.g4
lexer grammar BatchLexer;
options {
caseInsensitive=true;
}
CD : ('CD' | 'CHDIR') -> pushMode(CD_CMD) ;
DOT : '.' ;
DOTDOT : '..' ;
BLANK_LINE : NL ;
NL : '\n';
OPTION : '/' [a-z]+? ;
DRIVE : [a-z] ':' ; //posix?
WS : [ \t\r]+ ->skip ;
// This introduces the type name, but doesn't match anything at this scope
PATH : ~[.] ;
fragment ESCAPED_QUOTE : '\\"' ;
fragment PATH_WORD : ~[ <>:/|\r\n]+ ;
fragment RAW_PATH : DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) ;
fragment QUOTED_PATH : '"' DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) '"' ;
mode CD_CMD ;
CD_OPTION : OPTION -> type(OPTION) ;
CD_PATH : (RAW_PATH | QUOTED_PATH) -> type(PATH) ;
CD_NL : NL -> type(NL), popMode ;
CD_WS : WS ->skip ;
Batch.g4
grammar Batch;
options {
tokenVocab=BatchLexer;
caseInsensitive=true;
}
file : (command)* EOF ;
command : (
cd_cmd
)? (NL | BLANK_LINE) ;
cd_cmd : CD OPTION? PATH*? ;

Parse arbitrary delimiter character using Antlr4

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that equals the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
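The logic of the two predicates in the first alternative can be traced by hand with a small plain-Java sketch that does not use ANTLR at all. The class DelimitedRegexScan and the method matchRegex are hypothetical names, and the delimiter set "/~#" is an assumption for illustration; extend it as needed.

```java
public class DelimitedRegexScan {
    /**
     * Tries to match delimiter + one or more atoms + same delimiter,
     * starting at `start`. Returns the end index (exclusive) of the match,
     * or -1 if there is no match.
     */
    static int matchRegex(String src, int start) {
        if (start >= src.length()) {
            return -1;
        }
        char delim = src.charAt(start);
        if ("/~#".indexOf(delim) < 0) { // assumed delimiter set
            return -1;
        }
        int i = start + 1;
        // First predicate: keep consuming atoms while the delimiter is not ahead.
        while (i < src.length() && src.charAt(i) != delim) {
            if (src.charAt(i) == '\\') {
                if (i + 1 >= src.length()) {
                    return -1; // dangling backslash at end of input
                }
                i += 2; // atom of the form: '\\' .
            } else {
                i++;    // atom of the form: ~[\\]
            }
        }
        // Second predicate: a closing delimiter must be ahead, and the ( ... )+
        // loop requires at least one atom between the delimiters.
        if (i >= src.length() || i == start + 1) {
            return -1;
        }
        return i + 1;
    }

    public static void main(String[] args) {
        System.out.println(matchRegex("/foo/", 0));     // 5
        System.out.println(matchRegex("/bar\\ ~", 0));  // -1, never closed
    }
}
```

An escaped delimiter is skipped because the backslash and the character after it are consumed together as one atom before the delimiter check runs again.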
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

ANTLR4 parsing RegEx

I am trying to parse RegEx and specifically the following:
[A-Z0-9]{1,20}
The problem is, I don't know how to make the following grammar work because the Char and Int tokens both recognize digits.
grammar RegEx;
regEx : (character count? )+ ;
character : Char
| range ;
range : '[' (rangeChar|rangeX)+ ']' ;
rangeX : rangeStart '-' rangeEnd ;
rangeChar : Char ;
rangeStart : Char ;
rangeEnd : Char ;
count : '{' (countExact | (countMin ',' countMax) ) '}' ;
countMin : D+ ;
countMax : Int ;
countExact : Int ;
channels {
COUNT_CHANNEL,
RANGE_CHANNEL
}
Char : D | C ;
Int : D+ -> channel(COUNT_CHANNEL) ;
Semicolon : ';' ;
Comma : ',' ;
Asterisk : '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
//CourlyBracketL : '{' ;
//CourlyBracketR : '}' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines, \r (Windows)
fragment D : [0-9] ;
fragment C : [a-zA-Z] ;
Now, I'm a noob and I am lost as to whether I should try lexer modes, channels, some ifs, or what the "normal" approach is here.
Thanks!
Putting tokens on any channel other than the default hides them from the normal operation of the parser.
Try not to combine tokens in the lexer: you wind up losing information that can be useful in the parser.
Try this:
grammar RegEx;
regEx : ( value count? )+ ;
value : alphNum | range ;
range : LBrack set+ RBrack ;
set : b=alphNum ( Dash e=alphNum)? ;
count : LBrace min=num ( Comma max=num )? RBrace ;
alphNum : Char | Int ;
num : Int+ ;
Char : ALPHA ;
Int : DIGIT ;
Semi : ';' ;
Comma : ',' ;
Star : '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
LBrace : '{' ;
RBrace : '}' ;
LBrack : '[' ;
RBrack : ']' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;

How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,
I recently started to use ANTLR4 to translate regular expressions from XSD / XML to cvc4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
The lexer does not "listen" to the parser: no matter if the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
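The "longest match, then first rule" behaviour can be imitated with java.util.regex to see why 12 never becomes two NormalChar tokens. This is only a sketch under simplifying assumptions: the class MaximalMunchDemo is hypothetical, and it models just the three named lexer rules from the question, leaving out the implicit literal tokens such as '{' and ','.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {
    // Rules in definition order, as in the question's grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("QuantExact", Pattern.compile("[0-9]+"));
        RULES.put("NormalChar", Pattern.compile("[^.\\\\?*+{}()|\\[\\]]"));
        RULES.put("Quantifiers", Pattern.compile("[?*+]"));
    }

    /** Returns "RuleName:text" for the token at `pos`, or null if no rule matches. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // Strictly greater: on a tie the rule defined earlier keeps the win.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                bestRule = rule.getKey();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("12{3,4}", 0)); // QuantExact:12
        System.out.println(nextToken("6", 0));       // QuantExact:6 (tie, first rule wins)
        System.out.println(nextToken("a(bd)", 0));   // NormalChar:a
    }
}
```

Swapping the first two entries in RULES reproduces the asker's second experiment: a single digit then ties in favour of NormalChar, while 12 is still a QuantExact because it is the longer match.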
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;