ANTLR4 Lexing C++11 Raw String - c++

All,
I've been playing around with creating a C++ grammar from the standards document N4567, which is the latest I could find. I believe the grammar is complete, but I need to test it. One issue I've been trying to resolve is getting the lexer to recognize raw strings from the standard. I've implemented a possible solution using actions and semantic predicates, and I need help determining whether it actually works. I've read the ANTLR4 reference on the interaction between actions and predicates but can't decide if my solution is valid. A stripped-down grammar is included below. Any thoughts will be appreciated; I've tried to include my reasoning as comments in the sample.
grammar SampleRaw;
@lexer::members {
string d_char_seq = "";
}
string_literal
: ENCODING_PREFIX? '\"' S_CHAR* '\"'
| ENCODING_PREFIX? 'R' Raw_String
;
ENCODING_PREFIX // one of
: 'u8'
| [uUL]
;
S_CHAR /* any member of the source character set except the
double_quote ", backslash \, or NEW_LINE character
*/
: ~[\"\\\n\r]
| ESCAPE_SEQUENCE
| UNIV_CHAR_NAME
;
fragment ESCAPE_SEQUENCE
: SIMPLE_ESCAPE_SEQ
| OCT_ESCAPE_SEQ
| HEX_ESCAPE_SEQ
;
fragment SIMPLE_ESCAPE_SEQ // one of
: '\\' '\''
| '\\' '\"'
| '\\' '?'
| '\\' '\\'
| '\\' 'a'
| '\\' 'b'
| '\\' 'f'
| '\\' 'n'
| '\\' 'r'
| '\\' 't'
| '\\' 'v'
;
fragment OCT_ESCAPE_SEQ
: [0-3] ( OCT_DIGIT OCT_DIGIT? )?
| [4-7] ( OCT_DIGIT )?
;
fragment HEX_ESCAPE_SEQ
: '\\' 'x' HEX_DIGIT+
;
fragment UNIV_CHAR_NAME
: '\\' 'u' HEX_QUAD
| '\\' 'U' HEX_QUAD HEX_QUAD
;
fragment HEX_QUAD
: HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment HEX_DIGIT
: [a-zA-Z0-9]
;
fragment OCT_DIGIT
: [0-7]
;
/*
Raw_String
: '\"' D_CHAR* '(' R_CHAR* ')' D_CHAR* '\"'
;
*/
Raw_String
: ( /* CASE when the D_CHAR sequence is empty.
The ACTION in D_CHAR_SEQ only sets the variable d_char_seq
when it is empty, so handle the empty case statically here.
*/
'\"'
'('
( ~[)] // Anything but )
| [)] ~[\"] // ) Actually OK, can't be followed by "
// - )" - these are the terminating chars
)*
')'
'\"'
| '\"'
D_CHAR_SEQ /* Will the ACTION in D_CHAR_SEQ be an issue for
the Semantic Predicates Below????
*/
'('
( ~[)] // Anything but )
| [)] D_CHAR_SEQ { ( getText() != d_char_seq ) }?
/* ) is actually OK as long as it is not followed by a matching D_CHAR_SEQ.
IF the D_CHAR_SEQs match, turn OFF this alternative
*/
| [)] D_CHAR_SEQ { ( getText() == d_char_seq ) }? ~[\"]
/* ) is actually OK here, but it must be followed by a matching D_CHAR_SEQ.
IF the D_CHAR_SEQs match, turn ON this alternative.
Can't match the final " , but
WE HAVE MATCHED OUR TERMINATING CHARS
*/
)*
')'
D_CHAR_SEQ /* No need to check here:
matching the terminating chars is the only way to get out
of the loop above
*/
'\"'
)
{ d_char_seq = ""; } // Reset Variable
;
/*
fragment R_CHAR
// any member of the source character set, except a right
// parenthesis ) followed by the initial D_CHAR*
// (which may be empty) followed by a double quote ".
//
: ~[)]
;
*/
fragment D_CHAR
/* any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: ~[ )(\\\t\v\f\n\r]
;
fragment D_CHAR_SEQ
: D_CHAR+ { d_char_seq = ( d_char_seq == "" ) ? getText() : d_char_seq ; }
;

I managed to hack this out myself; any comments or possible improvements would be greatly appreciated. If this can be done without actions, that would be great to know as well. A small token-dump harness for exercising the grammar is sketched after the listing below.
The one drawback is that the \" and the D_CHAR_SEQ are part of the text of Raw_String passed to the parser. The parser can strip them out, but it would be nice if the lexer did it.
grammar SampleRaw;
Reg_String
: '\"' S_CHAR* '\"'
;
fragment S_CHAR
/* any member of the source character set except the
double_quote ", backslash \, or NEW_LINE character
*/
: ~[\n\r\"\\]
| ESCAPE_SEQUENCE
| UNIV_CHAR_NAME
;
fragment ESCAPE_SEQUENCE
: SIMPLE_ESCAPE_SEQ
| OCT_ESCAPE_SEQ
| HEX_ESCAPE_SEQ
;
fragment SIMPLE_ESCAPE_SEQ // one of
: '\\' '\''
| '\\' '\"'
| '\\' '?'
| '\\' '\\'
| '\\' 'a'
| '\\' 'b'
| '\\' 'f'
| '\\' 'n'
| '\\' 'r'
| '\\' 't'
| '\\' 'v'
;
fragment OCT_ESCAPE_SEQ
: [0-3] ( OCT_DIGIT OCT_DIGIT? )?
| [4-7] ( OCT_DIGIT )?
;
fragment OCT_DIGIT
: [0-7]
;
fragment HEX_ESCAPE_SEQ
: '\\' 'x' HEX_DIGIT+
;
fragment HEX_DIGIT
: [a-zA-Z0-9]
;
fragment UNIV_CHAR_NAME
: '\\' 'u' HEX_QUAD
| '\\' 'U' HEX_QUAD HEX_QUAD
;
fragment HEX_QUAD
: HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
Raw_String
: 'R'
'\"' // Match Opening Double Quote
( /* Handle Empty D_CHAR_SEQ without Predicates
This should also work
'(' .*? ')'
*/
'(' ( ~')' | ')'+ ~'\"' )* (')'+)
| D_CHAR_SEQ
/* // Limit D_CHAR_SEQ to 16 characters
{ ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
*/
'('
/* From Spec :
Any member of the source character set, except
a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
( which may be empty ) followed by a double quote ".
- The following loop consumes characters until it matches the
terminating sequence of characters for the RAW STRING
- The options are mutually exclusive, so Only one will
ever execute in each loop pass
- Each Option will execute at least once. The first option needs to
match the ')' character even if the D_CHAR_SEQ is empty. The second
option needs to match the closing \" to fall out of the loop. Each
option will only consume at most 1 character
*/
( // Consume everything but the Double Quote
~'\"'
| // If text Does Not End with closing Delimiter, consume the Double Quote
'\"'
{
!getText().endsWith(
")"
+ getText().substring( getText().indexOf( "\"" ) + 1
, getText().indexOf( "(" )
)
+ '\"'
)
}?
)*
)
'\"' // Match Closing Double Quote
/*
// Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
// Send D_CHAR_SEQ <TAB> ... to Parser
{
setText( getText().substring( getText().indexOf("\"") + 1
, getText().indexOf("(")
)
+ "\t"
+ getText().substring( getText().indexOf("(") + 1
, getText().lastIndexOf(")")
)
);
}
*/
;
fragment D_CHAR_SEQ // Should be limited to 16 characters
: D_CHAR+
;
fragment D_CHAR
/* Any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: '\u0021'..'\u0023'
| '\u0025'..'\u0027'
| '\u002a'..'\u003f'
| '\u0041'..'\u005b'
| '\u005d'..'\u005f'
| '\u0061'..'\u007e'
;
ENCODING_PREFIX // one of
: 'u8'
| [uUL]
;
WhiteSpace
: [ \u0000-\u0020\u007f]+ -> skip
;
start
: string_literal* EOF
;
string_literal
: ENCODING_PREFIX? Reg_String
| ENCODING_PREFIX? Raw_String
;

Related

How to resolve parsing error in ANTLR CPP14 Grammar

I am using the below ANTLR grammar for parsing my code.
https://github.com/antlr/grammars-v4/tree/master/cpp
But I am getting a parsing error while using the below code:
TEST_F(TestClass, false_positive__N)
{
static constexpr char text[] =
R"~~~(; ModuleID = 'a.cpp'
source_filename = "a.cpp"
define private i32 #"__ir_hidden#100007_"(i32 %arg1) {
ret i32 %arg1
}
define i32 #main(i32 %arg1) {
%1 = call i32 #"__ir_hidden#100007_"(i32 %arg1)
ret i32 %1
}
)~~~";
NameMock ns(text);
ASSERT_EQ(std::string(text), ns.getSeed());
}
Error Details:
line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 15:4 token recognition error at: '";\n'
What modification is needed in parser/lexer to parse the input correctly? Any help on this is highly appreciated. Thanks in advance.
Whenever a certain input does not get parsed properly, I start by displaying all the tokens the input generates. If you do that, you'll probably see why things are going wrong. Another way would be to remove most of the source and gradually add lines of code back: at a certain point the parser will fail, and you have a starting point for solving it.
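For reference, such a token dump takes only a few lines with the Java runtime; here is a rough sketch, assuming the generated lexer class CPP14Lexer and a source file given on the command line (the wrapper class is only illustrative):

import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) throws Exception {
        // CPP14Lexer is the lexer generated from the grammars-v4 cpp grammar (Java target assumed).
        CPP14Lexer lexer = new CPP14Lexer(new ANTLRFileStream(args[0]));
        for (Token t : lexer.getAllTokens()) {
            // Symbolic token name next to the matched text, with newlines escaped for readability.
            System.out.printf("%-20s `%s`%n",
                    CPP14Lexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replace("\n", "\\n"));
        }
    }
}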
So if you dump the tokens your input is creating, you'd get these tokens:
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
UserDefinedLiteral `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 #"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 #main(i32 %arg1) {\n %1 = call i32 #"__ir_hidden`
Directive `#100007_"(i32 %arg1)`
...
you can see that the input R"~~~( ... )~~~" is not tokenised as a StringLiteral. Note that a StringLiteral will never be created, because at the top of the lexer grammar there is this rule:
Literal:
IntegerLiteral
| CharacterLiteral
| FloatingLiteral
| StringLiteral
| BooleanLiteral
| PointerLiteral
| UserDefinedLiteral;
causing none of the IntegerLiteral..UserDefinedLiteral tokens to be created: all of them become Literal tokens. It is far better to move this Literal rule to the parser instead. I must admit that while scrolling through the lexer grammar, it is a bit of a mess, and fixing the R"~~~( ... )~~~" issue will only delay another lingering problem popping its ugly head :). I am pretty sure this grammar has never been properly tested, and is full of bugs.
If you look at the lexer definition of a StringLiteral:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' .*? '(' .*? ')' .*? '"'
;
it is clear why '"' .*? '(' .*? ')' .*? '"' will not match your entire string literal: the non-greedy .*? parts stop at the first ) and the first " they encounter inside the body, long before the real )~~~" delimiter.
What you need is a rule looking like this:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
;
but that will cause the ( . )* to consume too much: it will grab every character and will then backtrack to the last quote in your character stream (not what you want).
What you really want is this:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
;
The break out of this loop when we see ')~~~"' part can be done with a semantic predicate like this:
lexer grammar CPP14Lexer;
@members {
private boolean closeDelimiterAhead(String matched) {
// Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
String delimiter = ")" + matched.substring(matched.indexOf('"') + 1, matched.indexOf('(')) + "\"";
StringBuilder ahead = new StringBuilder();
// Collect as many characters ahead as there are delimiter chars
for (int n = 1; n <= delimiter.length(); n++) {
if (_input.LA(n) == CPP14Lexer.EOF) {
throw new RuntimeException("Missing delimiter: " + delimiter);
}
ahead.append((char) _input.LA(n));
}
return delimiter.equals(ahead.toString());
}
}
...
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
;
...
If you now dump the tokens, you will see this:
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
Literal `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 #"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 #main(i32 %arg1) {\n %1 = call i32 #"__ir_hidden#100007_"(i32 %arg1)\n ret i32 %1\n}\n)~~~"`
Semi `;`
...
And there it is: R"~~~( ... )~~~" properly tokenised as a single token (albeit as a Literal token instead of a StringLiteral...). It will throw an exception when input is like R"~~~( ... )~~" or R"~~~( ... )~~~~", and it will successfully tokenise input like R"~~~( )~~" )~~~~" )~~~"
Quickly looking into the parser grammar, I see tokens like StringLiteral being referenced, but such a token will never be produced by the lexer (as I mentioned earlier).
Proceed with caution with this grammar. I would not advise using it (blindly) for anything other than some sort of educational purpose. Do not use it in production!
Below are the changes to the lexer that helped me resolve the raw string parsing issue:
Stringliteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' // Match Opening Double Quote
( /* Handle Empty D_CHAR_SEQ without Predicates
This should also work
'(' .*? ')'
*/
'(' ( ~')' | ')'+ ~'"' )* (')'+)
| D_CHAR_SEQ
/* // Limit D_CHAR_SEQ to 16 characters
{ ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
*/
'('
/* From Spec :
Any member of the source character set, except
a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
( which may be empty ) followed by a double quote ".
- The following loop consumes characters until it matches the
terminating sequence of characters for the RAW STRING
- The options are mutually exclusive, so Only one will
ever execute in each loop pass
- Each Option will execute at least once. The first option needs to
match the ')' character even if the D_CHAR_SEQ is empty. The second
option needs to match the closing \" to fall out of the loop. Each
option will only consume at most 1 character
*/
( // Consume everything but the Double Quote
~'"'
| // If text Does Not End with closing Delimiter, consume the Double Quote
'"'
{
!getText().endsWith(
")"
+ getText().substring( getText().indexOf( "\"" ) + 1
, getText().indexOf( "(" )
)
+ '\"'
)
}?
)*
)
'"' // Match Closing Double Quote
/*
// Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
// Send D_CHAR_SEQ <TAB> ... to Parser
{
setText( getText().substring( getText().indexOf("\"") + 1
, getText().indexOf("(")
)
+ "\t"
+ getText().substring( getText().indexOf("(") + 1
, getText().lastIndexOf(")")
)
);
}
*/
;
fragment D_CHAR_SEQ // Should be limited to 16 characters
: D_CHAR+
;
fragment D_CHAR
/* Any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: '\u0021'..'\u0023'
| '\u0025'..'\u0027'
| '\u002a'..'\u003f'
| '\u0041'..'\u005b'
| '\u005d'..'\u005f'
| '\u0061'..'\u007e'
;

Retrieve skipped white space In antlr4 parser from listener

I'm trying to construct an object from a parsed message.
I'm using Antlr4 and C++
My issue is that I need to skip white spaces during lexing/parsing but then I have to get them back when I construct my message object in the Listener.
Here's my grammar
grammar MessageTest;
WS: ('\t' | ' ' | '\r' | '\n' )+ -> skip;
message:
messageInfo
startOfMessage
messageText+
| EOF;
messageInfo:
senderName
filingTime
receiverName
;
senderName: WORD;
filingTime: DIGITS;
receiverName: WORD;
messageText: ( WORD | DIGITS | ALLOWED_SYMBOLS)+;
startOfMessage: START_OF_MESSAGE_SYMBOL ;
START_OF_MESSAGE_SYMBOL:':';
WORD: LETTER+;
DIGITS: DIGIT+;
LPAREN: '(';
RPAREN: ')';
ALLOWED_SYMBOLS: '-'| '.' | ',' | '/' | '+' | '?';
fragment LETTER: [A-Z];
fragment DIGIT: [0-9];
So this grammar works well; my parse tree is correct for the following message example: JOHN0120JANE:HI HOW ARE YOU?
I get this parse tree:
message (
messageInfo (
senderName (
"JOHN"
)
filingTime (
"0120"
)
receiverName (
"JANE"
)
)
startOfMessage (
":"
)
messageText (
"HI"
"HOW"
"ARE"
"YOU"
"?"
)
)
The problem is that when I'm trying to retrieve the whole messageText as
HI HOW ARE YOU?, I instead get HIHOWAREYOU? from the MessageTextContext.
What am I doing wrong?
The getText() retrieval functions never consider skipped or hidden tokens. But it's easy to get the original text of your input (even just a range that corresponds to a specific parse rule), by using the indexes stored in the generated tokens. Parse rule contexts contain a start and an end node, so it's easy to go from the context to the original input like this:
std::string MySQLRecognizerCommon::sourceTextForContext(ParserRuleContext *ctx, bool keepQuotes) {
return sourceTextForRange(ctx->start, ctx->stop, keepQuotes);
}
//----------------------------------------------------------------------------------------------------------------------
std::string MySQLRecognizerCommon::sourceTextForRange(tree::ParseTree *start, tree::ParseTree *stop, bool keepQuotes) {
Token *startToken = antlrcpp::is<tree::TerminalNode *>(start) ? dynamic_cast<tree::TerminalNode *>(start)->getSymbol()
: dynamic_cast<ParserRuleContext *>(start)->start;
Token *stopToken = antlrcpp::is<tree::TerminalNode *>(stop) ? dynamic_cast<tree::TerminalNode *>(stop)->getSymbol()
: dynamic_cast<ParserRuleContext *>(stop)->stop;
return sourceTextForRange(startToken, stopToken, keepQuotes);
}
//----------------------------------------------------------------------------------------------------------------------
std::string MySQLRecognizerCommon::sourceTextForRange(Token *start, Token *stop, bool keepQuotes) {
CharStream *cs = start->getTokenSource()->getInputStream();
size_t stopIndex = stop != nullptr ? stop->getStopIndex() : std::numeric_limits<size_t>::max();
std::string result = cs->getText(misc::Interval(start->getStartIndex(), stopIndex));
if (keepQuotes || result.size() < 2)
return result;
char quoteChar = result[0];
if ((quoteChar == '"' || quoteChar == '`' || quoteChar == '\'') && quoteChar == result.back()) {
if (quoteChar == '"' || quoteChar == '\'') {
// Replace any double occurrence of the quote char by a single one.
replaceStringInplace(result, std::string(2, quoteChar), std::string(1, quoteChar));
}
return result.substr(1, result.size() - 2);
}
return result;
}
This code is tailored towards use with MySQL (e.g. wrt. quoting characters), but is easy to adapt for any other use case. The essential part is to use the tokens (e.g. taken from a parse rule context) and get the original input from the character input stream.
Code taken from the MySQL Workbench code base.
Seems like you want Lexical Modes.
The idea of using them is simple: when your lexer encounters START_OF_MESSAGE_SYMBOL, it switches to a context in which only one token is possible, say a MESSAGE_TEXT token.
Once this token has been determined, the lexer's mode switches back to its default mode.
To do this you should first split your grammar into two parts: a lexer grammar and a parser grammar, since lexical modes are not allowed in a combined grammar. Then you can use the
pushMode() and popMode() commands.
Here's an example:
MessageTestLexer.g4
lexer grammar MessageTestLexer;
WS: ('\t' | ' ' | '\r' | '\n' )+ -> skip;
START_OF_MESSAGE_SYMBOL:':' -> pushMode(MESSAGE_MODE); //pushing MESSAGE_MODE when START_OF_MESSAGE_SYMBOL is encountered
WORD: LETTER+;
DIGITS: DIGIT+;
LPAREN: '(';
RPAREN: ')';
ALLOWED_SYMBOLS: '-'| '.' | ',' | '/' | '+' | '?';
fragment LETTER: [A-Z];
fragment DIGIT: [0-9];
mode MESSAGE_MODE; //tokens below are related to MESSAGE_MODE only
MESSAGE_TEXT: ~('\r'|'\n')*; //consuming any character until the end of the line. You can provide your own rule
END_OF_THE_LINE: ('\r'|'\n') -> popMode; //switching back to the default mode
MessageTestParser.g4
parser grammar MessageTestParser;
options {
tokenVocab=MessageTestLexer; //declaring which lexer rules to use in this parser
}
message:
messageInfo
startOfMessage
MESSAGE_TEXT //use the token instead
| EOF;
messageInfo:
senderName
filingTime
receiverName
;
senderName: WORD;
filingTime: DIGITS;
receiverName: WORD;
startOfMessage: START_OF_MESSAGE_SYMBOL;
P.S. I did not test these grammars, but it seems they should work.

Reluctant matching in ANTLR 4.4

Just as reluctant quantifiers work in regular expressions, I'm trying to parse two different tokens from my input, i.e. operand1 and the operator. My operator token should be matched reluctantly instead of the input being greedily consumed into operand1.
Example,
Input:
Active Indicator in ("A", "D", "S")
(To simplify I have removed the code relevant for operand2)
Expected operand1:
Active Indicator
Expected operator:
in
Actual output for operand1:
Active indicator in
and none for the operator rule.
Below is my grammar code:
grammar Test;
condition: leftOperand WHITESPACE* operator;
leftOperand: ALPHA_NUMERIC_WS ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC_WS: WORD ( WORD| DIGIT | WHITESPACE )* ( WORD | DIGIT)+ ;
WHITESPACE : (' ' | '\t')+;
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
One solution to this would be to not produce one token for several words but one token per word instead.
Your grammar would then look like this:
grammar Test;
condition: leftOperand operator;
leftOperand: ALPHA_NUMERIC+ ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
fragment WORD: (LOWERCASE | UPPERCASE )+ ; // fragment, so plain words surface as ALPHA_NUMERIC tokens rather than WORD
ALPHA_NUMERIC: WORD ( WORD| DIGIT)* ;
WHITESPACE : (' ' | '\t')+ -> skip; // ignoring WS completely
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
Like this, the lexer will not match the whole input as ALPHA_NUMERIC_WS, because any occurring whitespace forces the lexer to leave the ALPHA_NUMERIC rule. Therefore any following input will be given a chance to be matched by other lexer rules (in the order they are defined in the grammar).

Parse arbitrary delimiter character using Antlr4

I'm trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~#@]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter matched by the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~#@]
;
fragment SPACES
: [ \t]+
;

ANTLR4 parsing RegEx

I am trying to parse RegEx, specifically the following:
[A-Z0-9]{1,20}
The problem is, I don't know how to make the following grammar work because the Char and Int tokens are both recognizing the digit.
grammar RegEx;
regEx : (character count? )+ ;
character : Char
| range ;
range : '[' (rangeChar|rangeX)+ ']' ;
rangeX : rangeStart '-' rangeEnd ;
rangeChar : Char ;
rangeStart : Char ;
rangeEnd : Char ;
count : '{' (countExact | (countMin ',' countMax) ) '}' ;
countMin : D+ ;
countMax : Int ;
countExact : Int ;
channels {
COUNT_CHANNEL,
RANGE_CHANNEL
}
Char : D | C ;
Int : D+ -> channel(COUNT_CHANNEL) ;
Semicolon : ';' ;
Comma : ',' ;
Asterisk : '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
//CourlyBracketL : '{' ;
//CourlyBracketR : '}' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines, \r (Windows)
fragment D : [0-9] ;
fragment C : [a-zA-Z] ;
Now, I'm a noob and I am lost as to whether I should try lexer modes, channels, some ifs, or what the "normal" approach is here.
Thanks!
Putting tokens on any channel other than the default hides them from the normal operation of the parser.
Try not to combine tokens in the lexer -- it winds up losing information that can be useful in the parser.
Try this:
grammar RegEx;
regEx : ( value count? )+ ;
value : alphNum | range ;
range : LBrack set+ RBrack ;
set : b=alphNum ( Dash e=alphNum)? ;
count : LBrace min=num ( Comma max=num )? RBrace ;
alphNum : Char | Int ;
num : Int+ ;
Char : ALPHA ;
Int : DIGIT ;
Semi : ';' ;
Comma : ',' ;
Star : '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
LBrace : '{' ;
RBrace : '}' ;
LBrack : '[' ;
RBrack : ']' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
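To see how this grammar carves up an input like [A-Z0-9]{1,20}, a small driver that parses it and prints the tree helps. A minimal sketch for the Java target (RegExLexer and RegExParser are the classes ANTLR generates from grammar RegEx;):

import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        // RegExLexer / RegExParser are generated from `grammar RegEx;` (Java target assumed).
        RegExLexer lexer = new RegExLexer(new ANTLRInputStream("[A-Z0-9]{1,20}"));
        RegExParser parser = new RegExParser(new CommonTokenStream(lexer));
        // Parse starting from the regEx rule and print the tree in LISP-style notation.
        System.out.println(parser.regEx().toStringTree(parser));
    }
}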