How to write an ANTLR4 grammar rule to match a file path?

What is the best way to write an ANTLR4 grammar that matches file paths like
"C:\Users\Alex\IdeaProjects\Compiler_Project\antlrTest\src\SQL.g4"
or relative paths like
"Compiler_Project//samples//test.txt"

My guess is that you are trying to parse some sort of scripting language, like bash or zsh.
I agree that ANTLR might not be the best choice for merely parsing a file path, but that wasn't your question, was it?
Here is an excerpt from a larger grammar that parses Windows batch files.
It's worth noting that ANTLR might not be the best choice for parsing Windows batch commands either, because each command can have its own peculiar syntax that doesn't carry over to the other commands in a batch file.
That doesn't mean you can't do it, though! Here I use lexer modes (the 'island grammar' technique), which require separate lexer and parser grammar files but let you treat each command as its own little grammar.
Token reuse is a little awkward but not horrible.
BatchLexer.g4
lexer grammar BatchLexer;
options {
caseInsensitive=true;
}
CD : ('CD' | 'CHDIR') -> pushMode(CD_CMD) ;
DOT : '.' ;
DOTDOT : '..' ;
BLANK_LINE : NL ;
NL : '\n';
OPTION : '/' [a-z]+? ;
DRIVE : [a-z] ':' ; //posix?
WS : [ \t\r]+ ->skip ;
// This introduces the type name, but doesn't match anything at this scope
PATH : ~[.] ;
fragment ESCAPED_QUOTE : '\\"' ;
fragment PATH_WORD : ~[ <>:/|\r\n]+ ;
fragment RAW_PATH : DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) ;
fragment QUOTED_PATH : '"' DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) '"' ;
mode CD_CMD ;
CD_OPTION : OPTION -> type(OPTION) ;
CD_PATH : (RAW_PATH | QUOTED_PATH) -> type(PATH) ;
CD_NL : NL -> type(NL), popMode ;
CD_WS : WS ->skip ;
Batch.g4
grammar Batch;
options {
tokenVocab=BatchLexer;
caseInsensitive=true;
}
file : (command)* EOF ;
command : (
cd_cmd
)? (NL | BLANK_LINE) ;
cd_cmd : CD OPTION? PATH*? ;

Related

Running antlr4 parser for c++ on grammar file shows error 33: missing code generation template NonLocalAttrRefHeader

I am relatively new to ANTLR. I have a project that needs to be migrated from ANTLR3 (version 3.5) to ANTLR4. I have gone through the book and tried the demo, and that all works fine, but my own project gives me the following problem:
After converting an ANTLR3 project to an ANTLR4 project (resolving all warnings and errors), I was able to build the lexer.h and lexer.cpp files, but the following errors come up:
error(33): missing code generation template NonLocalAttrRefHeader
error(33): missing code generation template SetNonLocalAttrHeader
(about 50 times). I haven't been able to find any reference to these templates anywhere. Can anybody shed any light on these error messages? Because they don't mention line numbers or reference any other code, I'm completely in the dark about where to look.
I've set up a test environment, testing the demo g4 files. I have pulled the g4 file out of my (VS2017) project and tried it separately using batch files.
Because of the lack of references I can't show the actual piece of code that is the cause. I have tried a partial parse, but I haven't been able to get any clues from that.
These errors are shown:
error(33): missing code generation template NonLocalAttrRefHeader
error(33): missing code generation template SetNonLocalAttrHeader
I've constructed a small example to demonstrate the problem:
/*
* AMF Syntax definition for ANTLR.
*
*/
grammar amf;
options {
language = Cpp;
}
amf_group[amf::AmfGroup& amfGroup] locals [int jsonScope = 2]
: statements=amf_statements (GROUPSEP WS? LINE_COMMENT? EOL? | EOF)
{
amfGroup.SetStatements(std::move($statements.stmts));
}
;
amf_statements returns [amf::AmfStatements stmts]
: ( WS? ( stmt=amf_statement { stmts.emplace_back(std::move($stmt.value)); } WS? EOL) )*
;
amf_statement returns [amf::AmfStatementPtr value]
: (
{$amf_group::jsonScope == 1}? jsonparent_statement
| {$amf_group::jsonScope == 2}? jsonvalue_statement
)
{
value = std::move(context.expression(0).value);
}
;
jsonparent_statement returns [amf::AmfStatementPtr value] locals [int lineno=0]
:
(T_JSONPAR { $lineno = $T_JSONPAR.line;} ) WS (arg=integer_const)
{
value = std::make_shared<amf::JSONParentStatement>($lineno, nullptr);
}
;
jsonvalue_statement returns [amf::AmfStatementPtr value] locals [int lineno=0]
: ( T_JSONVALUE { $lineno = $T_JSONVALUE.line; } ) WS (arg=integer_const) (WS fmt=integer_const)?
{
value = std::make_shared<amf::JSONValueStatement>($lineno, std::move(arg), std::move(fmt));
}
;
integer_const returns [amf::AmfArgPtr value]
: p='%' (
(signed_int)
{
long num = std::stol($signed_int.text);
value = std::make_shared<amf::AmfArg>(ARG_TYPE::ARG_INTEGER, num);
}
| signed_float
{
value = std::make_shared<amf::AmfArg>(ARG_TYPE::ARG_INTEGER, std::stof($signed_float.text));
}
)
;
signed_int
: MINUS? INT;
signed_float
: MINUS? FLOAT;
T_JSONPAR : 'JSONPAR' | 'JSONPARENT';
T_JSONVALUE : 'JSONVAL' | 'JSONVALUE';
/* Special tokens */
GROUPSEP : '%%';
MINUS : '-';
INT : DIGIT+;
FLOAT
: DIGIT+ '.' DIGIT* EXPONENT?
| '.' DIGIT+ EXPONENT?
| DIGIT+ EXPONENT
;
ID : ('A'..'Z'|'_') ('A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: ('/*' .*? '*/') -> channel(HIDDEN)
;
LINE_COMMENT
: ('//' ~('\n'|'\r')* '\r'?) -> channel(HIDDEN)
;
EOL : ('\r'? '\n');
QOUTED_STRING
: '"$' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
SIMPLE_STRING
: '$' ~(' '|'\t'|'\r'|'\n')*
;
WS : (' '|'\t')+;
fragment
DIGIT
: '0'..'9'
;
fragment
EXPONENT
: 'E' ('+'|'-')? ('0'..'9')+
;
fragment
ESC_SEQ
: '\\' (
'R'
|'N'
|'T'
|'"'
|'\''
|'\\'
)
;
The error occurs as soon as I add the predicates to amf_statement (in this case, "missing code generation template NonLocalAttrRefHeader" appears 4 times). I have tried changing the output language to Python or CSharp, but this doesn't help.
After carefully going over all the steps, I stumbled on a small but critical difference in the batch command that runs Java: I had used a copy of my former ANTLR3 batch file, which uses the java -jar option to execute antlr-4.7.2-complete.jar instead of -cp and executing org.antlr.v4.Tool. Everything seems to go well (the command line options are displayed and the syntax errors are all reported) until the actual lexer and parser code is generated: then error(33) is displayed, but only if dynamic scoping is used.
Update: I thought I could proceed with my project, but this is only a partial solution: when I switched back to Cpp output, the errors returned. Standard output and CSharp output are okay, but as soon as I attempt to generate Cpp output I receive the same errors, again when using dynamic scoping (lines 25 and 26). If I remove the predicates, the errors disappear.
So I'm still stuck with these errors, but only for C++.

Parse arbitrary delimiter character using Antlr4

I am trying to create a grammar in ANTLR4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a semantic predicate; see the questions "Syntax of semantic predicates in Antlr4" and "Semantic predicates in ANTLR4?" for details.
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter equal to the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
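To make the predicate logic easier to follow, here is a minimal plain-Java sketch of the same idea, without ANTLR: remember the opening delimiter, consume atoms (an escaped character or any non-backslash character) while the delimiter is not ahead, then require the delimiter to close the literal. The class name DelimitedRegexScanner, the method scan and the small delimiter set are assumptions made for illustration; this is not the code ANTLR generates.

```java
final class DelimitedRegexScanner {
    // Illustrative delimiter set (an assumption, not taken from any library).
    private static final String DELIMITERS = "/~#";

    // Returns the index just past the closing delimiter of the literal that
    // starts at 'start', or -1 if no complete literal starts there.
    static int scan(String input, int start) {
        if (start >= input.length()) return -1;
        char delimiter = input.charAt(start);
        if (DELIMITERS.indexOf(delimiter) < 0) return -1;

        int i = start + 1;
        int atoms = 0;
        // Same role as the first predicate: keep going while the
        // opening delimiter is not the next character.
        while (i < input.length() && input.charAt(i) != delimiter) {
            if (input.charAt(i) == '\\') {
                if (i + 1 >= input.length()) return -1; // dangling backslash
                i += 2; // escaped character counts as one atom
            } else {
                i += 1;
            }
            atoms++;
        }
        // Same role as the second predicate: at least one atom was read and
        // the next character is the closing delimiter.
        if (atoms == 0 || i >= input.length()) return -1;
        return i + 1;
    }
}
```

For example, scan("/foo/ rest", 0) stops right after the second slash, while scan("/bar", 0) reports failure because the closing delimiter never appears.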
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

ANTLR4 RegEx lexer modes

I am working on a parser for the regular expressions used inside XSD.
My previous problem was described here: ANTLR4 parsing RegEx
I have split the lexer and parser since then.
Now I have a problem parsing parentheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like I should use modes on brackets to change the behaviour of the ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what I should do.
How do I implement the modes?
Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
}
}
}
And if you run this Main class, the following will be printed to your console:
START_GROUP (
START_CHAR_CLASS [
CHAR_CLASS_ATOM (
CHAR_CLASS_ATOM )
CHAR_CLASS_ATOM \]
END_CHAR_CLASS ]
END_GROUP )
As you can see, the ( and ) are tokenized differently outside the character class than they are inside of it.
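If it helps to see what pushMode and popMode amount to, the following plain-Java sketch imitates the mode stack by hand. It is only an illustration under simplifying assumptions: the class ModeLexerSketch and its tokenize method are hypothetical, it returns token type names only, and unlike the real lexer it does not report errors for input the grammar would reject (such as a stray ] outside a character class).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ModeLexerSketch {
    enum Mode { DEFAULT, CHAR_CLASS }

    // Returns the token type names for the input, switching rule sets
    // depending on the mode at the top of the stack.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '[') {
                    tokens.add("START_CHAR_CLASS");
                    modes.push(Mode.CHAR_CLASS); // like -> pushMode(CharClass)
                } else if (c == '(') {
                    tokens.add("START_GROUP");
                } else if (c == ')') {
                    tokens.add("END_GROUP");
                } else {
                    tokens.add("PLAIN_ATOM"); // simplification: no error handling
                }
                i++;
            } else {
                if (c == ']') {
                    tokens.add("END_CHAR_CLASS");
                    modes.pop(); // like -> popMode
                    i++;
                } else if (c == '\\' && i + 1 < input.length()) {
                    tokens.add("CHAR_CLASS_ATOM"); // backslash plus escaped char
                    i += 2;
                } else {
                    tokens.add("CHAR_CLASS_ATOM");
                    i++;
                }
            }
        }
        return tokens;
    }
}
```

Calling ModeLexerSketch.tokenize("([()\\]])") yields the same sequence of token types as the ANTLR lexer output shown above, because the parentheses are only treated as group tokens while the default mode is on top of the stack.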
You're going to have to handle this in the parser, not the lexer. When the lexer sees a '(', it returns the token LBrace. The lexer has no context about where a token is seen; it simply carves the input up into tokens. You will have to define parser rules, and when processing the parse tree you can then determine whether the LBrace was inside brackets or not.

How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,
I recently started to use ANTLR4 to translate regular expressions from XSD / XML to CVC4.
I use the grammar as specified by w3c, see http://www.w3.org/TR/xmlschema11-2/#regexs .
For this question I have simplified this grammar (by removing charClass) to:
grammar XSDRegExp;
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | QuantExact ;
quantRange : QuantExact ',' QuantExact ;
quantMin : QuantExact ',' ;
atom : NormalChar | '(' regExp ')' ; // excluded | charClass ;
QuantExact : [0-9]+ ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;
Parsing seems to go fine:
input a(bd){6,7}c{14,15}
However, I get an error message for:
input 12{3,4}
The error is:
line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
I tried a number of changes:
[1] Swapping the definitions of QuantExact and NormalChar.
But swapping introduces an error in the first input:
line 1:6 no viable alternative at input '6'
since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.
[2] Try to make a context for QuantExact (the curly brackets of quantity), such that the lexer only provides the QuantExact symbols in this limited context. But I failed to find ANTLR4 primitives for this.
So nothing seems to work, therefore my question is:
Can I parse this grammar with ANTLR4?
And if so, how?
I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.
The lexer does not "listen" to the parser: no matter whether the parser is trying to match a NormalChar, the characters 12 will always be matched as a QuantExact. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
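The two rules just described (longest match first, then definition order on a tie) can be imitated in a few lines of plain Java. This is a rough sketch rather than ANTLR's actual algorithm: the class MaximalMunchSketch and the method nextToken are made up for illustration, and only the three named lexer rules from the question are modelled (the implicit literal tokens such as '{' and ',' are left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MaximalMunchSketch {
    // Rules in definition order, mirroring the question's lexer rules.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("QuantExact", Pattern.compile("[0-9]+"));
        RULES.put("NormalChar", Pattern.compile("[^.\\\\?*+{}()|\\[\\]]"));
        RULES.put("Quantifiers", Pattern.compile("[?*+]"));
    }

    // Returns "RuleName:text" for the token starting at pos, or null if
    // none of the modelled rules matches there.
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // Strictly longer matches win; on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() - pos > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end() - pos;
            }
        }
        return bestRule == null
                ? null
                : bestRule + ":" + input.substring(pos, pos + bestLength);
    }
}
```

With this ordering, nextToken("12{3,4}", 0) picks QuantExact for the text 12 because it is the longest match, and a single digit also becomes QuantExact because that rule is listed before NormalChar. The parser's expectations play no part in either decision.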
You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | QuantExact ;
Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:
regExp : branch ( '|' branch )* ;
branch : piece* ;
piece : atom quantifier? ;
quantifier : Quantifiers | '{'quantity'}' ;
quantity : quantRange | quantMin | quantExact ;
quantRange : quantExact ',' quantExact ;
quantMin : quantExact ',' ;
atom : normalChar | '(' regExp ')' ;
normalChar : NormalChar | Digit ;
quantExact : Digit+ ;
Digit : [0-9] ;
NormalChar : ~[.\\?*+{}()|\[\]] ;
Quantifiers : [?*+] ;

How can I implement grouping preference in antlr?

I want to write a lexer and parser which could take expressions like
(4+y)*8
4+5*x
(3)+(z*(4+w))*6
And then parse them respecting the priority of multiplication over addition. In particular, I can't figure out how to avoid
4+5*x
being grouped as
MULTIPLICATION(ADDITION(4,5),x) instead of ADDITION(4,MULTIPLICATION(5,x))
My lexer looks like this:
PLUS : '+';
TIMES : '*';
NUMBER : [0-9]+'.'?[0-9]*;
VARIABLE : [(a-z)|(A-Z)]+;
OPENING : '(';
CLOSING : ')';
WHITESPACE : [ \t\r\n]+ -> skip ;
The correct grouping will happen automatically if you define your lower-priority operations closer to your "root expression" rule than the higher-priority ones (MINUS, DIV and REM below are not in your lexer; add them or drop those alternatives):
expr
: e=multDivExpr
( PLUS e=multDivExpr
| MINUS e=multDivExpr
)*
;
multDivExpr
: e=atom
( TIMES e=atom
| DIV e=atom
| REM e=atom
)*
;
atom
: NUMBER
| VARIABLE
| OPENING e=expr CLOSING
;
A simple way to understand what's going on is to think that the recursive descent parser generated by ANTLR will use multDivExpr non-terminals as "building blocks" for the "additive" expr non-terminal, therefore applying the grouping to multiplication and division before considering addition and subtraction.
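To see the "building blocks" idea outside ANTLR, here is a small hand-written recursive descent parser in plain Java that follows the same rule layering. It is a sketch under simplifying assumptions: the class PrecedenceSketch and the ADD/MUL output notation are invented for illustration, only + and * are handled, and operands are plain runs of letters and digits (no decimal point).

```java
final class PrecedenceSketch {
    private final String src;
    private int pos = 0;

    private PrecedenceSketch(String src) {
        this.src = src.replaceAll("\\s+", ""); // whitespace is skipped
    }

    // Parses the whole input and returns the grouping as a string.
    static String parse(String src) {
        PrecedenceSketch p = new PrecedenceSketch(src);
        String tree = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    // expr : multExpr (PLUS multExpr)* ;
    private String expr() {
        String left = multExpr();
        while (peek() == '+') {
            pos++;
            left = "ADD(" + left + "," + multExpr() + ")";
        }
        return left;
    }

    // multExpr : atom (TIMES atom)* ;
    private String multExpr() {
        String left = atom();
        while (peek() == '*') {
            pos++;
            left = "MUL(" + left + "," + atom() + ")";
        }
        return left;
    }

    // atom : NUMBER | VARIABLE | OPENING expr CLOSING ;
    private String atom() {
        if (peek() == '(') {
            pos++;
            String inner = expr();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected operand at " + pos);
        }
        return src.substring(start, pos);
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }
}
```

Because expr only ever combines complete multExpr results, PrecedenceSketch.parse("4+5*x") groups the multiplication first and returns ADD(4,MUL(5,x)), while parentheses in an atom restart at expr and so override that default.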