Use of StringTemplate in Antlr - templates

I have the following problem.
Given these rules:
defField: type VAR ( ',' VAR)* SEP ;
VAR : ('a'..'z'|'A'..'Z')+ ;
type: 'Number'|'String' ;
SEP : '\n'|';' ;
What I have to do is associate a template with the rule "defField" that returns the string representing the XML Schema for the field, that is:
Number a,b,c ; -> "<xs:element name="a" type="xs:Number"\>", and likewise for b and c.
My problem is the Kleene star: how do I write the template so that it does what I described above for every VAR matched by the '*'?
Thank you!

Collect all VAR tokens in a java.util.List by using the += operator:
defField
: t=type v+=VAR (',' v+=VAR)* SEP
;
Now v (a List) contains all VAR tokens.
Then pass t and v as parameters to a template in your StringTemplateGroup:
defField
: t=type v+=VAR (',' v+=VAR)* SEP -> defFieldSchema(type={$t.text}, vars={$v})
;
where defFieldSchema(...) must be declared in your StringTemplateGroup, which might look like (file: T.stg):
group T;
defFieldSchema(type, vars) ::= <<
<vars:{ v | \<xs:element name="<v.text>" type="xs:<type>"\>
}>
>>
The syntax for iterating over a collection is as follows:
<COLLECTION:{ EACH_ITEM_IN_COLLECTION | TEXT_TO_EMIT }>
And since vars is a List containing CommonToken objects, I grabbed each item's .text attribute instead of relying on its toString() method.
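To make the iteration syntax concrete, here is a rough Python analogue of what the vars iteration does. It is not StringTemplate itself, only a sketch of the behaviour: the body is rendered once per item and the results are concatenated. The function name def_field_schema and its parameters are made up for this illustration.

```python
def def_field_schema(type_name, var_names):
    """Rough analogue of the defFieldSchema template: one line per variable."""
    # Equivalent of <vars:{ v | ... }>: render the body once for each item.
    return "".join(
        '<xs:element name="{}" type="xs:{}">\n'.format(v, type_name)
        for v in var_names
    )


print(def_field_schema("Number", ["a", "b", "c"]), end="")
```

With an empty list the function returns an empty string, which mirrors how the template emits nothing when no VAR tokens were collected.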
Demo
Take the following grammar (file T.g):
grammar T;
options {
output=template;
}
defField
: t=type v+=VAR (',' v+=VAR)* SEP -> defFieldSchema(type={$t.text}, vars={$v})
;
type
: NUMBER
| STRING
;
NUMBER
: 'Number'
;
STRING
: 'String'
;
VAR
: ('a'..'z'|'A'..'Z')+
;
SEP
: '\n'
| ';'
;
SPACE
: ' ' {skip();}
;
which can be tested with the following class (file: Main.java):
import org.antlr.runtime.*;
import org.antlr.stringtemplate.*;
import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        StringTemplateGroup group = new StringTemplateGroup(new FileReader("T.stg"));
        ANTLRStringStream in = new ANTLRStringStream("Number a,b,c;");
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);
        parser.setTemplateLib(group);
        TParser.defField_return returnValue = parser.defField();
        StringTemplate st = (StringTemplate)returnValue.getTemplate();
        System.out.println(st.toString());
    }
}
As you will see when you run this class, it parses the input "Number a,b,c;" and produces the following output:
<xs:element name="a" type="xs:Number">
<xs:element name="b" type="xs:Number">
<xs:element name="c" type="xs:Number">
EDIT
To run the demo, make sure you have all of the following files in the same directory:
T.g (the combined grammar file)
T.stg (the StringTemplateGroup file)
antlr-3.3.jar (the latest stable ANTLR build as of this writing)
Main.java (the test class)
then execute the following commands from your OS's shell/prompt (from the same directory all the files are in):
java -cp antlr-3.3.jar org.antlr.Tool T.g # generate the lexer & parser
javac -cp antlr-3.3.jar *.java # compile all .java source files
java -cp .:antlr-3.3.jar Main # run the main class (*nix)
# or
java -cp .;antlr-3.3.jar Main # run the main class (Windows)
It is probably not necessary to mention, but the # and the text after it are not part of the commands: they are only comments indicating what each command is for.

Related

How to write Antlr4 grammar rule to match file path?

What is the best way to write an ANTLR4 grammar that matches file paths like
"C:\Users\Alex\IdeaProjects\Compiler_Project\antlrTest\src\SQL.g4"
or relative paths like
"Compiler_Project//samples//test.txt"
My guess is that you are trying to parse some sort of scripting language, like bash or zsh.
I agree that ANTLR might not be the best choice merely to parse a file path, but that wasn't your question, was it?
Here is an excerpt from a larger grammar that parses Windows batch files.
It's worth noting again that ANTLR might not be the best choice for parsing Windows batch commands either, because each command can have peculiar syntax that doesn't readily apply to the other commands in a batch file.
That doesn't mean you can't do it, though! Here, I use the 'island grammar' feature (implemented with lexer modes), which requires separate lexer.g4 and grammar.g4 files but allows you to treat each command as its own little grammar.
Token reuse is a little awkward but not horrible.
BatchLexer.g4
lexer grammar BatchLexer;
options {
caseInsensitive=true;
}
CD : ('CD' | 'CHDIR') -> pushMode(CD_CMD) ;
DOT : '.' ;
DOTDOT : '..' ;
BLANK_LINE : NL ;
NL : '\n';
OPTION : '/' [a-z]+? ;
DRIVE : [a-z] ':' ; //posix?
WS : [ \t\r]+ ->skip ;
// This introduces the type name, but doesn't match anything at this scope
PATH : ~[.] ;
fragment ESCAPED_QUOTE : '\\"' ;
fragment PATH_WORD : ~[ <>:/|\r\n]+ ;
fragment RAW_PATH : DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) ;
fragment QUOTED_PATH : '"' DRIVE? (DOT | DOTDOT | ESCAPED_QUOTE | PATH_WORD) '"' ;
mode CD_CMD ;
CD_OPTION : OPTION -> type(OPTION) ;
CD_PATH : (RAW_PATH | QUOTED_PATH) -> type(PATH) ;
CD_NL : NL -> type(NL), popMode ;
CD_WS : WS ->skip ;
Batch.g4
grammar Batch;
options {
tokenVocab=BatchLexer;
caseInsensitive=true;
}
file : (command)* EOF ;
command : (
cd_cmd
)? (NL | BLANK_LINE) ;
cd_cmd : CD OPTION? PATH*? ;

Running antlr4 parser for c++ on grammar file shows error 33: missing code generation template NonLocalAttrRefHeader

I am relatively new to ANTLR. I have a current project that needs to be migrated from ANTLR3 (version 3.5) to ANTLR4. I have gone through the book and tried the demo, and this all works fine, but my own project gives me the following problem:
After converting an ANTLR3 project to an ANTLR4 project (resolving all warnings and errors), I was able to build the lexer.h and lexer.cpp files, but the following errors come up:
error(33): missing code generation template NonLocalAttrRefHeader
error(33): missing code generation template SetNonLocalAttrHeader
(about 50 times). I haven't been able to find any reference to these templates anywhere. Can anybody shed any light on these error messages? Because they don't mention line numbers or reference any other code, I'm completely in the dark about where to look.
I've set up a test environment, testing the demo g4 files. I have pulled the g4 file out of my (VS2017) project and tried it separately using batch files.
Because of the lack of references I can't show the actual piece of code that is the cause. I have tried a partial parse, but I haven't been able to get any clues from that.
These errors are shown:
error(33): missing code generation template NonLocalAttrRefHeader
error(33): missing code generation template SetNonLocalAttrHeader
I've constructed a small example to demonstrate the problem:
/*
* AMF Syntax definition for ANTLR.
*
*/
grammar amf;
options {
language = Cpp;
}
amf_group[amf::AmfGroup& amfGroup] locals [int jsonScope = 2]
: statements=amf_statements (GROUPSEP WS? LINE_COMMENT? EOL? | EOF)
{
amfGroup.SetStatements(std::move($statements.stmts));
}
;
amf_statements returns [amf::AmfStatements stmts]
: ( WS? ( stmt=amf_statement { stmts.emplace_back(std::move($stmt.value)); } WS? EOL) )*
;
amf_statement returns [amf::AmfStatementPtr value]
: (
{$amf_group::jsonScope == 1}? jsonparent_statement
| {$amf_group::jsonScope == 2}? jsonvalue_statement
)
{
value = std::move(context.expression(0).value);
}
;
jsonparent_statement returns [amf::AmfStatementPtr value] locals [int lineno=0]
:
(T_JSONPAR { $lineno = $T_JSONPAR.line;} ) WS (arg=integer_const)
{
value = std::make_shared<amf::JSONParentStatement>($lineno, nullptr);
}
;
jsonvalue_statement returns [amf::AmfStatementPtr value] locals [int lineno=0]
: ( T_JSONVALUE { $lineno = $T_JSONVALUE.line; } ) WS (arg=integer_const) (WS fmt=integer_const)?
{
value = std::make_shared<amf::JSONValueStatement>($lineno, std::move(arg), std::move(fmt));
}
;
integer_const returns [amf::AmfArgPtr value]
: p='%' (
(signed_int)
{
long num = std::stol($signed_int.text);
value = std::make_shared<amf::AmfArg>(ARG_TYPE::ARG_INTEGER, num);
}
| signed_float
{
value = std::make_shared<amf::AmfArg>(ARG_TYPE::ARG_INTEGER, std::stof($signed_float.text));
}
)
;
signed_int
: MINUS? INT;
signed_float
: MINUS? FLOAT;
T_JSONPAR : 'JSONPAR' | 'JSONPARENT';
T_JSONVALUE : 'JSONVAL' | 'JSONVALUE';
/* Special tokens */
GROUPSEP : '%%';
MINUS : '-';
INT : DIGIT+;
FLOAT
: DIGIT+ '.' DIGIT* EXPONENT?
| '.' DIGIT+ EXPONENT?
| DIGIT+ EXPONENT
;
ID : ('A'..'Z'|'_') ('A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: ('/*' .*? '*/') -> channel(HIDDEN)
;
LINE_COMMENT
: ('//' ~('\n'|'\r')* '\r'?) -> channel(HIDDEN)
;
EOL : ('\r'? '\n');
QOUTED_STRING
: '"$' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
SIMPLE_STRING
: '$' ~(' '|'\t'|'\r'|'\n')*
;
WS : (' '|'\t')+;
fragment
DIGIT
: '0'..'9'
;
fragment
EXPONENT
: 'E' ('+'|'-')? ('0'..'9')+
;
fragment
ESC_SEQ
: '\\' (
'R'
|'N'
|'T'
|'"'
|'\''
|'\\'
)
;
The error occurs as soon as I add the predicates to amf_statement (in this case, "missing code generation template NonLocalAttrRefHeader" is reported 4 times). I have tried changing the output language to Python or CSharp, but this doesn't help.
After carefully reviewing all the steps, I stumbled on a small but critical difference in the batch command that runs Java: I used a copy of my former ANTLR3 batch file, which uses the java -jar option to execute antlr-4.7.2-complete.jar instead of using -cp and executing org.antlr.v4.Tool. All seems to go well (the command-line options are displayed correctly and the syntax errors are all reported) until the actual lexer and parser code is generated: then error(33) is displayed, but only if dynamic scoping is used.
Update: I thought I could proceed with my project, but this is only a partial solution: when I switched back to Cpp output, the errors returned. The standard output and CSharp output are okay, but as soon as I attempt to generate Cpp output I receive the same errors, again when using dynamic scoping (lines 25 and 26). If I remove the predicates, the errors disappear.
So I'm still stuck with these errors, but only for C++.

How do I get a different parser to work on content identified by nestedExpr in pyparsing

I have a nested block of text that is not necessarily well formatted (indented) which needs to be parsed and operated on further.
A sample of the text:
cell(hi) {
param1 : true;
param2 : false;
func1() {
param3 : hello;
param4 : hi;
}
func2() {
param5 : 10;
nestedFunc1() {
nestedParam6 : 20;
nestedFunc2(args) {
index1(a,b,c,d,e);
values(1,2,3,4,\
5);
}
}
}
}
The above snippet is a sample of the text I intend to parse.
So far, here is the parser I have come up with:
Group(Word("cell") + QuotedString('(', escChar=None, endQuoteChar=')') + nestedExpr(opener='{', closer='}')).setResultsName("cell")
I tried using the content=LineEnd() argument for the nestedExpr call but that does not get me what I expect.
Apart from that, I have a few other parsers for the content inside the cell {...} wrapper. Examples:
params = Group(Regex(r'(.*)\s*:\s*(.*);')).setResultsName("params")
LookupTables = Group(Word(k) + QuotedString('(', escChar=None, endQuoteChar=')') + QuotedString("{", multiline=True, endQuoteChar="}") )
I was wondering if there is an efficient way to parse the nested block and obtain the output in the following format:
[
['cell', 'hi']
{'param1': 'true'}
{'param2': 'false'}
['func1'
{'param3': 'hello'}
{'param4': 'hi'}
]
['func2'
{'param5': '10'}
['nestedFunc1'
{'index1(a,b,c,d,e)': None}
{'values(1,2,3,4,5)': None}
]
]
]
Basically, I am trying to obtain the Look-up Tables inside of the data:
index1(a) = 1
index1(b) = 2
index1(c) = 3
index1(d) = 4
index1(e) = 5
When I try to use nestedExpr on my input text, I obtain a nested list of all content separated by spaces without the new line characters (shown below). I will at the very least need the new line characters so that I can join the output and re-create the original nested block of data to apply my own parsers using setParseAction etc.
Here is the output I obtain currently:
Parser:
tp = Group(Word("cell") + QuotedString('(', escChar=None, endQuoteChar=')') + nestedExpr(opener='{', closer='}') ).setResultsName("cell")
Output:
[['cell', 'hi', ['param1', ':', 'true;', 'param2', ':', 'false;', 'func1()', ['param3', ':', 'hello;', 'param4', ':', 'hi;'], 'func2()', ['param5', ':', '10;', 'nestedFunc1()', ['nestedParam6', ':', '20;', 'nestedFunc2(args)', ['index1(a,b,c,d,e);', 'values(1,2,3,4,\\', '5);']]]]]]
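One way to see why splitting on whitespace loses the statement structure is to tokenize on the structural characters ({, } and ;) instead. The following is a minimal hand-written sketch in plain Python, deliberately not using pyparsing, that builds nested lists for blocks and dicts for statements. The names TOKEN and parse_block, the exact output shape, and the shortened sample are assumptions made for illustration; block headers are kept as raw text such as 'func1()'.

```python
import re

# A token is one of the structural characters { } ; or a run of anything else.
TOKEN = re.compile(r"[{};]|[^{};]+")


def parse_block(text):
    """Parse 'name(args) { ... }' blocks into nested lists and dicts."""
    # Join lines that were continued with a trailing backslash.
    text = re.sub(r"\\\s*\n\s*", "", text)
    tokens = [t.strip() for t in TOKEN.findall(text) if t.strip()]
    pos = 0

    def items():
        nonlocal pos
        out = []
        while pos < len(tokens) and tokens[pos] != "}":
            head = tokens[pos]
            following = tokens[pos + 1] if pos + 1 < len(tokens) else None
            if following == "{":
                pos += 2                      # consume the header and '{'
                block = [head] + items()
                if pos >= len(tokens):
                    raise ValueError("missing '}' for " + head)
                pos += 1                      # consume the matching '}'
                out.append(block)
            elif following == ";":
                key, colon, value = head.partition(":")
                out.append({key.strip(): value.strip() if colon else None})
                pos += 2                      # consume the statement and ';'
            else:
                raise ValueError("expected '{' or ';' after " + head)
        return out

    return items()


sample = "\n".join([
    "cell(hi) {",
    "  param1 : true;",
    "  func1() {",
    "    values(1,2,\\",
    "       3);",
    "  }",
    "}",
])
print(parse_block(sample))
```

For the shortened sample this prints [['cell(hi)', {'param1': 'true'}, ['func1()', {'values(1,2,3)': None}]]]. The same recursion (a block is a header followed by statements or further blocks) is what a recursive pyparsing grammar would need to express.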

ANTLR4 RegEx lexer modes

I am working on a RegEx parser for the regular expressions used inside XSD.
My previous problem was described here: ANTLR4 parsing RegEx
I have split the lexer and parser since then.
Now I have a problem parsing parentheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like I should use modes on brackets to change the behaviour of the ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what I should do.
How do I implement the modes?
Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
        for (Token token : lexer.getAllTokens()) {
            System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
        }
    }
}
And if you run this Main class, the following will be printed to your console:
START_GROUP          (
START_CHAR_CLASS     [
CHAR_CLASS_ATOM      (
CHAR_CLASS_ATOM      )
CHAR_CLASS_ATOM      \]
END_CHAR_CLASS       ]
END_GROUP            )
As you can see, the ( and ) are tokenized differently outside the character class than they are inside it.
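For readers who want to see the mode mechanism without ANTLR, here is a minimal Python sketch of the same idea: a stack of modes, where the top of the stack decides which rules are tried. The names RULES and tokenize are made up for this illustration. Unlike ANTLR, the sketch takes the first rule that matches rather than the longest match, which gives the same result for these particular rules because no two rules in a mode can start with the same character.

```python
import re

# Token rules per mode, tried in order; they mirror the RegexLexer grammar.
RULES = {
    "DEFAULT": [
        ("START_CHAR_CLASS", re.compile(r"\["), "push"),
        ("START_GROUP", re.compile(r"\("), None),
        ("END_GROUP", re.compile(r"\)"), None),
        ("PLAIN_ATOM", re.compile(r"[^()\[\]]", re.S), None),
    ],
    "CHAR_CLASS": [
        ("END_CHAR_CLASS", re.compile(r"\]"), "pop"),
        ("CHAR_CLASS_ATOM", re.compile(r"\\.|[^\r\n\\\]]", re.S), None),
    ],
}


def tokenize(text):
    modes = ["DEFAULT"]  # mode stack; the top entry selects the active rules
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern, action in RULES[modes[-1]]:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                if action == "push":
                    modes.append("CHAR_CLASS")
                elif action == "pop":
                    modes.pop()
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens


for name, text in tokenize(r"([()\]])"):
    print("%-20s %s" % (name, text))
```

Run on the same input as the Java demo, this yields the same seven tokens: the parentheses inside the brackets come out as CHAR_CLASS_ATOM, while the outer ones are START_GROUP and END_GROUP.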
You're going to have to handle this in the parser, not the lexer. When the lexer sees a '(', it will return the token LBrace. The lexer has no context about where the token was seen; it simply carves up the input into tokens. You will have to define parser rules and, when processing the parse tree, determine whether the LBrace was inside brackets or not.

How do I use a regex to find a duplicated string on the left and the right of an equals sign

I have a file with some translations and some missing translations, where the English key equals the translation.
...
/* comment1 */
"An unexpected error occurred." = "Ein unerwarteter Fehler ist aufgetreten.";
/* comment2 */
"Enter it here..." = "Enter it here...";
...
Is it possible to:
Find all occurrences of "X" = "X";?
Bonus: For all occurrences delete the line, the comment line above and newline above that?
You'll need to use backreferences here, something along the lines of:
/"(.+)"\s*=\s*"\1"/
  ^            ^
  |            |
  |            backreference to first string
  |
  capture group for first string
Note that the syntax for backreferences varies between languages, the above one works for your case in Ruby, e.g.
❯ irb
2.2.2 :001 > r = /"(.+)"\s*=\s*"\1"/
=> /"(.+)"\s*=\s*"\1"/
2.2.2 :002 > r.match('"foo" = "foo"')
=> #<MatchData "\"foo\" = \"foo\"" 1:"foo">
2.2.2 :003 > r.match('"foo" = "bar"')
=> nil
In response to your comment about wanting to do it in a text editor, remove the leading/trailing slashes and the above regex should work fine in Sublime Text... YMMV in other editors.
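The same backreference idea carries over to other regex flavours. As a minimal sketch in Python (the function name untranslated_keys and the sample lines are only illustrative; the pattern is additionally anchored to whole lines and expects the trailing semicolon):

```python
import re

# "X" = "X"; where group 1 captures the key and \1 demands the same text again.
DUPLICATE = re.compile(r'^"(.+)"\s*=\s*"\1";\s*$')


def untranslated_keys(lines):
    """Return the keys whose translation is identical to the key."""
    return [m.group(1) for m in map(DUPLICATE.match, lines) if m]


sample = [
    '/* comment1 */',
    '"An unexpected error occurred." = "Ein unerwarteter Fehler ist aufgetreten.";',
    '/* comment2 */',
    '"Enter it here..." = "Enter it here...";',
]
print(untranslated_keys(sample))
```

Only the second entry is reported, because its right-hand side repeats the captured key exactly.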
For the Bonus question:
(\R\R)?+/\*[^*]*(?:\*+(?!/)[^*]*)*\*/\R("[^"]*") = \2;(?(1)|\R{0,2})
(Works with Notepad++; it also removes the newline above, except for the first item.)
You can find all the occurrences by matching each line against the following pattern: "(.*?)"\s*=\s*"\1". If a line matches, you can delete it.
Java working example
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackOverflow32507709 {
    // "X" = "X": \1 must repeat the text captured by the first group
    public static String pattern;
    static {
        pattern = "\"(.*?)\"\\s*=\\s*\"\\1\"";
    }

    public static void main(String[] args) {
        String[] text = {
            "/* comment1 */",
            "\r\n",
            "\"An unexpected error occurred\" = \"German translation...\";\r\n",
            "\r\n",
            "\"Enter it here\" = \"Enter it here\";\r\n"
        };
        List<String> filteredTranslations = new ArrayList<String>();
        Pattern p = Pattern.compile(pattern);
        for (String line : text) {
            Matcher m = p.matcher(line);
            // keep only the lines that are not of the form "X" = "X"
            if (!m.find()) {
                filteredTranslations.add(line);
            }
            m.reset();
        }
        for (String filteredTranslation : filteredTranslations) {
            System.out.println(filteredTranslation);
        }
    }
}
You need to use a backreference, like this: http://www.regular-expressions.info/backref.html
I can't give you a full answer because you haven't said which programming language you are using, but I'm sure you can figure it out from there.