ANTLR4: skip empty lines only - regex

I am using ANTLR4 to parse a text file, and I am new to it. Here is part of the file:
abcdef
//emptyline
abcdef
As a stream of characters, it looks like this:
abcdef\r\n\r\nabcdef\r\n
ANTLR4 offers the "skip" command to discard things like whitespace, tabs, and newline characters while lexing, e.g.
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
My problem is that I want to skip empty lines only; I don't want to skip every single "\r\n". That means when two or more "\r\n" appear together, I only want to skip the second and following ones. How should I write the rule? Thank you.
grammar INIGrammar_1;
init: (section|NEWLINE)+ ;
section: '[' phase_name ':' v ']' (contents)+
| '[' phase_name ']' (contents)+ ;
//
//
phase_name : STRING
|MTT
|MPI_GET
|MPI_INSTALL
|MPI_DETAILS
|TEST_GET
|TEST_BUILD
|TEST_RUN
|REPORTER
;
v : STRING ;
contents: kvpairs
| include_section_pairs
| if_statement
| NEWLINE
| EOT
;
keylhs : STRING
;
valuerhs : STRING
|multiline_valuerhs
|kvpairs
|url
;
kvpairs: keylhs '=' valuerhs NEWLINE
;
include_section_pairs: INCLUDE_SECTION '=' STRING
;
if_statement: IF if_statement_condition THEN NEWLINE (ELSEIF if_statement_condition THEN NEWLINE)*? STRING NEWLINE IFEND NEWLINE
;
if_statement_condition: STRING '=' STRING ';' // here the semicolon is a problem, whether I write ';' or SEMICOLON
;
multiline_valuerhs:STRING (',' (' ')*? ( '\\' (' ')*? NEWLINE)? STRING)+
;
url: (' ')*? 'http' ':' '//' 'www.' ; // ignore this, not finished
IF: 'if';
ELSEIF:'elif';
IFEND:'fi';
THEN: 'then';
SEMICOLON: ';';
STRING : [a-zA-Z0-9 .\-_()#&"'/<>$]+ ; // note: '|' is a literal character inside [...], so the original alternation bars were being matched as characters
//Keywords
MTT: 'MTT';
MPI_GET: 'MPI get';
MPI_INSTALL:'MPI install';
MPI_DETAILS:'MPI Details';
TEST_GET:'Test get';
TEST_BUILD: 'Test build';
TEST_RUN: 'Test run';
REPORTER: 'Reporter';
INCLUDE_SECTION: 'include_section';
//INCLUDE_SECTION_VALUE:STRING;
EOT:'EOT';
NEWLINE: ('\r' ? '\n')+ ;
WS : [\t]+ -> skip ; // skip tabs (note: spaces are currently part of STRING)
COMMENT: '#' .*? '\r'?'\n' -> skip;
EMPTYLINE: '\r\n' -> skip;
Part of the INI file:
#======================================================================
# MPI run details
#======================================================================
[MPI Details: Open MPI]
# MPI tests
#exec = mpirun #hosts# -np &test_np() #mca# --prefix &test_prefix() &test_executable() &test_argv()
exec = mpirun #hosts# -np &test_np() --prefix &test_prefix() &test_executable() &test_argv()
hosts = &if(&have_hostfile(), "--hostfile " . &hostfile(), \
&if(&have_hostlist(), "--host " . &hostlist(), ""))
One more small thing: it seems ";" cannot be matched as itself. ANTLR4 just keeps saying it expects something else and treats the semicolon as an unknown symbol.

The short answer to your question is that whitespace is not significant to your parser, so skip it all in the lexer.
The longer answer is to recognize that skipping whitespace (or any other character sequence) does not mean that it is not significant in the lexer. All it means is that no corresponding token is produced for consumption by the parser. Skipped whitespace will therefore still operate as a delimiter for generated tokens.
A couple of additional observations:
ANTLR does not do regexes - thinking along those lines will lead to further conceptual difficulties.
Don't ignore warning and error messages produced in the generation of the lexer/parser - they almost always require correction before the generated code will function correctly.
It really helps to verify that the lexer is producing your intended token stream before trying to debug parser rules. See this answer, which shows how to dump the token stream.
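For instance, a minimal dumping harness might look like this (a sketch: INIGrammar_1Lexer is whatever class name ANTLR generated from the grammar above, and sample.ini is a placeholder file name):
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("sample.ini");
        INIGrammar_1Lexer lexer = new INIGrammar_1Lexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // run the lexer to EOF
        for (Token token : tokens.getTokens()) {
            System.out.println(token); // prints type, text, line and column
        }
    }
}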

I ran into the same issue trying to have a language that does not require a ; command delimiter.
What resolved it for me was adding the newline as a valid parser rule that does nothing.
I am no expert on this matter but it worked:
nl : NEWLINE{};
The newline token looks like this (no skipping):
NEWLINE: '\r'? '\n'; // note: inside a [...] set, '?' would be a literal character
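Applied to the question above, a minimal sketch (rule names and the key/value shape are illustrative, not the asker's full grammar) keeps NEWLINE as a real token and lets the parser absorb runs of them, so blank lines need no special lexer handling:
grammar NewlineSketch;
init    : (pair | NEWLINE)* EOF ; // blank lines surface as extra NEWLINE tokens and are consumed here
pair    : ID '=' ID NEWLINE ;
ID      : [a-zA-Z0-9_]+ ;
NEWLINE : '\r'? '\n' ;
WS      : [ \t]+ -> skip ;        // skip spaces and tabs, but keep newlines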

Related

Unexpected behaviour when parsing a string with optional Suffix in antlr4

I want to match multiple functions that accept a comma-separated list of placeholders and then the definition of a unit, which is again separated by a comma from the rest of the arguments. The text to parse would look like example 1: "produkt([F1],[F2],EURO_CENT)" or example 2: "produkt([F1],[F2],EURO)".
The grammar for this, as I would expect it to work, is this:
[...]
term: [...]
| 'produkt(' placeholder ',' placeholder ',' UNIT ')' #MultUnit
[...]
| placeholder #PlaceholderTwo
;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
LBRACK: '[';
RBRACK: ']';
PLACE: TEXT+ NUMBER?;
placeholder: LBRACK PLACE+ RBRACK;
[..]
UNIT: TEXT (('_' TEXT)*)?;
TEXT: ('a' .. 'z' | 'A' .. 'Z')+;//[a-zA-Z]+;
[...]
With this grammar, example 1 works as expected, but example 2 gives me the error "line 1:18 mismatched input 'EURO' expecting UNIT". As I understand it, this means that "EURO" itself does not match the pattern for UNIT but "EURO_CENT" does. I do not understand why this is the case, because the pattern for UNIT says that the "_CENT" part is optional and only the first part is mandatory.
I also tried giving the UNIT a prefix (in this case "Unit.") by changing the pattern for UNIT to UNIT: 'Unit.' TEXT ('_' TEXT)*;
I changed the input string to "produkt([F1],[F2],Unit.EURO)" accordingly, and this matches like a charm.
However, the second approach is not very user-friendly, since we have to add something (in our opinion) unnecessary to the input. So the question is: why does the first option not match as expected when the UNIT string is a single word, and is there a workaround for it?
The short answer is that PLACE and UNIT are mutually ambiguous for content that only matches TEXT. If the sample inputs are canonical, then change the PLACE rule to remove the ambiguity:
PLACE : TEXT+ NUMBER ;
Other possibilities include redefining PLACE as
PLACE : LBRACK TEXT+ NUMBER? RBRACK; // adjust other rules accordingly
adding a predicate to the rule:
PLACE : {followsLBRACK()}? TEXT+ NUMBER ;
and redefining UNIT:
UNIT: TEXT ( 'S' | ( '_' TEXT )+ ) ; // EUROS or EURO_CENT; similar for other units.
BTW, ANTLR generally evaluates its grammars top-down; mixing your parser and lexer rules as you have obfuscates the logic.
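The tie-breaking rule at work here, shown with a deliberately trivial grammar (illustrative only, not the asker's): when two lexer rules match the same input with equal length, the rule listed first wins, so the later token type is never produced.
grammar Priority;
start : A | B ;
A : 'euro' ; // wins every tie with B
B : 'euro' ; // never emitted; the parser can never see a B token
In the question's grammar, a bare "EURO" matches both PLACE (TEXT+ with the optional NUMBER absent) and UNIT at the same length, and PLACE is listed first, so the parser receives a PLACE token where it expected UNIT.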

ANTLR4 Lexer Matching Start of Line End Of Line

How do I achieve the equivalent of the Perl regular expression anchors ^ and $ in the ANTLR4 lexer? That is, how do I match the start of a line and the end of a line without consuming any characters?
I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line. For example, I want to isolate and toss out all C++ preprocessor directives, regardless of which directive it is, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to eliminate a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.
Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.
/* helo
world*/ # /* hel
l
o
*/ /*world */ifdef .....
is considered a valid preprocessor directive appearing on a single line (the CRLFs inside the multi-line comments are tossed).
This is what I am doing currently:
PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);
But the problem is that I have to rely on the existence of a CRLF before the # and toss out that CRLF along with the directive. I need to replace the CRLF tossed out with the CRLF of the directive line, so I have to make sure the directive is terminated by a CRLF.
However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e. with no preceding CRLF) or right before EOF (with no terminating CRLF).
If the Perl-style ^ and $ syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming CRLF.
You can use semantic predicates for the conditions.
PPLINE
: {getCharPositionInLine() == 0}?
(ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
{_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
-> channel(PPDIR)
;
You could try having multiple rules with gated semantics (different lexer rules in different states), or modes (pushMode; see http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules): an alternative rule for the beginning of the file, then switching to the core rules once the directives end. But it could be a long job.
First, though, I would check whether there really are problems parsing #pragma/preprocessor directives without changing anything: if the worry about finding a # is that it could also be present in strings and comments, then just ordering the rules should direct it to the right case (though this could be a problem for languages where you can put directives in comments).
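A minimal sketch of the mode-based idea (illustrative only: it assumes a lexer grammar with a channels { PPDIR } declaration, and it ignores the comment-before-# wrinkle discussed above):
lexer grammar PPSketch;
channels { PPDIR }
HASH    : {getCharPositionInLine() == 0}? '#' -> channel(PPDIR), pushMode(PP);
OTHER   : . ; // placeholder for the real token rules
mode PP;
PP_BODY : ~[\r\n]+   -> channel(PPDIR); // rest of the directive line
PP_END  : '\r'? '\n' -> channel(PPDIR), popMode;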

ANTLR3 String Literals and Disallowing Nested Comments

I've recently been tasked with writing an ANTLR3 grammar for a fictional language. Everything else seems fine, but I have a couple of minor issues I could use some help with:
1) Comments are between '/*' and '*/', and may not be nested. I know how to implement comments themselves ('/*' .* '*/'), but how would I go about disallowing their nesting?
2) String literals are defined as any sequence of characters (except for double quotes and newlines) between a pair of double quotes. They can only be used in an output statement. I attempted to define this thus:
output : OUTPUT (STRINGLIT | IDENT) ;
STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
For some reason, however, the parser accepts
OUTPUT "Hello,
World!"
and tokenises it as "Hello, \nWorld". Where the exclamation mark or the closing " went, I have no idea. Something to do with whitespace, maybe?
WHITESPACE : ( '\t' | ' ' | '\n' | '\r' | '\f' )+ { $channel = HIDDEN; } ;
Any advice would be much appreciated - thanks for your time! :)
The form you wrote already disallows nested comments. The token will stop at the first instance of */, even if multiple /* sequences appeared in the comment. To allow nested comments you have to write a lexer rule to specifically treat the nesting.
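For reference, a hedged sketch of what such a nesting rule could look like in ANTLR3 (the rule name is illustrative; ANTLR3 lexer rules may call themselves recursively):
ML_COMMENT
    : '/*' ( options {greedy=false;} : ML_COMMENT | . )* '*/'
      { $channel = HIDDEN; }
    ;
Each inner /* ... */ is consumed by a nested invocation, so the outer comment only ends at its own */.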
The problem here is STRINGLIT does not allow a string to be split across multiple lines. Without seeing the rest of your lexer rules, I cannot tell you how this will be tokenized, but it's clear from the STRINGLIT rule you gave that the sample input is not a valid string.
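For contrast, a rule that would accept that input has to stop excluding the newline characters, e.g. (sketch):
STRINGLIT : '"' ~('"')* '"' ; // a string may now span lines
which is precisely what the original definition is designed to forbid.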
NOTE: Your input given in the original question was not clear, so I reformatted it in an attempt to show the exact input you were using. Can you verify that my edit properly represents the input?

Groovy Regex illegal Characters

I have a Groovy script that converts some very poorly formatted data into XML. That part works fine, but it's also happily passing along some characters that aren't legal in XML. So I'm adding some code to strip them out, and this is where the problem comes in.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is: why? What am I doing wrong here? I tested this regex at http://regexpal.com/ and it works as expected, but I get an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file containing one of the illegal characters, so it's a little random. But regexpal highlights only the illegal character, while in Groovy the code is replacing even the '<' and '>' characters, basically annihilating the entire document.
The code snippet:
def List parseFile(File file) {
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()
    file.eachLine { String input ->
        lineCount++
        String line = input
        if (input =~ illegalChars) {
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each { token ->
            def val = getValue(line, token)
            if (val != null) {
                if (token.equals("ISSUE")) {
                    List entries = val.split(";")
                    document.putAt("year", entries.getAt(0).trim())
                    if (entries.size() > 1) {
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if (entries.size() > 2) {
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }
    println "done"
    return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!
line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like the compiler doesn't like having the Unicode 0 character in the source code. You should be able to fix this by doubling the backslash. This prevents Unicode escapes from being processed at the source-code level and lets the regex engine handle them instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of using alternates, and removed the range syntax where it isn't necessary.
References
regular-expressions.info/Character Classes
On doubling the backslash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, the newline character, is decoded at the source-code level in the second snippet. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.
Try this regular expression to remove literal Unicode escape sequences from a string:
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
OK, here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the character range, and that range includes <, > and all the capital letters.
The \\uNNNN form only works in "pattern", not in /pattern/.
I will edit my official answer based on the comments to this one.
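To pin down the working variant in plain Java (Groovy delegates to java.util.regex.Pattern, per the quote above; the class name and sample string are illustrative):
import java.util.regex.Pattern;

public class StripControlChars {
    public static void main(String[] args) {
        // In a double-quoted string, "\\u0000" reaches the regex engine as the
        // six characters \u0000, and Pattern decodes the escape itself.
        Pattern illegal = Pattern.compile(
                "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]");
        String dirty = "X\u0000Y\u001FZ"; // contains NUL and the unit separator
        System.out.println(illegal.matcher(dirty).replaceAll(" ")); // prints "X Y Z"
    }
}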
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax
Try
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

How do I parse this correctly with Spirit?

My situation: I'm new to Spirit, and I have to use VC6, so I am stuck with Spirit 1.6.4.
I have a line that looks like this:
//The Description;DESCRIPTION;;
I want to put the text DESCRIPTION into a string if the line starts with //The Description;.
I have something that works, but it looks inelegant to me:
vector<char> vDescription; // std::string doesn't work due to missing ::clear() in VC6's STL implementation
if(parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+~ch_p(';'))[assign(vDescription)]
),
// End grammar
space_p).hit)
{
const string desc(vDescription.begin(), vDescription.end());
}
I would much rather assign all printable characters up to the next ';', but the following won't work, because parse(...).hit == false:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+print_p)[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)
How do I make it hit?
You might try using confix_p:
confix_p(as_lower_d["//the description;"],
(+print_p)[assign(vDescription)],
ch_p(';')
)
It should be equivalent to Fred's response.
The reason your code fails is because print_p is greedy. The +print_p parser will consume characters until it encounters the end of the input or a non-printable character. Semicolon is printable, so print_p claims it. Your input gets exhausted, the variable is assigned, and the match fails: there's nothing left for the last semicolon of your parser to match.
Fred's answer constructs a new parser, (print_p - ';'), which matches everything print_p does, except for semicolons. "Match everything except X, and then match X" is a common pattern, so confix_p is provided as a shortcut for constructing that kind of parser. The documentation suggests using it for parsing C- or Pascal-style comments, but that's not required.
For your code to work, Spirit would need to recognize that the greedy print_p matched too much and then backtrack to allow matching less. But although Spirit will backtrack, it won't backtrack to the "middle" of what a sub-parser would otherwise greedily match. It will backtrack to the next "choice point," but your grammar doesn't have any. See Exhaustive backtracking and greedy RD in the Spirit documentation.
You're not getting a hit because ';' is matched by print_p. Try this:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+(print_p-';'))[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)