Unexpected behaviour when parsing a string with optional Suffix in antlr4 - regex

I want to match multiple functions that accept a comma-separated list of placeholders followed by the definition of a unit, which is again separated by a comma from the rest of the arguments. The text to parse would look like example 1: "produkt([F1],[F2],EURO_CENT)" or example 2: "produkt([F1],[F2],EURO)"
The grammar, as I would expect it to work, is this:
[...]
term: [...]
| 'produkt(' placeholder ',' placeholder ',' UNIT ')' #MultUnit
[...]
| placeholder #PlaceholderTwo
;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
LBRACK: '[';
RBRACK: ']';
PLACE: TEXT+ NUMBER?;
placeholder: LBRACK PLACE+ RBRACK;
[..]
UNIT: TEXT (('_' TEXT)*)?;
TEXT: ('a' .. 'z' | 'A' .. 'Z')+;//[a-zA-Z]+;
[...]
With this grammar, example 1 works as expected, but example 2 gives me the error "line 1:18 mismatched input 'EURO' expecting UNIT". As I understand it, this means that "EURO" itself does not match the pattern for UNIT but "EURO_CENT" does. I do not understand why this is the case, because the pattern for UNIT says that the "_CENT" part is optional and only the first part is mandatory.
I also tried to give the UNIT a prefix (in this case "Unit.") by changing the pattern to UNIT: 'Unit.' TEXT ('_' TEXT)*;
I changed the input string to "produkt([F1],[F2],Unit.EURO)" accordingly, and this matches like a charm.
However, the second approach is not very user-friendly, since we have to add something that is (in our opinion) unnecessary to the input. So the question is: why does the first option not match as expected when the UNIT string is a single word, and is there a workaround for it?

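The short answer is that PLACE and UNIT are mutually ambiguous for content that only matches TEXT. When two lexer rules match the same longest stretch of input, ANTLR picks the rule that is defined first, so EURO becomes a PLACE token, while EURO_CENT can only be a UNIT. If the sample inputs are canonical, then change the PLACE rule to remove the ambiguity: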
PLACE : TEXT+ NUMBER ;
Other possibilities include redefining PLACE as
PLACE : LBRACK TEXT+ NUMBER? RBRACK; // adjust other rules accordingly
adding a predicate to the rule:
PLACE : {followsLBRACK()}? TEXT+ NUMBER ;
and redefining UNIT:
UNIT: TEXT ( 'S' | ( '_' TEXT )+ ) ; // EUROS or EURO_CENT; similar for other units.
BTW, ANTLR generally evaluates its grammars top-down, so mixing parser and lexer rules as you have obfuscates the logic.
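To make the tie-breaking concrete, here is a rough Python simulation of how the lexer chooses a token: longest match first, and on a tie the rule that appears first in the grammar. This is only a sketch, not ANTLR itself. NUMBER is elided in the question, so [0-9]+ is an assumption, and first_token, RULES and FIXED_RULES are made-up names.

```python
import re

# Lexer rules in the order they appear in the question's grammar.
# Assumption: NUMBER is [0-9]+ (its definition is elided in the question).
RULES = [
    ("PLACE", re.compile(r"[A-Za-z]+(?:[0-9]+)?")),      # TEXT+ NUMBER?
    ("UNIT", re.compile(r"[A-Za-z]+(?:_[A-Za-z]+)*")),   # TEXT ('_' TEXT)*
    ("TEXT", re.compile(r"[A-Za-z]+")),
]

# Same rules, but NUMBER is mandatory in PLACE, as suggested in the answer.
FIXED_RULES = [("PLACE", re.compile(r"[A-Za-z]+[0-9]+"))] + RULES[1:]


def first_token(text, rules=RULES):
    """Return (rule name, lexeme) for the token at the start of text."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text)
        # Strictly longer matches win; on a tie the earlier rule is kept.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        return None
    name, length = best
    return name, text[:length]


print(first_token("EURO_CENT"))          # only UNIT can cover the underscore
print(first_token("EURO"))               # PLACE and UNIT tie, PLACE is first
print(first_token("EURO", FIXED_RULES))  # PLACE no longer matches plain text
```

With the original rules, "EURO" comes out as PLACE, which is why the parser complains that it expected UNIT. With NUMBER made mandatory, the same input comes out as UNIT.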

Related

How to interpret Regex subtraction with grouping

I would be grateful if someone could explain how the following regex should be interpreted; it is from the W3C reference for Namespaces in XML 1.0, and defines an NCName ([4]) as:
Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
I can understand subtraction when applied to lists, such as:
[a-z-[aeiuo]] representing the list of all consonants (see http://www.regular-expressions.info/charclasssubtract.html), but not when applied to a group (apologies if this is the wrong term) as shown above.
The comment indicates how I should interpret the regex, but I'm struggling; why not just:
Name - ( ':' )
If the intention is for NCName to be Name minus ':', then why are the zero or more characters required on either side? (I'm not asking a separate question, just indicating my area of confusion.)
Please accept my thanks in advance.
The documents published by W3C use a variant of the EBNF Notation to describe the languages standardized by them.
It is described in section "6 Notation" of the XML Recommendation.
The example you posted:
NCName ::= Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
How to read it:
NCName is the object described by the rule;
::= separates the name of the described object (on the left) from the expression that describes it (on the right);
Name is an object already described by another rule;
- is the except symbol; A - B in EBNF means "matches A but doesn't match B";
(...) - the parentheses create a group; they make the expression inside them behave as a single item;
Char is another object already described by another rule in the documentation; it basically means a Unicode character;
* - repetition, matches the previous item zero or more times;
':' - string in single or double quotes is a string literal; it represents itself; here, the colon character;
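Put together, it means an NCName is a Name that doesn't contain ':'. The Char* on either side is what makes this work: A - B removes every string that matches B as a whole, so Name - ':' would only exclude the one-character name ":", while Char* ':' Char* matches any string with a colon anywhere in it.
The comment seems incorrect (or maybe it is just badly worded).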
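If it helps, the difference between the two readings can be checked with a small Python sketch. The NAME pattern below is an ASCII-only stand-in for the real Name production (an assumption; the actual rule allows many Unicode ranges), and the function names are invented for the illustration.

```python
import re

# Assumption: simplified, ASCII-only approximation of the XML Name production.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")
# Char* ':' Char*  ->  any string that contains at least one colon.
HAS_COLON = re.compile(r".*:.*", re.DOTALL)


def is_name(s):
    return NAME.fullmatch(s) is not None


def is_ncname(s):
    """Name - (Char* ':' Char*): a Name that is not a string containing ':'."""
    return is_name(s) and HAS_COLON.fullmatch(s) is None


def is_name_minus_single_colon(s):
    """Name - ':': a Name that is not exactly the string ':'."""
    return is_name(s) and s != ":"


for candidate in ["svg", "xlink:href", ":"]:
    print(candidate, is_ncname(candidate), is_name_minus_single_colon(candidate))
```

The second function still accepts "xlink:href", because that string is not equal to ":"; only the version with Char* on both sides rejects it.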

ANTLR4 skips empty line only

I am using ANTLR4 to parse a text file and I am new to it. Here is part of the file:
abcdef
//emptyline
abcdef
In the file stream, the string looks like this:
abcdef\r\n\r\nabcdef\r\n
ANTLR4 offers the "skip" command to discard things like whitespace, tabs, and newline characters while lexing, e.g.
WS : [\t\s\r\n]+ -> skip ; // skip spaces, tabs, newlines
My problem is that I want to skip empty lines only. I don't want to skip every single "\r\n". In other words, when two or more "\r\n" appear together, I only want to skip the second and any following ones. How should I write the regular expression? Thank you.
grammar INIGrammar_1;
init: (section|NEWLINE)+ ;
section: '[' phase_name ':' v ']' (contents)+
| '[' phase_name ']' (contents)+ ;
//
//
phase_name : STRING
|MTT
|MPI_GET
|MPI_INSTALL
|MPI_DETAILS
|TEST_GET
|TEST_BUILD
|TEST_RUN
|REPORTER
;
v : STRING ;
contents: kvpairs
| include_section_pairs
| if_statement
| NEWLINE
| EOT
;
keylhs : STRING
;
valuerhs : STRING
|multiline_valuerhs
|kvpairs
|url
;
kvpairs: keylhs '=' valuerhs NEWLINE
;
include_section_pairs: INCLUDE_SECTION '=' STRING
;
if_statement: IF if_statement_condition THEN NEWLINE (ELSEIF if_statement_condition THEN NEWLINE)*? STRING NEWLINE IFEND NEWLINE
;
if_statement_condition:STRING '=' STRING ';'//here, semicolon has problem, either I use ';' or SEMICOLON
;
multiline_valuerhs:STRING (',' (' ')*? ( '\\' (' ')*? NEWLINE)? STRING)+
;
url:(' ')*?'http'':''//''www.';//ignore this, not finished.
IF: 'if';
ELSEIF:'elif';
IFEND:'fi';
THEN: 'then';
SEMICOLON: ';';
STRING : [a-z|A-Z|0-9|''| |.|\-|_|(|)|#|&|""|/|#|<|>|$]+ ;
//Keywords
MTT: 'MTT';
MPI_GET: 'MPI get';
MPI_INSTALL:'MPI install';
MPI_DETAILS:'MPI Details';
TEST_GET:'Test get';
TEST_BUILD: 'Test build';
TEST_RUN: 'Test run';
REPORTER: 'Reporter';
INCLUDE_SECTION: 'include_section';
//INCLUDE_SECTION_VALUE:STRING;
EOT:'EOT';
NEWLINE: ('\r' ? '\n')+ ;
WS : [\t]+ -> skip ; // skip spaces, tabs, newlines
COMMENT: '#' .*? '\r'?'\n' -> skip;
EMPTYLINE: '\r\n' -> skip;
Part of the INI file
#======================================================================
# MPI run details
#======================================================================
[MPI Details: Open MPI]
# MPI tests
#exec = mpirun #hosts# -np &test_np() #mca# --prefix &test_prefix() &test_executable() &test_argv()
exec = mpirun #hosts# -np &test_np() --prefix &test_prefix() &test_executable() &test_argv()
hosts = &if(&have_hostfile(), "--hostfile " . &hostfile(), \
&if(&have_hostlist(), "--host " . &hostlist(), ""))
One more small thing: it seems that ";" cannot be matched as itself. ANTLR4 keeps saying it expects something else and treats the semicolon as an unknown symbol.
The short answer to your question is that whitespace is not significant to your parser, so skip it all in the lexer.
The longer answer is to recognize that skipping whitespace (or any other character sequence) does not mean that it is not significant in the lexer. All it means is that no corresponding token is produced for consumption by the parser. Skipped whitespace will therefore still operate as a delimiter for generated tokens.
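As a rough illustration in Python (not ANTLR), here is a tiny tokenizer in which spaces and tabs are skipped yet still separate words, and a run of line breaks becomes a single NEWLINE token, in the spirit of the NEWLINE: ('\r' ? '\n')+ rule in the question. The token names and the WORD pattern are assumptions made for the sketch.

```python
import re

# One alternative per token kind; NEWLINE swallows a whole run of line breaks.
TOKEN = re.compile(
    r"(?P<NEWLINE>(?:\r?\n)+)"   # one token per run of line breaks
    r"|(?P<WS>[ \t]+)"           # matched, then thrown away
    r"|(?P<WORD>[^\s]+)"         # assumption: anything else is a word
)


def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "WS":
            continue  # "skip": no token is emitted, but the words stay separate
        tokens.append((m.lastgroup, m.group()))
    return tokens


print(tokenize("abcdef\r\n\r\nabcdef\r\n"))
print(tokenize("ab cd"))
```

The empty line does not show up as a token of its own; it is absorbed into the NEWLINE token that ends the previous line, so the parser only ever sees one NEWLINE between the two words.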
A couple of additional observations:
ANTLR does not do regexes - thinking along those lines will lead to further conceptual difficulties.
Don't ignore warning and error messages produced when generating the lexer/parser - they almost always require correction before the generated code will function correctly.
It really helps to verify that the lexer is producing your intended token stream before trying to debug parser rules. See this answer that shows how to dump the token stream.
I ran into the same issue trying to have a language that does not require a ; command delimiter.
What resolved it for me was adding the newline as a valid parser rule that does nothing.
I am no expert on this matter, but it worked:
nl : NEWLINE{};
The NEWLINE token looks like this (no skipping):
NEWLINE: '\r'? '\n';

empty rule in ocamlyacc

I have the following lexer rules:
let ws = [' ' '\t' '\n']+
...
| ws {Printf.printf "%s" (Lexing.lexeme lexbuf); WS(Lexing.lexeme lexbuf)}
And the following parser rules:
%token <string> WORD WS
cs : LSQRB wsornon choices wsornon RSQRB {$2}
;
wsornon : /* nothing */
| WS {$1}
;
choices : choice {$1}
| choices choice {$2}
;
choice : CHOICE LCURLYB mainbody RCURLYB {$3}
;
I basically want to get wsornon to match with whitespace or nothing. But cs gives syntax errors for the case without whitespace (which corresponds to the empty rule).
Am I missing something?
Even when an alternative matches the empty stream, it should still have a semantic action:
wsornon:
| { something for nothing }
| WS { something for whitespace }
Note that Menhir has an option parameterized rule that handles this kind of thing, so that you don't have to write another rule for it. In fact, option(foo) returns a value of type bar option if rule foo returns something of type bar, whereas you're going to ignore the result anyway, so that's a slightly different situation.
If you want to ignore whitespace, why don't you drop it altogether at the lexer step? Is it useful somewhere else in your grammar? I'd rather hack the lexer a bit to have some whitespace token just after some tokens where I know they're important than have them pollute my whole grammar. Of course, Menhir allows you to define parameterized rules that could help with that (example below untested):
ws(rule):
| list(WS) result = rule list(WS) { result }

Writing regular expressions and rules in Sublime Text 2 syntax definitions

I'm very interested in syntax definitions for Sublime Text 2.
I've studied the basics, but I don't know how to write a regex (and rules) for something like
variable = sentence, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (which doesn't work).
How do I write this regex properly?
Also, I have some difficulty defining a regex for a block:
IF i FROM .. TO ..
...
ELSE
...
END IF
How do I write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You already noticed this when you wrote 'variable = sentence'.
For this, you can use Spoofax or JavaCUP for grammar definitions. I'll give you a snippet in JavaCUP:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that a number could be any decimal or integer, but here I state it as that pattern, supposing my language only deals with integers (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax, which is more or less like:
expression ::= variable "=" sentence;
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and so on.
If you've never taken a compilers class, this may seem very difficult to follow. There are also many restrictions on the grammar to avoid infinite loops in the parser, depending on which kind you're using (LR or LL).
Anyway, the real answer to your question is: you can't do what you want with regex alone; you'll need more concepts.
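To give a feel for what "writing a parser" means here, below is a small hand-written recursive-descent sketch in Python for the assignment grammar above. It is not JavaCUP output and not a Sublime Text syntax definition. The scanner patterns are the ones stated above; the Parser class, the parse_assignment helper and the tuple-shaped result are assumptions made for the example, and function calls such as func(foo, bar) are left out. Because an LL-style parser would loop forever on a left-recursive rule like sentence ::= sentence "+" sentence, the sketch uses loops over term and factor instead.

```python
import re

# Scanner: 'number' and 'variable' follow the patterns given above.
TOKEN = re.compile(
    r"\s*(?:(?P<NUMBER>[0-9]+)|(?P<VARIABLE>[_a-zA-Z][_a-zA-Z]*)|(?P<OP>[=+\-*/()]))"
)


def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def assignment(self):            # expression ::= variable "=" sentence
        name = self.take("VARIABLE")
        self.take("OP", "=")
        tree = ("=", name, self.sentence())
        if self.peek()[0] is not None:
            raise SyntaxError(f"unexpected trailing input {self.peek()[1]!r}")
        return tree

    def sentence(self):              # term (("+" | "-") term)*
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            node = (self.take("OP"), node, self.term())
        return node

    def term(self):                  # factor (("*" | "/") factor)*
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            node = (self.take("OP"), node, self.factor())
        return node

    def factor(self):                # number | variable | "(" sentence ")"
        kind, _ = self.peek()
        if kind == "NUMBER":
            return int(self.take("NUMBER"))
        if kind == "VARIABLE":
            return self.take("VARIABLE")
        self.take("OP", "(")
        node = self.sentence()
        self.take("OP", ")")
        return node


def parse_assignment(text):
    return Parser(tokenize(text)).assignment()


print(parse_assignment("myvar = foo + 2 * baz"))
```

The nesting of the result reflects operator precedence and grouping, which is exactly the kind of structure a single regex cannot recover.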

How do I parse this correctly with spirit?

My situation: I'm new to Spirit, I have to use VC6 and am thus using Spirit 1.6.4.
I have a line that looks like this:
//The Description;DESCRIPTION;;
I want to put the text DESCRIPTION in a string if the line starts with //The Description;.
I have something that works, but it doesn't look that elegant to me:
vector<char> vDescription; // std::string doesn't work due to missing ::clear() in VC6's STL implementation
if(parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+~ch_p(';'))[assign(vDescription)]
),
// End grammar
space_p).hit)
{
const string desc(vDescription.begin(), vDescription.end());
}
I would much rather assign all printable characters up to the next ';', but the following won't work because parse(...).hit == false
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+print_p)[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)
How do I make it hit?
You might try using confix_p:
confix_p(as_lower_d["//the description;"],
(+print_p)[assign(vDescription)],
ch_p(';')
)
It should be equivalent to Fred's response.
Your code fails because print_p is greedy. The +print_p parser will consume characters until it encounters the end of the input or a non-printable character. Semicolon is printable, so print_p claims it. Your input gets exhausted, the variable is assigned, and the match fails: there's nothing left for the last semicolon of your parser to match.
Fred's answer constructs a new parser, (print_p - ';'), which matches everything print_p does, except for semicolons. "Match everything except X, and then match X" is a common pattern, so confix_p is provided as a shortcut for constructing that kind of parser. The documentation suggests using it for parsing C- or Pascal-style comments, but that's not required.
For your code to work, Spirit would need to recognize that the greedy print_p matched too much and then backtrack to allow matching less. But although Spirit will backtrack, it won't backtrack to the "middle" of what a sub-parser would otherwise greedily match. It will backtrack to the next "choice point," but your grammar doesn't have any. See Exhaustive backtracking and greedy RD in the Spirit documentation.
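The behaviour can be imitated with a few toy parser combinators in Python. This is only a model of the idea and not Spirit's implementation: ch, plus, seq and diff are invented helpers, the skipper (space_p) and semantic actions are left out, and diff is simplified to "fail wherever the second parser matches". Each parser takes the text and a position and returns the new position, or None on failure.

```python
def ch(c):
    """Match the single character c."""
    def parse(text, pos):
        return pos + 1 if text.startswith(c, pos) else None
    return parse


def print_p(text, pos):
    """Match any one printable character (a semicolon is printable too)."""
    return pos + 1 if pos < len(text) and text[pos].isprintable() else None


def plus(p):
    """One or more repetitions of p: greedy, never gives characters back."""
    def parse(text, pos):
        end = p(text, pos)
        if end is None:
            return None
        while True:
            nxt = p(text, end)
            if nxt is None:
                return end
            end = nxt
    return parse


def seq(*parsers):
    """Run the parsers one after another; fail as soon as one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def diff(a, b):
    """Simplified a - b: match a, but only where b does not match."""
    def parse(text, pos):
        return None if b(text, pos) is not None else a(text, pos)
    return parse


line = "DESCRIPTION;;"
greedy = seq(plus(print_p), ch(";"))
fixed = seq(plus(diff(print_p, ch(";"))), ch(";"))
print(greedy(line, 0))  # +print_p eats both semicolons, nothing is left for ';'
print(fixed(line, 0))   # stops in front of the first ';', which then matches
```

The greedy version fails because plus has no choice point to return to, while the version built with diff stops on its own before the semicolon.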
You're not getting a hit because ';' is matched by print_p. Try this:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+(print_p-';'))[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)