I am trying to write an ANTLR4 grammar for a custom language whose lexer rules originally contained the following:
PLUS : '+' ;
MINUS : '-' ;
NUMBER: ('+'|'-')? [0-9]+ ;
COMMENT : '/*' (COMMENT|~'*'|('*' ~'/'))* '*/' -> skip ;
WS : (' ' | '\t' | '\n') -> skip ;
The parser grammar contains, among other things, an arithmetic expression evaluator, and what I found is that these lexer rules failed to parse the input '2-2' correctly: it should come out as NUMBER MINUS NUMBER, but instead just returned two NUMBER tokens. I therefore broke the unary + and - applications out into its own parser rule, as follows:
literal_number : NUMBER
| '-' NUMBER
| '+' NUMBER ;
And defined NUMBER simply as:
NUMBER: [0-9]+ ;
However, with this arrangement, the literal_number parser rule is activated even when there is whitespace or a comment between the plus or minus token and the number itself. That should not be valid in parser contexts where I expect to see only an integer constant (which is actually everywhere other than when parsing arithmetic expressions). I already have another parser rule elsewhere in my grammar that handles unary negation, so I do not need to replicate that in the literal_number rule anyway. All I want is for the literal_number rule to match only those places in the text where a real integer constant appears.
How can I do this? I have already looked at questions on stackoverflow pertaining to rules that are sensitive to whitespace, but I have not been able to figure out how to apply any of those solutions to my problem.
I'm not sure whether this matters for my question, but my target language is C++, although I expect I could still generalize from a Java-specific example if one is offered.
EDIT:
The response that I've seen so far highlights an ambiguity in my original question. In my defense, I had not wanted to complicate the question with information that I did not immediately see as relevant, but in light of the response, I can now clearly see that it is. I can only apologize for this initial oversight.
In addition to the literal_number rule, I also have the following rule for expressions, which, in particular, has a rule allowing for negation.
expression : ID # lookUpValue
| literal_number # number
| MINUS expression # negate
| expression (STAR|SLASH) expression # multiply
| expression (PLUS|MINUS) expression # add
;
So to that end, the expression 2-2 should evaluate as literal_number (2) MINUS literal_number (2), 2--2 should evaluate as literal_number (2) MINUS literal_number (-2), while 2-- 2 should evaluate as literal_number (2) MINUS MINUS literal_number (2).
So basically, as I said originally, I only want the literal_number rule to be used when the NUMBER stands alone, or when the MINUS and the NUMBER are side by side with no ignored tokens between them; but I cannot just make ('+'|'-')? [0-9]+ the lexical rule for NUMBER without causing the problem I had in the first place.
I'm doing Advent of Code day 9:
You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.
Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.
In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.
Of course, this screams out for a Perl 6 Grammar...
grammar Stream
{
rule TOP { ^ <group> $ }
rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }
token garbignore { '!' . }
token garbchar { <-[ !> ]> }
}
This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:
say Stream.parse('{<aa>}');
gives Nil.
Grammar::Tracer is no help:
TOP
| group
| | group
| | * FAIL
| | garbage
| | | garbchar
| | | * MATCH "a"
| | * FAIL
| * FAIL
* FAIL
Nil
Multiple garbignores are no problem:
say Stream.parse('{<!!a!a>}');
gives:
「{<!!a!a>}」
group => 「{<!!a!a>}」
garbage => 「<!!a!a>」
garbignore => 「!!」
garbchar => 「a」
garbignore => 「!a」
Any ideas?
UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice: use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
My original answer below explored why the rules didn't work. I'll leave it in for now.
TL;DR <garbchar> | contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek.
As your original question allowed, this isn't a bug, it's just that your mental model is off.
Your answer correctly identifies the issue: tokenization.
So what we're left with is your follow up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:
why ... my second example ... goes wrong with two garbchars in a row:
'{<aa>}'
Simplifying, the issue is how to tokenize this:
aa
The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.
You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.
I'll provide two redundant paths that I hope might lead folk to the correct mental model:
For those who prefer diving straight into nitty gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.
The rest of this SO answer provides a high level discussion that complements the low level explanation in my reddit comment.
Excerpting from the "Obstacles" section of the wikipedia page on tokenization, and interleaving the excerpts with P6 specific discussion:
Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:
Punctuation and whitespace may or may not be included in the resulting list of tokens.
In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.
All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
By default, the Perl 6 design embodies an equivalent of these two heuristics.
The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.
A partial answer to my own question: Change all the rules to tokens and it works.
It makes sense, because the difference is :sigspace, which we don't need or want here. What I don't understand, though, is why it did work for some input, like my second example.
The resulting code is here, if you're interested.
I have this problem: I must verify the correctness of many mathematical expressions, in particular checking for consecutive operators + - * /.
For example:
6+(69-9)+3
is OK, while
6++8-(52--*3)
is not.
I am not using the <regex> library since it is only available from C++11 onwards.
Is there an alternative method to solve this problem? Thanks.
You can use a regular expression to verify everything about a mathematical expression except the check that parentheses are balanced. That is, the regular expression will only ensure that open and close parentheses appear at the point in the expression they should appear, but not their correct relationship with other parentheses.
So you could check both that the expression matches a regex and that the parentheses are balanced. Checking for balanced parentheses is really simple if there is only one type of parenthesis:
bool check_balanced(const char* expr, char open, char close) {
int parens = 0;
for (const char* p = expr; *p; ++p) {
if (*p == open) ++parens; // one more unclosed open paren
else if (*p == close && parens-- == 0) return false; // close with no matching open
}
return parens == 0; // every open paren must have been closed
}
To get the regular expression, note that mathematical expressions without function calls can be summarized as:
BEFORE* VALUE AFTER* (BETWEEN BEFORE* VALUE AFTER*)*
where:
BEFORE is sub-regex which matches an open parenthesis or a prefix unary operator (if you have prefix unary operators; the question is not clear).
AFTER is a sub-regex which matches a close parenthesis or, in the case that you have them, a postfix unary operator.
BETWEEN is a sub-regex which matches a binary operator.
VALUE is a sub-regex which matches a value.
For example, for ordinary four-operator arithmetic on integers you would have:
BEFORE: [-+(]
AFTER: [)]
BETWEEN: [-+*/]
VALUE: [[:digit:]]+
and putting all that together you might end up with the regex:
^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$
If you have a Posix C library, you will have the <regex.h> header, which gives you regcomp and regexec. There's sample code at the bottom of the referenced page in the Posix standard, so I won't bother repeating it here. Make sure you supply REG_EXTENDED in the last argument to regcomp; REG_EXTENDED|REG_NOSUB, as in the example code, is probably even better since you don't need captures and not asking for them will speed things up.
You can loop over each char in your expression.
If you encounter a +, you can check whether it is followed by another +, /, *, and so on.
Additionally you can group operators together to prevent code duplication.
for (size_t i = 0; expression[i] != '\0'; ++i) {
switch (expression[i]) {
case '+':
case '-':
case '*':
case '/':
// Do your syntax checks here, e.g. inspect expression[i + 1]
break;
}
}
Well, in the general case, you can't solve this with a regex. The "language" of arithmetic expressions can't be described by a regular grammar; it's a context-free language. So if what you want is to check the correctness of an arbitrary mathematical expression, then you'll have to write a parser.
However, if you only need to make sure that your string doesn't have consecutive +-*/ operators then regex is enough. You can write something like this [-+*/]{2,}. It will match substrings with 2 or more consecutive symbols from +-*/ set.
Or something like this ([-+*/]\s*){2,} if you also want to handle situations with spaces like 5+ - * 123
Well, you will have to define some rules if possible. It's not possible to completely parse mathematical language with a regex, but given some lenience it may work.
The problem is that often the way we write math can be interpreted as an error, but it's really not. For instance:
5--3 can be 5-(-3)
So in this case, you have two choices:
Ensure that the input is parenthesized well enough that no two operators meet
If you find something like --, treat it as a special case and investigate it further
If the formulas are in fact in your favor (have well defined parenthesis), then you can just check for repeats. For instance:
--
+-
+*
-+
etc.
If you have a match, it means you have a poorly formatted equation and you can throw it out (or whatever you want to do).
You can check for this using the following regex. You can add more constraints to the [..][..]; I'm giving you the basics here:
[+\-\*\\/][+\-\*\\/]
which will work for the following examples (and more):
6++8-(52--*3)
6+\8-(52--*3)
6+/8-(52--*3)
An alternative, probably a better one, is to just write a parser. It will process the equation step by step to check its validity. A parser will, if well written, be 100% accurate. A regex approach leaves you with a lot of constraints.
There is no real way to do this with a regex because mathematical expressions inherently aren't regular. Heck, even balancing parens isn't regular. Typically this will be done with a parser.
A basic approach to writing a recursive-descent parser (IMO the most basic parser to write) is:
Write a grammar for a mathematical expression. (These can be found online)
Tokenize the input into lexemes. (This will be done with a regex, typically).
Match the expressions based on the next lexeme you see.
Recurse based on your grammar
A quick Google search can provide many example recursive-descent parsers written in C++.
I defined some regular expressions and rules in flex. Now I want to write a regular expression that does the following: if there is input that does not match any of the rules I defined, I want to simply print that input out. You may think that since it is not matched by any of the rules, it will automatically be printed out, but that is not the case. Consider my example; I defined the following regular expressions:
[a-zA-Z_]+[a-zA-Z0-9_]* printf("%d tIDENT (%s)\n",lineNum,yytext);
This rule defines an identifier, an identifier can start with underscore or with a letter, and it is a combination of letters, numbers and underscore.
[0-9]+ printf("%d tPOSINT (%s)\n",lineNum,yytext);
This rule recognizes the positive integers.
Assume these are my only rules, and the input is 2a3. This is not an identifier and not an integer. But my output takes 2 as an integer, and then takes a3 as an identifier. Since 2a3 does not match any of the rules, I want to print it out as it is. How can I do this?
You may think that since it is not matched with any of the rules, it will automatically be printed out
No, I don't think that. If I remember correctly, it prints an error saying something like 'flex jammed' if the input doesn't match any rules. But in this case the input does match your rules, so it doesn't happen. If it isn't supposed to match, change your rules accordingly. But I would leave it. 2 followed by a3 won't be legal syntax anyway, so let the parser deal with it.
To avoid the jam message and print out the non-match, you need to add a final rule like this:
. { printf("%s", yytext); }  /* or whatever you want */
You also need to add a white space rule.
How can I force a shift/reduce conflict to be resolved by the GLR method?
Suppose I want the parser to resolve the conflict between the right-shift operator and two closing angle brackets of template arguments by itself. I make the lexer pass the two consecutive ">" symbols as separate tokens, without merging them into a single ">>" token. Then I put these rules into the grammar:
operator_name:
"operator" ">"
| "operator" ">" ">"
;
I want this to be a shift/reduce conflict. If I have the token declaration for ">" with left associativity, this will not be a conflict. So I would have to remove the token precedence/associativity declaration, but that results in many other conflicts that I don't want to solve manually by specifying contextual precedence for each conflicting rule. So, is there a way to force the shift/reduce conflict while keeping the token declared?
I believe that using context-dependent precedence on the rules for operator_name will work.
The C++ grammar as specified by the updated standard actually modifies the grammar to accept the >> token as closing two open template declarations. I'd recommend following it to get standard behaviour. For example, you must be careful that "x > > y" is not parsed as "x >> y", and you must also ensure that "foo<bar<2 >> 1>>" is invalid, while "foo<bar<(2 >> 1)>>" is valid.
I worked in Yacc (similar to Bison), with a similar scenario.
Standard grammars are, sometimes, called "parsing directed by syntax".
This case is, sometimes, called something like "parsing directed by semantics".
Example:
...
// shift operator example
if ((x >> 2) == 0)
...
// consecutive template closing tag example
List<String, List<String>> MyList =
...
Let's remember that our minds work like a compiler. The human mind can compile this, but the previous grammars can't. Let's see how a human mind would compile this code.
As you already know, the "x" before the two consecutive ">" tokens indicates an expression or lvalue. The mind thinks: "two consecutive greater-than symbols after an expression should become a single shift operator token".
And for the "String" token: "two consecutive greater-than symbols after a type identifier should become two consecutive template closing tag tokens".
I think this case cannot be handled by the usual operator precedence, shift or reduce, or just grammars, but using ( "hacking" ) some functions provided by the parser itself.
I don't see an error in your example grammar rule. The "operator" keyword avoids confusion between the two cases you mention. The parts that should concern you are the grammars where the shift operator is used and where consecutive template closing tags are used.
operator_expr_example:
lvalue "<<" lvalue |
lvalue ">>" lvalue |
lvalue "&&" lvalue
;
template_params:
identifier |
template_declaration_example |
array_declaration |
other_type_declaration
;
template_declaration_example:
identifier "<" template_params ">"
;
Cheers.
In "modern compiler implementation in Java" by Andrew Appel he claims in an exercise that:
Lex has a lookahead operator / so that the regular expression abc/def matches abc only when followed by def (but def is not part of the matched string, and will be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975] uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba with input aba, matching ab where it should match a). Flex [Paxson 1995] uses a better mechanism that works correctly for (a|ab)/ba but fails (with a warning message) on zx*/xy*. Design a better lookahead mechanism.
Does anyone know the solution to what he is describing?
"Does not work how I think it should" and "incorrect" are not always the same thing. Given the input
aba
and the pattern
(ab|a)/ab
it makes a certain amount of sense for the (ab|a) to match greedily, and then for the /ab constraint to be applied separately. You're thinking that it should work like this regular expression:
(ab|a)(ab)
with the constraint that the part matched by (ab) is not consumed. That's probably better because it removes some limitations, but since there weren't any external requirements for what lex should do at the time it was written, you cannot call either behavior correct or incorrect.
The naive way has the merit that adding a trailing context doesn't change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations/surprises:
{IDENT} /* original code */
{IDENT}/ab /* ident, only when followed by ab */
Oops, it won't work because "ab" is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it's a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)
How about the other way? That could have surprises also:
{IDENT}/ab /* input is bracadabra:123 */
Say the user wants this not to match because bracadabra is not an identifier followed by (or ending in) ab. But {IDENT}/ab will match bracad, leaving abra:123 in the input.
A user could have expectations which are foiled no matter how you pin down the semantics.
lex is now standardized by The Single Unix specification, which says this:
r/x
The regular expression r shall be matched only if it is followed by an occurrence of regular expression x ( x is the instance of trailing context, further defined below). The token returned in yytext shall only match r. If the trailing portion of r matches the beginning of x, the result is unspecified. The r expression cannot include further trailing context or the '$' (match-end-of-line) operator; x cannot include the '^' (match-beginning-of-line) operator, nor trailing context, nor the '$' operator. That is, only one occurrence of trailing context is allowed in a lex regular expression, and the '^' operator only can be used at the beginning of such an expression.
So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.
The spec also has discussion about this very issue (you are in luck):
The following examples clarify the differences between lex regular expressions and regular expressions appearing elsewhere in this volume of IEEE Std 1003.1-2001. For regular expressions of the form "r/x", the string matching r is always returned; confusion may arise when the beginning of x matches the trailing portion of r. For example, given the regular expression "a*b/cc" and the input "aaabcc", yytext would contain the string "aaab" on this match. But given the regular expression "x*/xy" and the input "xxxy", the token xxx, not xx, is returned by some implementations because xxx matches "x*".
In the rule "ab*/bc", the "b*" at the end of r extends r's match into the beginning of the trailing context, so the result is unspecified. If this rule were "ab/bc", however, the rule matches the text "ab" when it is followed by the text "bc". In this latter case, the matching of r cannot extend into the beginning of x, so the result is specified.
As you can see there are some limitations in this feature.
Unspecified behavior means that there are some choices about what the behavior should be, none of which is more correct than the others (and don't write patterns like that if you want your lex program to be portable).