Context-free grammar to represent regular expressions

I'm trying to write a context-free grammar to represent simple regular expressions. The symbols I want are [0-9], [a-z] and [A-Z]; the operators are "|", "()" and "." for concatenation. For repetition I only want "*" for now; later I will add "+", "?", etc. I tried this grammar in JavaCC:
void RE(): {}
{
FINAL(0) ( "." FINAL(0) | "|" FINAL(0))*
}
void FINAL(int sign): { Token t; }
{
t = <SYMBOL> {
if ( sign == 1 )
jjtThis.val = t.image + "*";
else
jjtThis.val = t.image;
}
| FINAL(1) "*"
| "(" RE() ")"
}
The problem is in the FINAL function: the line | FINAL(1) "*" gives me the error Left recursion detected: "FINAL... --> FINAL.... Putting "*" to the left of FINAL(1) resolves the problem, but that is not what I want.
I have already tried reading the Wikipedia article on removing left recursion, but I really don't know how to do it. Can someone help?

The following takes care of the left recursion
RE --> FINAL ("." FINAL | "|" FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
However, that won't give . precedence over |. For that, you can do the following:
RE --> TERM ("|" TERM)*
TERM --> FINAL ("." FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
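As a rough illustration of how the precedence-aware grammar above maps onto code, here is a minimal recursive-descent sketch in Python (not JavaCC). The class name RegexParser and the tuple-based tree shape are assumptions made for this example. Each method mirrors one rule, and every ( ... )* repetition becomes a while loop, which is why no left recursion arises.

```python
class RegexParser:
    """Hypothetical recursive-descent parser; returns nested tuples as the tree."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r} at position {self.pos}")
        self.pos += 1

    def parse(self):
        tree = self.re()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected {self.peek()!r} at position {self.pos}")
        return tree

    def re(self):  # RE --> TERM ("|" TERM)*
        node = self.term()
        while self.peek() == "|":
            self.eat("|")
            node = ("|", node, self.term())
        return node

    def term(self):  # TERM --> FINAL ("." FINAL)*
        node = self.final()
        while self.peek() == ".":
            self.eat(".")
            node = (".", node, self.final())
        return node

    def final(self):  # FINAL --> PRIMARY ("*")*
        node = self.primary()
        while self.peek() == "*":
            self.eat("*")
            node = ("*", node)
        return node

    def primary(self):  # PRIMARY --> <SYMBOL> | "(" RE ")"
        ch = self.peek()
        if ch == "(":
            self.eat("(")
            node = self.re()
            self.eat(")")
            return node
        # <SYMBOL> is assumed to be a single character from [0-9a-zA-Z]
        if ch is not None and ch.isascii() and ch.isalnum():
            self.pos += 1
            return ch
        raise SyntaxError(f"expected a symbol or '(' at position {self.pos}")


# "." binds tighter than "|", so a.b ends up below the "|" node
print(RegexParser("a.b|c").parse())
```

Because term() is called from inside re(), concatenation is grouped before alternation, which is the precedence the four-rule grammar encodes.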
The general rule is
A --> A b | c | d | ...
can be transformed to
A --> B b*
B --> c | d | ...
where B is a new nonterminal.
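The general rule can also be applied mechanically. The following Python sketch is only an illustration: the function name remove_left_recursion and the list-of-symbols representation of a rule are assumptions, and it handles just the immediate left recursion described above (alternatives that start with the rule's own name).

```python
def remove_left_recursion(name, alternatives, new_name):
    """Rewrite  A --> A b | c | d  as  A --> B (b)*  and  B --> c | d.

    `alternatives` is a list of alternatives, each a list of symbol strings.
    Returns the rewritten rules as strings.
    """
    # Left-recursive alternatives, with the leading A stripped off
    tails = [" ".join(alt[1:]) for alt in alternatives if alt[:1] == [name]]
    # All other alternatives move to the new nonterminal B
    bases = [" ".join(alt) for alt in alternatives if alt[:1] != [name]]
    if not tails:
        return [f"{name} --> {' | '.join(bases)}"]
    if not bases:
        raise ValueError("a rule with only left-recursive alternatives derives nothing")
    return [
        f"{name} --> {new_name} ({' | '.join(tails)})*",
        f"{new_name} --> {' | '.join(bases)}",
    ]


# The FINAL rule from the question, with PRIMARY as the new nonterminal
final_rule = [["<SYMBOL>"], ["FINAL", '"*"'], ['"("', "RE", '")"']]
for line in remove_left_recursion("FINAL", final_rule, "PRIMARY"):
    print(line)
```

Applied to the FINAL rule from the question, this yields the same FINAL and PRIMARY rules as in the grammars above.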

Related

Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.
Here's a fragment from my .g4 file:
assignStatement
: VariableName '=' expression ';'
;
expression
: (value | VariableName)
| bin_op='(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
| expression MUL_DIV_MOD expression
| expression ADD_SUB expression
;
VariableName
: ( [a-z] [A-Za-z0-9_]* )
;
// Pre or post increment/decrement
UNARY_PRE_OR_POST
: '++' | '--'
;
// multiply, divide, modulus
MUL_DIV_MOD
: '*' | '/' | '%'
;
// Add, subtract
ADD_SUB
: '+' | '-'
;
And my sample input:
myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;
The first sample line, myInt = 10 + 5;, produces this error:
line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}
I get similar issues with each of the lines.
If I make one change, a whole bunch of errors disappear:
| expression ADD_SUB expression
change it to this:
| expression ('+' | '-') expression
I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parentheses.
I tried:
ADD_SUB: [+-];
What's annoying is the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.
For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.
So... what is the proper way to do this:
Do not use literal tokens inside parser rules (unless you know what you're doing).
For the grammar:
expression
: '+' expression
| ...
;
ADD_SUB
: '+' | '-'
;
ANTLR will create a lexer rule for the literal '+', making the grammar really look like this:
expression
: T__0 expression
| ...
;
T__0 : '+';
ADD_SUB
: '+' | '-'
;
causing the input + to never become an ADD_SUB token, because T__0 will always match it first. That is simply how the lexer operates: it tries to match as many characters as possible for every lexer rule, and when two (or more) rules match the same number of characters, the one defined first "wins".
Do something like this instead:
expression
: value
| '(' expression ')'
| expression UNARY_PRE_OR_POST
| (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
| expression (MUL | DIV | MOD) expression
| expression (ADD | SUB) expression
;
value
: ...
| VariableName
;
VariableName
: [a-z] [A-Za-z0-9_]*
;
UNARY_PRE_OR_POST
: '++' | '--'
;
MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';
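The lexer behaviour described in this answer (longest match wins, ties go to the rule defined first) can be sketched in a few lines of Python. This is not ANTLR's implementation; next_token and the RULES table are hypothetical names, and each rule is reduced to a list of literal strings so the tie-breaking is easy to see.

```python
def next_token(rules, text, pos):
    """Return (rule_name, lexeme) for the longest literal matching at `pos`.

    `rules` is an ordered list of (name, literals). On a tie in length the
    earlier rule is kept, mirroring "the one defined first wins".
    """
    best = None
    for name, literals in rules:
        for literal in literals:
            if text.startswith(literal, pos):
                # Strictly longer matches replace the current best; equal ones do not
                if best is None or len(literal) > len(best[1]):
                    best = (name, literal)
    return best


# Order mirrors the simplified grammar above: the implicit token comes first
RULES = [
    ("T__0", ["+"]),
    ("UNARY_PRE_OR_POST", ["++", "--"]),
    ("ADD_SUB", ["+", "-"]),
]

print(next_token(RULES, "+", 0))    # ('T__0', '+')
print(next_token(RULES, "++x", 0))  # ('UNARY_PRE_OR_POST', '++')
```

This also suggests why the increment lines in the question parsed without errors: '++' is longer than '+', so the implicit token never competes with UNARY_PRE_OR_POST.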

How to remove ambiguity in EBNF Instaparse grammar

How can I prevent the "," literal in the structure rule from being parsed as an operator in the following EBNF grammar for Instaparse?
Grammar:
structure = atom <"("> term ("," term)* <")">
term = atom | number | structure | variable | "(" term ")" | term operator term
operator = "," | ";" | "\\=" | "=="
Using the comma both as a separator and as an operator, as you do, makes the comma context-sensitive, which EBNF on its own can't deal with.
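One common workaround, which is not part of the answer above, is to parse argument positions with a restricted kind of term that does not accept a bare "," operator, so that a comma operator inside an argument has to be wrapped in parentheses. The Python sketch below illustrates that idea under several assumptions: it is a hand-written parser rather than Instaparse, all names in it are hypothetical, atoms, variables and numbers are lumped together, and every operator is treated as left-associative with equal precedence.

```python
import re

TOKEN = re.compile(r"\s*(==|\\=|[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[(),;])")
OPERATORS = {",", ";", "\\=", "=="}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expect(tok):
        if advance() != tok:
            raise SyntaxError(f"expected {tok!r}")

    def term(allow_comma):
        # With allow_comma=False a "," ends the term instead of acting as an operator
        node = primary()
        while peek() in OPERATORS and (allow_comma or peek() != ","):
            op = advance()
            node = (op, node, primary())
        return node

    def primary():
        tok = advance()
        if tok == "(":
            node = term(allow_comma=True)  # parentheses re-enable the comma operator
            expect(")")
            return node
        if tok is None or tok in OPERATORS or tok == ")":
            raise SyntaxError(f"unexpected token {tok!r}")
        if peek() == "(" and tok[0].isalpha():
            advance()
            args = [term(allow_comma=False)]
            while peek() == ",":  # here "," is purely a separator
                advance()
                args.append(term(allow_comma=False))
            expect(")")
            return ("struct", tok, args)
        return tok

    tree = term(allow_comma=True)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("f(a, b)"))       # ('struct', 'f', ['a', 'b'])
print(parse("f((a, b), c)"))  # ('struct', 'f', [(',', 'a', 'b'), 'c'])
```

In grammar terms this corresponds to having two nonterminals, one for general terms and one for argument terms, instead of a single term rule.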

Accept brackets correctly in bisonc++

I've tried to write a basic syntax checker using bisonc++
The rules are:
expression -> OPEN_BRACKET expression CLOSE_BRACKET
expression -> expression operator expression
operator -> PLUS
operator -> MINUS
If I try to run the compiled code, I get an error at this line:
(a+b)-(c+d)
The first rule is applied, the leftmost and the rightmost brackets are the OPEN_BRACKET and the CLOSE_BRACKET. The remaining expression is: a+b)-(c+d
How is it possible to prevent this behaviour? Is it possible to count the open and closed brackets?
Edit
The expression grammar:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
| operator
{
//
}
| VARIABLE
{
//
}
;
operator:
expression PLUS expression
{
//
}
| expression MINUS expression
{
//
}
;
Edit2
The lexer
CHAR [a-z]
WS [ \t\n]
%%
{CHAR}+ return Parser::VARIABLE;
"+" return Parser::PLUS;
"-" return Parser::MINUS;
"(" return Parser::OPEN_BRACKET;
")" return Parser::CLOSE_BRACKET;
This is not a normal expression grammar. Try the normal one.
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor // unary minus
| primary '^' factor // exponentiation, right-associative
;
primary
: identifier
| literal
| '(' expression ')'
;
Note also the above method of indenting and aligning, and that you only have to return yytext[0] from the lexer for single special characters: you don't need special token names, and it's more readable without them:
CHAR [a-zA-Z]
DIGIT [0-9]
WHITESPACE [ \t\r\n]
%%
{CHAR}+ { return Parser::VARIABLE; }
{DIGIT}+ { return Parser::LITERAL; }
{WHITESPACE}+ ;
. { return yytext[0]; }
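To see how this layered grammar handles the input from the question, here is a small hand-written recursive-descent sketch in Python. It is an illustration only, not what bisonc++ generates: bisonc++ handles the left-recursive rules directly, whereas this sketch expresses them as loops. The function names and the tuple-based tree are assumptions.

```python
import re

# Identifiers, integer literals, or any single non-space character
TOKEN = re.compile(r"\s*([A-Za-z]+|[0-9]+|\S)")


def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():  # expression: term | expression ('+' | '-') term
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # left-associative
        return node

    def term():  # term: factor | term ('*' | '/' | '%') factor
        node = factor()
        while peek() in ("*", "/", "%"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():  # factor: primary | '-' factor | primary '^' factor
        if peek() == "-":
            advance()
            return ("neg", factor())
        node = primary()
        if peek() == "^":
            advance()
            return ("^", node, factor())  # right-associative
        return node

    def primary():  # primary: identifier | literal | '(' expression ')'
        tok = advance()
        if tok == "(":
            node = expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is not None and re.fullmatch(r"[A-Za-z]+|[0-9]+", tok):
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("(a+b)-(c+d)"))  # ('-', ('+', 'a', 'b'), ('+', 'c', 'd'))
```

Each opening bracket is consumed inside primary together with its own closing bracket, so no explicit bracket counting is needed: the nesting of the calls does it.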
Your operator rule does not look good.
Try experimenting with:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
|
expression operator expression
{
//
}
|
VARIABLE
{
//
}
;
operator:
PLUS
{
//
}
|
MINUS
{
//
}
;
As your pseudo code actually suggests...

Scala Regex for less than equal to operator (<=)

I am trying to parse an expression with (<, <=, >=, >). All but <= work just fine. Can someone help me figure out what the issue could be?
Code:
object MyTestParser extends RegexParsers {
override def skipWhitespace = true
private val expression: Parser[String] = """[a-zA-Z0-9\.]+""".r
val operation: Parser[Try[Boolean]] =
expression ~ ("<" | "<=" | ">=" | ">") ~ expression ^^ {
case v1 ~ op ~ v2 => for {
a <- Try(v1.toDouble)
b <- Try(v2.toDouble)
} yield op match {
case "<" => a < b
case "<=" => a <= b
case ">" => a > b
case ">=" => a >= b
}
}
}
Test:
"MyTestParser" should {
"successfully parse <= condition" in {
val parser = MyTestParser.parseAll(MyTestParser.operation, "10 <= 20")
val result = parser match {
case MyTestParser.Success(s, _) => s.get
case MyTestParser.Failure(e, _) =>
println(s"Parsing failed with error: $e")
false
case MyTestParser.Error(e, _) =>
println(s"Parsing error: $e")
false
}
result === true
}
"successfully parse >= condition" in {
val result = MyTestParser.parseAll(MyTestParser.operation, "50 >= 20").get
result === scala.util.Success(true)
}
}
Error for <= condition:
Parsing failed with error: string matching regex `[a-zA-Z0-9\.]+' expected but `=' found
You need to change the order of the alternatives so that the longest options are checked first.
expression ~ ( "<=" | ">=" | ">" | "<") ~ expression ^^ {
If the shortest alternative matches first, others are not considered at all.
Also note that a period does not have to be escaped inside a character class, this will do:
"""[a-zA-Z0-9.]+""".r
Your problem is that the alternative "<" matches the first character of <=, so the parser moves on to trying the expression on the remaining "= 20", which fails. If you change the order so that "<=" comes first, that will be matched instead, and you will get the desired result.
@Prateek: it does not work because the | alternation works just like a boolean OR. It does not try the remaining alternatives once one of the patterns in the or-chain has matched at that point.
So, when you use | between patterns and one pattern is a prefix of another, you have to place the longer one first.
As a general rule: order the patterns from the longest to the shortest.
Changing the relevant line like this makes it work:
// '>=' already came before '>' in the original, which is why '>=' worked
expression ~ ("<=" | "<" | ">=" | ">") ~ expression ^^ {
Or, if you want to follow the general rule:
expression ~ ("<=" | ">=" | "<" | ">") ~ expression ^^ {
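The effect of the ordering can be reproduced outside Scala with a tiny ordered-choice combinator. The Python sketch below is a simplified model, not the scala-parser-combinators API: the helper names token, ordered_choice, sequence and parses are all hypothetical. A parser here takes (text, pos) and returns the new position, or None on failure.

```python
import re


def token(pattern):
    """Skip leading whitespace, then match `pattern` at the current position."""
    compiled = re.compile(r"\s*(?:" + pattern + ")")

    def parse(text, pos):
        match = compiled.match(text, pos)
        return match.end() if match else None

    return parse


def ordered_choice(*parsers):
    """Commit to the first alternative that succeeds; later ones are never tried."""

    def parse(text, pos):
        for parser in parsers:
            end = parser(text, pos)
            if end is not None:
                return end
        return None

    return parse


def sequence(*parsers):
    """Run the parsers one after another; fail as soon as one of them fails."""

    def parse(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos

    return parse


def parses(operators, text):
    """True if `text` is fully consumed by: operand operator operand."""
    operand = token(r"[a-zA-Z0-9.]+")
    operator = ordered_choice(*(token(re.escape(op)) for op in operators))
    return sequence(operand, operator, operand)(text, 0) == len(text)


print(parses(["<", "<=", ">=", ">"], "10 <= 20"))  # False: "<" wins, then "=" is left over
print(parses(["<=", ">=", "<", ">"], "10 <= 20"))  # True: "<=" is tried first
```

With the original order, "<" succeeds and the choice is never revisited, so the second operand is attempted at "= 20" and fails, which matches the error message in the question.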

Regex BNF Grammar

Is there any BNF grammar for regular expression?
You can see one for Perl regexp (displayed a little more in detail here, as posted by edg)
To post them on-site:
CMPT 384 Lecture Notes, Robert D. Cameron, November 29 - December 1, 1999
BNF Grammar of Regular Expressions
Following the precedence rules given previously, a BNF grammar for Perl-style regular expressions can be constructed as follows.
<RE> ::= <union> | <simple-RE>
<union> ::= <RE> "|" <simple-RE>
<simple-RE> ::= <concatenation> | <basic-RE>
<concatenation> ::= <simple-RE> <basic-RE>
<basic-RE> ::= <star> | <plus> | <elementary-RE>
<star> ::= <elementary-RE> "*"
<plus> ::= <elementary-RE> "+"
<elementary-RE> ::= <group> | <any> | <eos> | <char> | <set>
<group> ::= "(" <RE> ")"
<any> ::= "."
<eos> ::= "$"
<char> ::= any non metacharacter | "\" metacharacter
<set> ::= <positive-set> | <negative-set>
<positive-set> ::= "[" <set-items> "]"
<negative-set> ::= "[^" <set-items> "]"
<set-items> ::= <set-item> | <set-item> <set-items>
<set-item> ::= <range> | <char>
<range> ::= <char> "-" <char>
via VonC.
--- Knud van Eeden --- 21 October 2003 - 03:22 am --------------------
PERL:Search/Replace:Regular Expression:Backus Naur Form:What is
possible BNF for regular expression?
expression = term
             term | expression
term = factor
       factor term
factor = atom
         atom metacharacter
atom = character
       .
       ( expression )
       [ characterclass ]
       [ ^ characterclass ]
       { min }
       { min , }
       { min , max }
characterclass = characterrange
                 characterrange characterclass
characterrange = begincharacter
                 begincharacter - endcharacter
begincharacter = character
endcharacter = character
character =
            anycharacterexceptmetacharacters
            \ anycharacterexceptspecialcharacters
metacharacter = ?
                * {=0 or more, greedy}
                *? {=0 or more, non-greedy}
                + {=1 or more, greedy}
                +? {=1 or more, non-greedy}
                ^ {=begin of line character}
                $ {=end of line character}
                $` {=the characters to the left of the match}
                $' {=the characters to the right of the match}
                $& {=the characters that are matched}
                \t {=tab character}
                \n {=newline character}
                \r {=carriage return character}
                \f {=form feed character}
                \cX {=control character CTRL-X}
                \N {=the characters in Nth tag (if on match side)}
                $N {=the characters in Nth tag (if not on match side)}
                \NNN {=octal code for character NNN}
                \b {=match a 'word' boundary}
                \B {=match not a 'word' boundary}
                \d {=a digit, [0-9]}
                \D {=not a digit, [^0-9]}
                \s {=whitespace, [ \t\n\r\f]}
                \S {=not a whitespace, [^ \t\n\r\f]}
                \w {='word' character, [a-zA-Z0-9_]}
                \W {=not a 'word' character, [^a-zA-Z0-9_]}
                \Q {=put a quote (de-meta) on characters, until \E}
                \U {=change characters to uppercase, until \E}
                \L {=change characters to lowercase, until \E}
min = integer
max = integer
integer = digit
          digit integer
anycharacter = ! " # $ % & ' ( ) * + , - . / :
               ; < = > ? @ [ \ ] ^ _ ` { | } ~
               0 1 2 3 4 5 6 7 8 9
               A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
               a b c d e f g h i j k l m n o p q r s t u v w x y z
---
[book: see also: Bucknall, Julian - the Tomes of Delphi: Algorithms
and Datastructures - p. 37 - 'Using regular expressions' -
http://www.amazon.com/exec/obidos/tg/detail/-/1556227361/qid=1065748783/sr=1-1/ref=sr_1_1/002-0122962-7851254?v=glance&s=books]
---
---
Internet: see also:
---
Compiler: Grammar: Expression: Regular: Which grammar defines set of
all regular expressions? [BNF]
http://www.faqts.com/knowledge_base/view.phtml/aid/25950/fid/1263
---
Perl Regular Expression: Quick Reference 1.05
http://www.erudil.com/preqr.pdf
---
Top: Computers: Programming: Languages: Regular Expressions: Perl
http://dmoz.org/Computers/Programming/Languages/Regular_Expressions/Perl/
---
TSE: Search/Replace:Regular Expression:Backus Naur Form:What is
possible BNF for regular expression?
http://www.faqts.com/knowledge_base/view.phtml/aid/25714/fid/1236
---
Delphi: Search: Regular expression: Create: How to create a regular
expression parser in Delphi?
http://www.faqts.com/knowledge_base/view.phtml/aid/25645/fid/175
---
Delphi: Search: Regular expression: How to add regular expression
searching to Delphi? [Systools]
http://www.faqts.com/knowledge_base/view.phtml/aid/25295/fid/175
----------------------------------------------------------------------
via Ed Guinness.
http://web.archive.org/web/20090129224504/http://faqts.com/knowledge_base/view.phtml/aid/25718/fid/200