I am in the process of creating an ANTLR4 grammar for a language where the logical operator NOT and the operators plus and minus can be unary or binary operators.
How should I define the operators in the ANTLR4 grammar so that the parser can differentiate between them?
Example:
NOT 1 is 0 (Unary Operator)
1 NOT 1 is 0 (Binary Operator)
Here is a small part of my ANTLR4 parser:
expr: expr ('%') expr #Modulo
| expr op=('*'|'/') expr #MulDiv
| expr op=('+'|'-') expr #AddSub
| NOT expr #NegOp
Here is a small part of my ANTLR4 lexer:
ADD : '+';
SUB : '-';
NOT : ([nN][oO][tT]|[~]);
To make an operator in my language both unary and binary at the same time, I needed to add the following rules to my parser:
expr (MOD) expr #Modulo
| op=(ADD|SUB) expr #UnaryPlusMinus
| expr op=(ADD|SUB) expr #AddSub
| expr op=(AND|OR|XOR|NOT) expr #LogOp
| NOT expr #NegOp
The above solution is fine for me because my language supports syntax like the following:
++5---4 (result is 1)
NOT NOT NOT 1 (result is 0)
However, it would be interesting to create parser/lexer rules where an operator (for example, minus) could be either unary or (exclusively) binary, with the parser knowing in each position whether the operator is used as a unary or binary operator. It should then reject something like 5-++--4 as an error, while still accepting ---5, which evaluates to -5.
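The restriction asked for in the last paragraph can be prototyped outside ANTLR as well. The sketch below (plain Python, single-character tokens for simplicity; the function name is hypothetical) allows a chain of unary +/- only at the very start of the expression, so ---5 parses to -5, while a unary operator appearing after a binary operator, as in 5-++--4, is rejected:

```python
# Sketch of the restricted grammar:
#   expr := ("+" | "-")* NUMBER (("+" | "-") NUMBER)*
# i.e. unary signs may only open the whole expression, and the right
# operand of a binary operator must be a plain number.
def parse(tokens):
    pos, sign = 0, 1
    while tokens[pos] in "+-":            # leading unary chain: ---5 is fine
        if tokens[pos] == "-":
            sign = -sign
        pos += 1
    value = sign * int(tokens[pos])
    pos += 1
    while pos < len(tokens):
        op, rhs = tokens[pos], tokens[pos + 1]
        if not rhs.isdigit():             # unary sign after a binary op: reject
            raise SyntaxError(f"unexpected {rhs!r} after binary {op!r}")
        value = value + int(rhs) if op == "+" else value - int(rhs)
        pos += 2
    return value

print(parse(list("---5")))    # -5
print(parse(list("5-4")))     # 1
# parse(list("5-++--4"))      # raises SyntaxError
```

In grammar terms this corresponds to making the right operand of a binary operator a plain primary with no unary prefix, while only the whole expression may start with a unary chain.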
# float_of_int -3;;
Error: This expression has type int -> float
but an expression was expected of type int
I thought function application has the highest precedence, so float_of_int -3 is equal to float_of_int (-3). Why do I need to put the parentheses explicitly there to suppress the error?
Exactly for this reason, that function application has higher precedence than infix operators, you have to add the parentheses.
In other words, function application is greedy and it will consume all terms until it reaches an infix operator, e.g.,
f x y z + g p q r
is parsed as (f x y z) + (g p q r).
The same is with your example,
float_of_int - 3
is parsed as
(float_of_int) - (3)
Another option for you would be to use a special prefix operator ~-, e.g.,
float_of_int ~-1
which has higher precedence (binds tighter) than the function application.
I am constructing a grammar in CUP, and I have run into a roadblock defining IF-THEN-ELSE statements.
My code looked like this:
start with statements;
/* Top level statements */
statements ::= statement | statement SEPARATOR statements ;
statement ::= if_statement | block | while_statement | declaration | assignment ;
block ::= START_BLOCK statements END_BLOCK ;
/* Control statements */
if_statement ::= IF expression THEN statement
| IF expression THEN statement ELSE statement ;
while_statement ::= WHILE expression THEN statement ;
But the CUP tool complained about the ambiguity in the definition of the if_statement.
I found this article describing how to eliminate the ambiguity without introducing endif tokens.
So I tried adapting their solution:
start with statements;
statements ::= statement | statement SEPARATOR statements ;
statement ::= IF expression THEN statement
| IF expression THEN then_statement ELSE statement
| non_if_statement ;
then_statement ::= IF expression THEN then_statement ELSE then_statement
| non_if_statement ;
// The statement vs then_statement is for disambiguation purposes
// Solution taken from http://goldparser.org/doc/grammars/example-if-then-else.htm
non_if_statement ::= START_BLOCK statements END_BLOCK // code block
| WHILE expression statement // while statement
| declaration | assignment ;
Sadly CUP is complaining as follows:
Warning : *** Reduce/Reduce conflict found in state #57
between statement ::= non_if_statement (*)
and then_statement ::= non_if_statement (*)
under symbols: {ELSE}
Resolved in favor of the first production.
Why is this not working? How do I fix it?
The problem here is the interaction between if statements and while statements, which you can see if you remove the while statement production from non-if-statement.
The problem is that the target of a while statement can be an if statement, and that while statement could then be in the then clause of another if statement:
IF expression THEN WHILE expression IF expression THEN statement ELSE ...
Now we have a slightly different manifestation of the original problem: the else at the end could be part of the nested if or the outer if.
The solution is to extend the distinction between restricted statements ("then-statements" in the terms of your link) to also include two different kinds of while statements:
statement ::= IF expression THEN statement
| IF expression THEN then_statement ELSE statement
| WHILE expression statement
| non_if_statement ;
then_statement ::= IF expression THEN then_statement ELSE then_statement
| WHILE expression then_statement
| non_if_statement ;
non_if_statement ::= START_BLOCK statements END_BLOCK
| declaration | assignment ;
Of course, if you extend your grammar to include other types of compound statements (such as for loops), you will have to do the same thing for each of them.
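The statement/then_statement split encodes the same policy that a hand-written recursive-descent parser gets for free: an else always attaches to the nearest open if. A minimal Python sketch (the token set and tree shape are illustrative, not the actual CUP symbols) demonstrating that resolution:

```python
# statement := "if" IDENT "then" statement ["else" statement] | IDENT
def parse_statement(toks, pos=0):
    if toks[pos] == "if":
        cond = toks[pos + 1]                     # condition: a bare identifier here
        assert toks[pos + 2] == "then"
        then_branch, pos = parse_statement(toks, pos + 3)
        else_branch = None
        if pos < len(toks) and toks[pos] == "else":
            # Greedy: the else always attaches to the innermost open if.
            else_branch, pos = parse_statement(toks, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return toks[pos], pos + 1                    # non-if statement

tree, _ = parse_statement("if a then if b then x else y".split())
print(tree)
# ('if', 'a', ('if', 'b', 'x', 'y'), None)
```

The else ends up on the inner if, which is exactly the grouping the then_statement grammar enforces.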
I have implemented the shunting-yard algorithm to parse arithmetic string expressions from infix to postfix notation.
I now also want to parse expressions with relational operators and the ternary conditional. Following the C++ operator precedence table, I added those operators with the lowest precedence and right-associativity for ? and :, and the second-lowest precedence and left-associativity for > and <.
When I now parse an expression like A>B?C:D (where A, B, C and D can be any valid arithmetic expression), I would expect A B > C D : ?, but I get A B C D : ? <. When I use parentheses to evaluate the condition first, (A>B)?C:D, it works. A mixed expression like (1+2<3+4)?C:D gives me 1 2 + 3 4 + < C D : ?, which also looks correct to me. (A<B)?5+6:C gives me A B < 5 6 : + ?, which again is messed up. Again, (A<B)?(5+6):C would fix that.
As stated in the comments, evaluating the conditions first and then proceeding with the remaining arithmetic expression would also be fine. But I have not yet stumbled upon an algorithm for evaluating expressions with relational and ternary operators in my research. Any help, even just pointing out an algorithm, would be very appreciated.
Here is the implementation of shunting-yard:
QQueue<QString> ShuntingYard::infixToPostfixWithConditionals(const QString& expression)
{
    QStack<QString> stack;
    QQueue<QString> queue;
    QStringList tokens = splitExpression(expression);
    Q_FOREACH (const QString& token, tokens)
    {
        if (isDefineOrNumber(token))
            queue.enqueue(token);
        else if (isFunction(token))
            stack.push(token);
        else if (isOperator(token))
        {
            while (!stack.isEmpty() && isOperator(stack.top())
                   && isLeftAssociativeOperator(token)
                   && !hasHigherPrecedence(token, stack.top()))
                queue.enqueue(stack.pop());
            stack.push(token);
        }
        else if (isLeftParenthese(token))
            stack.push(token);
        else if (isRightParenthese(token))
        {
            // Pop until the matching '('; check for emptiness *before* top()
            while (!stack.isEmpty() && !isLeftParenthese(stack.top()))
                queue.enqueue(stack.pop());
            if (!stack.isEmpty())
                stack.pop();                     // discard the '('
            if (!stack.isEmpty() && isFunction(stack.top()))
                queue.enqueue(stack.pop());
        }
    }
    while (!stack.isEmpty() && !isLeftParenthese(stack.top()))
        queue.enqueue(stack.pop());
    return queue;
}
EDIT: made the code and description more concise for readability.
Another EDIT: Treating ?: as left-associative gave me the expected output.
Now, regarding this question about the associativity of ternary conditionals: if I input a<b?a:b?c:d, I get a b < a b c d : ? : ?, where a < b is evaluated first, which is correct due to its higher precedence, but then b ? c : d is evaluated before the outer conditional, which is the correct right-to-left order. Confusing.
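For reference, here is a minimal Python sketch of a pop rule that produces exactly the outputs described above: ? and : get the lowest precedence and are treated as right-associative, so equal-precedence ternary operators are not popped and end up nesting right-to-left. (Depending on how a hasHigherPrecedence-style helper treats equal precedence, an associativity flag can effectively invert, which may explain the "left-associative worked" observation.) Precedence values and the operator set are illustrative:

```python
# Ternary "?"/":" lowest and right-associative, relationals next,
# then additive and multiplicative operators.
PREC  = {"?": 1, ":": 1, "<": 2, ">": 2, "+": 3, "-": 3, "*": 4, "/": 4}
RIGHT = {"?", ":"}                                # right-associative operators

def to_postfix(tokens):
    out, stack = [], []
    for tok in tokens:
        if tok not in PREC and tok not in "()":
            out.append(tok)                       # operand
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()                           # discard "("
        else:
            # left-assoc pops on >=, right-assoc pops only on strictly >
            while stack and stack[-1] != "(" and (
                PREC[stack[-1]] > PREC[tok]
                or (PREC[stack[-1]] == PREC[tok] and tok not in RIGHT)):
                out.append(stack.pop())
            stack.append(tok)
    while stack:
        out.append(stack.pop())
    return out

print(" ".join(to_postfix("A > B ? C : D".split())))
# A B > C D : ?
print(" ".join(to_postfix("a < b ? a : b ? c : d".split())))
# a b < a b c d : ? : ?
```

With this rule, A>B?C:D yields A B > C D : ?, and the nested a<b?a:b?c:d yields a b < a b c d : ? : ?, the right-to-left grouping C++ specifies for the conditional operator.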
I am currently going over CFGs and saw the following answer, but I am not sure how they got it. How did they convert this CFG into the regular expression given here?
S -> aS|bX|a
X -> aX|bY|a
Y -> aY|a
answer:
R.E -> (a*(a+ba*a+ba*ba*a))
You should learn the basic rules that I have written in my answer "constructing an equivalent regular grammar from a regular expression"; those rules will help you convert a regular expression into a right- or left-linear grammar, or a right- or left-linear grammar into a regular expression.
Note that more than one regular expression (and grammar/automaton) is possible for a language. Below, I have tried to explain how to find the regular expression given in the answer to the question in your textbook. Read each step and the linked answer(s) carefully so that you can learn to solve such questions yourself next time.
As a first step in answering such a question, you should be clear about "what language does this grammar generate?" (similarly, if you have an automaton, try to understand the language represented by that automaton).
As I said in the linked answer, grammar rules like S → eS | e correspond to the "plus closure" and generate the strings e+. Similarly, you have three such pairs of rules to generate a+ in your grammar.
S → aS | a
X → aX | a
Y → aY | a
(Note: a+ can also be written as a*a or aa*; it describes one or more 'a's.)
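This equivalence is easy to spot-check with a programming-language regex engine, here Python's re module (where + is the plus closure and | plays the role of union):

```python
import re

# a+, a*a and aa* all describe "one or more a's"
for pattern in ("a+", "a*a", "aa*"):
    rx = re.compile(pattern)
    assert not rx.fullmatch("")                        # zero a's: rejected
    assert all(rx.fullmatch("a" * n) for n in range(1, 10))
print("a+, a*a and aa* agree on all tested strings")
```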
Also notice that the grammar has no "null production" such as A → ∧, so none of the variables S, X, or Y is nullable. That implies the empty string is not a member of the language of the grammar: ε ∉ L(G).
If you look at the start variable S's production rules:
S → aS | bX | a
then it is clear that strings ω in the language either start with the symbol 'a' or with 'b' (you have two choices when applying S's productions: (1) S → aS | a, which gives 'a' as the first symbol of ω, or (2) S → bX, which produces strings that start with the symbol 'b').
Now, what is the minimum-length string ω in L(G)? The minimum-length string is "a", produced by the rule S → a.
Next, note that "b" ∉ L(G): if you apply S → bX, then later you have to replace X in the sentential form bX using one of X's production rules, and since X is also not nullable, there will always be some symbol(s) after 'b'. In other words, the sentential form bX only derives strings with ∣ω∣ ≥ 2.
From the above discussion, it is clear that using S's production rules you can generate sentential forms of the shapes a*a or a*bX, in two steps:
For a*, use S → aS repeatedly, which gives S ⇝ a*S (the symbol ⇝ means a derivation in one or more steps)
Replace the S on the right-hand side of S ⇝ a*S to get either a*a or a*bX
Also, "a*a or a*bX" can be written as S ⇝ a*(a + bX), or S ⇝ (a*(a + bX)) if you prefer to parenthesize the complete expression✎.
Now compare the production rules of S and X: they have the same shape! So, just as shown above for S, X can be used to generate sentential forms X ⇝ (a*(a + bY)).
To derive the regular expression given in the answer, replace X by (a*(a + bY)) in S ⇝ a*(a + bX); you will get:
S ⇝ a*(a + b X )
S ⇝ a*(a + b (a*(a + bY)) )
Finally, Y's production rules are comparatively simple: they just create the "plus closure" a+ (or a*a).
So let's also replace Y in the sentential form derived from S.
S ⇝ a*(a + b(a*(a + bY)))
⇝ a*(a + b(a*(a + ba*a)))
Simplify it: apply the distributive law twice to remove the inner parentheses and concatenate the regular expressions. P(Q + R) can be written as PQ + PR.✞
⇝ a*(a + b(a*(a + ba*a)))
⇝ a*(a + b(a*a + a*ba*a))
⇝ a*(a + ba*a + ba*ba*a)
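The derivation can be sanity-checked mechanically: enumerate every string the grammar derives up to some length and compare against the regular expression (written with | for union, as programming regexes require). This is an independent check, not part of the original answer:

```python
import re
from itertools import product

# Grammar from the question: S -> aS | bX | a ; X -> aX | bY | a ; Y -> aY | a
RULES = {"S": ["aS", "bX", "a"], "X": ["aX", "bY", "a"], "Y": ["aY", "a"]}

def language(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    result, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for form in frontier:
            nonterm = next((c for c in form if c.isupper()), None)
            if nonterm is None:
                result.add(form)                 # fully terminal string
            elif len(form) <= max_len:           # shortest completion still fits
                i = form.index(nonterm)
                for rhs in RULES[nonterm]:
                    nxt.add(form[:i] + rhs + form[i + 1:])
        frontier = nxt
    return result

# a*(a + ba*a + ba*ba*a), with union written as |
regex = re.compile(r"a*(a|ba*a|ba*ba*a)")
words = language(7)
for n in range(1, 8):
    for cand in map("".join, product("ab", repeat=n)):
        assert (cand in words) == bool(regex.fullmatch(cand))
print("grammar and regular expression agree on all strings up to length 7")
```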
✎ : In regular expressions in formal languages, + is used with two meanings: (i) + as a binary operator means "union", and (ii) + as a unary superscript operator means "plus closure".
✎ : In regexes in programming languages, + is only used for the "plus closure".
✞ : In programming regexes we use the ∣ symbol for union, but it is not exactly a union operator: for sets, (A ∪ B) is the same as (B ∪ A), but in a regex (A ∣ B) may not behave the same as (B ∣ A).
What you can observe from the question is that the grammar, apart from being a CFG, is also right-linear. So you can construct a finite automaton for this right-linear grammar. Once you have constructed the finite automaton, there exists a regular expression with the same language, and the conversion can be done using the steps given on this site.
Is it possible to generate a parser for a scripting language that uses the Reverse Polish notation (and a Postscript-like syntax) using bison/yacc?
The parser should be able to parse code similar to the following one:
/fib
{
dup dup 1 eq exch 0 eq or not
{
dup 1 sub fib
exch 2 sub fib
add
} if
} def
Given the short description above and the notes on Wikipedia:
http://en.wikipedia.org/wiki/Stack-oriented_programming_language#PostScript_stacks
A simple bison grammar for the above could be:
%token ADD
%token DUP
%token DEF
%token EQ
%token EXCH
%token IF
%token NOT
%token OR
%token SUB
%token NUMBER
%token IDENTIFIER
%%
program : action_list_opt
action_list_opt : action_list
| /* No Action */
action_list : action
| action_list action
action : param_list_opt operator
param_list_opt : param_list
| /* No Parameters */
param_list : param
| param_list param
param : literal
| name
| action_block
operator : ADD
| DUP
| DEF
| EQ
| EXCH
| IF
| NOT
| OR
| SUB
literal : NUMBER
name : '/' IDENTIFIER
action_block : '{' program '}'
%%
Yes. Assuming you mean one that also uses postscript notation, it means you'd define your expressions something like:
expression: operand operand operator
Rather than the more common infix notation:
expression: operand operator operand
but that hardly qualifies as a big deal. If you mean something else by "PostScript-like", you'll probably have to clarify before a better answer can be given.
Edit: Allowing an arbitrary number of operands and operators is also pretty easy:
operand_list:
| operand_list operand
;
operator_list:
| operator_list operator
;
expression: operand_list operator_list
;
As it stands, this doesn't attempt to enforce that the proper number of operands is present for any particular operator; you'd have to add those checks separately. In a typical case, postscript notation is executed on a stack machine, so most such checks become simple stack checks.
I should add that although you certainly can write such parsers in something like yacc, languages using postscript notation generally require such minimal parsing that you frequently feed them directly to some sort of virtual machine interpreter that executes them quite directly, with minimal parsing (mostly, the parsing comes down to throwing an error if you attempt to use a name that hasn't been defined).
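To illustrate that last point, here is a toy interpreter in Python (not yacc; the operator set is just what the fib example needs, and all names are illustrative) that executes the snippet from the question directly. "Parsing" is nothing more than splitting tokens and nesting { ... } blocks; everything else is stack manipulation:

```python
def tokenize(src):
    return src.replace("{", " { ").replace("}", " } ").split()

def parse(tokens):
    """Turn a flat token list into nested lists for { ... } blocks."""
    prog = []
    stack = [prog]
    for tok in tokens:
        if tok == "{":
            block = []
            stack[-1].append(block)
            stack.append(block)
        elif tok == "}":
            stack.pop()
        else:
            stack[-1].append(tok)
    return prog

def execute(prog, stack, names):
    for tok in prog:
        if isinstance(tok, list):                  # procedure block: push unevaluated
            stack.append(tok)
        elif tok.lstrip("-").isdigit():
            stack.append(int(tok))
        elif tok.startswith("/"):                  # literal name
            stack.append(tok[1:])
        elif tok == "dup":
            stack.append(stack[-1])
        elif tok == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif tok == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif tok == "sub":
            b, a = stack.pop(), stack.pop(); stack.append(a - b)
        elif tok == "eq":
            b, a = stack.pop(), stack.pop(); stack.append(a == b)
        elif tok == "or":
            b, a = stack.pop(), stack.pop(); stack.append(a or b)
        elif tok == "not":
            stack.append(not stack.pop())
        elif tok == "if":
            proc, cond = stack.pop(), stack.pop()
            if cond:
                execute(proc, stack, names)
        elif tok == "def":
            body, name = stack.pop(), stack.pop()
            names[name] = body
        elif tok in names:                         # user-defined word: run its body
            execute(names[tok], stack, names)
        else:
            raise ValueError(f"unknown token {tok!r}")

src = ("/fib { dup dup 1 eq exch 0 eq or not "
       "{ dup 1 sub fib exch 2 sub fib add } if } def")
stack, names = [], {}
execute(parse(tokenize(src)), stack, names)
execute(parse(tokenize("7 fib")), stack, names)
print(stack)   # [13]
```

Note how little of this is parsing in the yacc sense: undefined names are the only syntax error the executor can even detect, which is exactly why such languages are usually fed straight to an interpreter.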