Yacc grammar producing incorrect terminal - c++

I've been working on a hobby compiler for a while now, using lex and yacc for the parsing stage. This is all working fine for the majority of things, but when I added in if statements, the production rule for symbols is now giving the previous (or next?) item on the stack instead of the symbol value needed.
Grammar is given below with hopefully unrelated rules taken out:
%{
...
%}
%define parse.error verbose
%token ...
%%
Program:
Function { root->addChild($1);}
;
Function:
Type Identifier '|' ArgumentList '|' StatementList END
{ $$ = new FunctionDef($1, $2, $4, $6); }
/******************************************/
/* Statements and control flow ************/
/******************************************/
Statement:
Expression Delimiter
| VariableDeclaration Delimiter
| ControlFlowStatement Delimiter
| Delimiter
;
ControlFlowStatement:
IfStatement
;
IfStatement:
IF Expression StatementList END { $$ = new IfStatement($2, $3); }
| IF Expression StatementList ELSE StatementList END { $$ = new IfStatement($2, $3, $5);}
;
VariableDeclaration:
Type Identifier { $$ = new VariableDeclaration($1, $2);}
| Type Identifier EQUALS Expression { $$ = new VariableDeclaration($1, $2, $4);}
;
StatementList:
StatementList Statement { $1->addChild($2); }
| Statement { $$ = new GenericList($1); }
;
Delimiter:
';'
| NEWLINE
;
Type:
...
Expression:
...
PostfixExpression:
Value '[' Expression ']' { std::cout << "TODO: indexing operators ([ ])" << std::endl;}
| Value '.' SYMBOL { std::cout << "TODO: member access" << std::endl;}
| Value INCREMENT { $$ = new UnaryExpression(UNARY_POSTINC, $1); }
| Value DECREMENT { $$ = new UnaryExpression(UNARY_POSTDEC, $1); }
| Value '(' ')' { $$ = new FunctionCall($1, NULL); }
| Value '(' ExpressionList ')' { $$ = new FunctionCall($1, $3); }
| Value
;
Value:
BININT { $$ = new Integer(yytext, 2); }
| HEXINT { $$ = new Integer(yytext, 16); }
| DECINT { $$ = new Integer(yytext); }
| FLOAT { $$ = new Float(yytext); }
| SYMBOL { $$ = new Symbol(yytext); }
| STRING { $$ = new String(yytext); }
| LambdaFunction
| '(' Expression ')' { $$ = $2; }
| '[' ExpressionList ']' { $$ = $2;}
;
LambdaFunction:
...
%%
I cannot work out what about the control flow code can make the Symbol:
rule match something that isn't classed as a symbol from the lex definition:
symbol [a-zA-Z_]+(alpha|digit)*
...
{symbol} {return SYMBOL;}
Any help from somebody who knows about yacc and grammars in general would be very much appreciated. Also example files of the syntax it parses can be shown if necessary.
Thanks!

You cannot count on the value of yytext outside of a flex action.
Bison grammars typically read a lookahead token before deciding on how to proceed, so in a bison action, yytext has already been replaced with the token value of the lookahead token. (You can't count on that either, though: sometimes no lookahead token is needed.)
So you need to make a copy of yytext before the flex action returns and make that copy available to the bison grammar by placing it into the yylval semantic union.
See this bison FAQ entry
By the way, the following snippet from your flex file is incorrect:
symbol [a-zA-Z_]+(alpha|digit)*
In that regular expression, alpha and digit are just ordinary strings, so it is the same as [a-zA-Z_]+("alpha"|"digit")*, which means that it will match, for example, a_digitdigitdigit but not a_123. (It would have matched a_digitdigitdigit without the part following the +, so I presume that wasn't your intention.)
On the whole, I think it's better to use Posix character classes than either hand-written character classes or defined symbols, so I would write that as
symbol [[:alpha:]_]([[:alnum:]_]*[[:alnum:]])?
assuming that your intention is that a symbol can start but not end with an underscore, and end but not start with a digit. Using Posix character classes requires you to execute flex with the correct locale -- almost certainly the C locale -- but so do character ranges, so there is nothing to be lost by using the self-documenting Posix classes.
(Of course, I have no idea what your definitions of {alpha} and {digit} are, but it seems to me that they are either the same as [[:alpha:]] and [[:digit:]], in which case they are redundant, or different from the Posix classes, in which case they are confusing to the reader.)

Related

Expression occurrences in flex/bison

Suppose I have a Bison expression like this:
multiply: T_FIGURE { $$ = $1; }
| multiply T_ASTERISK multiply { $$ = $1 * $3; }
;
It should return a result of multiplying some figures or give the input back if only one figure provided. If I wanted to limit the number of figures provided to at most 3, I would rewrite the expression like this:
multiply: T_FIGURE { $$ = $1; }
| T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3; }
| T_FIGURE T_MULTIPLY T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3 * $5; }
;
My question: is there a way to rewrite this expression so that I wouldn't have to manually specify the occurrences and instead use some kind of parameter to be able to easily change the upper limit to, for example, 30 occurrences?
In a word, "No". That is not a feature of bison (nor any yacc derivative I know of).
The easiest way to solve problems like this is to use a code generator. Either use an available macro processor like m4 or write your own script in whatever scripting language you feel comfortable with.
You could also solve the problem dynamically by counting arguments in your semantic action (which means modifying your semantic type to include both a value and a count.) You could then throw a syntax error if the count is exceeded. (Again, in your semantic action.) The main advantage of this approach is that avoids blowing up the parser's state table. If your limits are large and interact with each other, you might find you are producing a very large state machine.
As a very simple example (with only a single operator):
%{
typedef struct ExprNode {
int count;
double value;
} ExprNode;
%}
%union {
ExprNode expr;
double value;
}
%token <value> T_FIGURE
%type <expr> expr multiply
%%
expr: T_FIGURE { $$.count = 0; $$.value = $1; }
multiply: expr
| multiply '*' expr { if ($1.count >= LIMIT) {
yyerror("Too many products");
YYABORT;
}
else {
$$.count = $1.count + 1;
$$.value = $1.value * $3.value;
}
}

Accept brackets correctly in bisonc++

I've tried to write a basic syntax checker using bisonc++
The rules are:
expression -> OPEN_BRACKET expression CLOSE_BRACKET
expression -> expression operator expression
operator -> PLUS
operator -> MINUS
If I try to run the compiled code, I get an error at this line:
(a+b)-(c+d)
The first rule is applied, the leftmost and the rightmost brackets are the OPEN_BRACKET and the CLOSE_BRACKET. The remaining expression is: a+b)-(c+d
How is it possible to prevent this behaviour? Is it possible to count the open and closed brackets?
Edit
The expression grammar:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
| operator
{
//
}
| VARIABLE
{
//
}
;
operator:
expression PLUS expression
{
//
}
| expression MINUS expression
{
//
}
;
Edit2
The lexer
CHAR [a-z]
WS [ \t\n]
%%
{CHAR}+ return Parser::VARIABLE;
"+" return Parser::PLUS;
"-" return Parser::MINUS;
"(" return Parser::OPEN_BRACKET;
")" return Parser::CLOSE_BRACKET;
This is not a normal expression grammar. Try the normal one.
expression
: term
| expression '+' term
| expression '-' term
;
term
: factor
| term '*' factor
| term '/' factor
| term '%' factor
;
factor
: primary
| '-' factor // unary minus
| primary '^' factor // exponentiation, right-associative
;
primary
: identifier
| literal
| '(' expression ')'
;
Note also the above method of indenting and aligning, and that you only have to return yytext[0] from the lexer for single special characters: you don't need special token names, and it's more readable without them:
CHAR [a-zA-Z]
DIGIT [0-9]
WHITESPACE [ \t\r\n]
%%
{CHAR}+ { return Parser::VARIABLE; }
{DIGIT}+ { return Parser::LITERAL; }
{WHITESPACE}+ ;
. { return yytext[0]; }
Your operator rule does not look good.
Try experiment with:
expression:
OPEN_BRACKET expression CLOSE_BRACKET
{
//
}
|
expression operator expression
{
//
}
|
VARIABLE
{
//
}
;
operator:
PLUS
{
//
}
|
MINUS
{
//
}
;
As your pseudo code actually suggests...

How to define unrecognized rules in Ocamlyacc

I'm working on company projet, where i have to create a compilator for a language using Ocamlyacc and Ocamllex. I want to know if is it possible to define a rule in my Ocamlyacc Parser that can tell me that no rules of my grammar matching the syntax of an input.
I have to insist that i'am a beginner in Ocamllex/Ocamlyacc
Thank you a lot for your help.
If no rule in your grammar matches the input, then Parsing.Parse_error exception is raised. Usually, this is what you want.
There is also a special token called error that allows you to resynchronize your parser state. You can use it in your rules, as it was a real token produced by a lexer, cf., eof token.
Also, I would suggest to use menhir instead of more venerable ocamlyacc. It is easier to use and debug, and it also comes with a good library of predefined grammars.
When you write a compiler for a language, the first step is to run your lexer and to check if your program is good from a lexical point of view.
See the below example :
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
It's a lexer to recognize some arithmetic expressions.
If your lexer accepts the input then you give the sequence of lexemes to the parser which try to find if a AST can be build with the specified grammar. See :
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
This is a little program to parse arithmetic expression. A program can be rejected at this step because there is no rule of the grammar to apply in order to have an AST at the end. There is no way to define unrecognized rules but you need to write a grammar which define how a program can be accepted or rejected.
let _ =
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Parser.main Lexer.token lexbuf in
print_int result; print_newline(); flush stdout
done
with Lexer.Eof ->
exit 0
If your compile the lexer, the parser and the last program, you have :
1 + 2 is accepted because there is no error lexical errors and an AST can be build corresponding to this expression.
1 ++ 2 is rejected : no lexical errors but there is no rule to build a such AST.
You can found more documentation here : http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html

OCaml interpreter: evaluate a function inside a function

I'm trying to write an interpreter in OCaml and I have a problem here.
In my program, I want to call a function like this, for example:
print (get_line 4) // print: print to stdout, get_line: get a specific line in a file
How can I do that? The problem is in our parser, I think so as it defines how a program will be run, how a function is defined and the flow of a program. This is what I have so far in parser an lexer (code below), but it didn't seem to work. I don't really see any difference between my code and the calculator on OCaml site, the statement inside the bracket is evaluated firstly, then return its value to its parent operation to do the next evaluating.
In my interpreter, the function get_line inside bracket is evaluate firstly, but I don't think it returns the value to print function, or it does but wrong type (checked, but I don't think it's this error).
One difference between calculator and my interpreter is that the calculator is working with primitive types, mine are functions. But they should be similar.
This is my code, just a part of it:
parser.mly:
%token ODD
%token CUT
%start main
%type <Path.term list> main
%%
main:
| expr EOL main {$1 :: $3}
| expr EOF { [$1] }
| EOL main { $2 }
;
expr:
| ODD INT { Odd $2}
| ODD LPAREN INT RPAREN expr { Odd $3 }
| CUT INT INT { Cut ($2, $3)}
| CUT INT INT expr { Cut ($2, $3) }
lexer.mll:
{
open Parser
}
(* define all keyword used in the program *)
rule main =
parse
| ['\n'] { EOL }
| ['\r']['\n'] { EOL }
| [' ''\t''\n'] { main lexbuf }
| '(' { LPAREN }
| ')' { RPAREN }
| "cut" { CUT }
| "trunclength" { TRUNCLENGTH }
| "firstArithmetic" { FIRSTARITH }
| "f_ArithmeticLength" { F_ARITHLENGTH }
| "secondArithmetic" { SECARITH }
| "s_ArithmeticLength" { S_ARITHLENGTH }
| "odd" { ODD }
| "oddLength" { ODDLENGTH }
| "zip" { ZIP }
| "zipLength" { ZIPLENGTH }
| "newline" { NEWLINE }
| eof { EOF }
| ['0' - '9']+ as lxm { INT(int_of_string lxm) }
| ['a'-'z''A'-'Z'] ['a'-'z''A'-'Z''0'-'9']* as lxm { STRING lxm }
| ODD LPAREN INT RPAREN expr { Odd $3 }
Your grammar rule requires an INT between parenthesis. You need to change that to an expr. There are a number of other issues with this, but I'll leave it at that.
First, you parser only tries to build a list of Path.term, but what do you want to do with it?
Then, there are many things wrong with your parser, so I don't really know where to start. For instance, the second and fourth case of the expr rule totally ignore the last expr. Moreover, your parser only recognize expressions containing "odd <int>" (or "odd (<int>)") and "cut <int> <int>", so how is it supposed to evaluate print and get_line? You should edit your question and try to make it clearer.
To evaluate expressions, you can
do it directly inside the semantic-actions (as in the calculator example),
or (better) build an AST (for Abstract Syntax Tree) with your parser and then interpret it.
If you want to interpret print (get_line 4), your parser need to know what print and get_line mean. In your code, your parser will see print or get_line as a STRING token (having a string value). As they seem to be keywords in your language, your lexer should recognize them and return a specific token.

What is $$ in bison?

In the bison manual in section 2.1.2 Grammar Rules for rpcalc, it is written that:
In each action, the pseudo-variable $$ stands for the semantic value
for the grouping that the rule is going to construct. Assigning a
value to $$ is the main job of most actions
Does that mean $$ is used for holding the result from a rule? like:
exp exp '+' { $$ = $1 + $2; }
And what's the typical usage of $$ after begin assigned to?
Yes, $$ is used to hold the result of the rule. After being assigned to, it typically becomes a $x in some higher-level (or lower precedence) rule.
Consider (for example) input like 2 * 3 + 4. Assuming you follow the normal precedence rules, you'd have an action something like: { $$ = $1 * $3; }. In this case, that would be used for the 2 * 3 part and, obviously enough, assign 6 to $$. Then you'd have your { $$ = $1 + $3; } to handle the addition. For this action, $1 would be given the value 6 that you assigned to $$ in the multiplication rule.
Does that mean $$ is used for holding the result from a rule? like:
Yes.
And what's the typical usage of $$ after begin assigned to?
Typically you won’t need that value again. Bison uses it internally to propagate the value. In your example, $1 and $2 are the respective semantic values of the two exp productions, that is, their values were set somewhere in the semantic rule for exp by setting its $$ variable.
Try this. Create a YACC file with:
%token NUMBER
%%
exp: exp '+' NUMBER { $$ = $1 + $3; }
| exp '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
Then process it using Bison or YACC. I am using Bison but I assume YACC is the same. Then just find the "#line" directives. Let us find the "#line 3" directive; it and the relevant code will look like:
#line 3 "DollarDollar.y"
{ (yyval) = (yyvsp[(1) - (3)]) + (yyvsp[(3) - (3)]); }
break;
And then we can quickly see that "$$" expands to "yyval". That other stuff, such as "yyvsp", is not so obvious but at least "yyval" is.
$$ represents the result reference of the current expression's evaluation. In other word, its result.Therefore, there's no particular usage after its assignation.
Bye !