bison error: rule given for Semi, which is a token? - c++

My bison grammar produces an error:
parser.yy:
%union {
Ast *ast;
char *str;
int tok;
}
%token <tok> NEWLINE SEMICOLON
%type <ast> Semi
%%
Semi: NEWLINE { $$ = new Ast($1); }
| SEMICOLON { $$ = new Ast($1); }
;
Statements: Statement
| Statement Semi Statements
;
Statement: ...
;
%%
It gives the error message:
Parser.yy:xxx.x-x: error: rule given for Semi, which is a token
Is there a way to implement this, or do I have to write it like this?
Statements: Statement
| Statement NEWLINE Statements
| Statement SEMICOLON Statements
;

Semi is declared as a token, and a token cannot be given grammar rules: a symbol is either a terminal (declared with %token) or a nonterminal defined by rules, never both. If the lexer already returns a single token for the separator, drop the Semi rules and just use that token directly; otherwise remove Semi from any %token declaration so it becomes an ordinary nonterminal.
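If Semi is meant to be a nonterminal, make sure it appears only in a %type declaration and in no %token declaration. A minimal sketch, using the names from the question:

```
%union {
    Ast *ast;
    int tok;
}
%token <tok> NEWLINE SEMICOLON  /* terminals only */
%type  <ast> Semi               /* nonterminal: %type, never %token */
%%
Semi: NEWLINE   { $$ = new Ast($1); }
    | SEMICOLON { $$ = new Ast($1); }
    ;
```

With that, the Statements rules from the question can use Semi unchanged.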

Related

Bison if statements - setting symbol table prior to parsing block statements

In my language I can declare a variable in the current symbol-table scope and also create an if statement, which generates a new symbol-table scope for its statements.
stmts : stmt { $$ = new Block(); $$->addStatement($1); }
| stmts stmt { $1->addStatement($2); }
| /*blank*/ { $$ = new Block(); }
;
stmt : vardecl
| ifstmt
;
ifstmt : TIF TLPAREN exprBase TRPAREN TOPENBLOCK stmts TCLOSEBLOCK {
semanticAnalyzerParser->enterScope("if statement scope");
$$ = new IfStatement($3, $7);
}
;
assign : ident ident TASSIGN exprBase {
Var* typeName = $1;
Var* varName = $2;
ExpressionBase* exprBase = $4;
semanticAnalyzerParser->getScope()->registerVariable(typeName->identifier, varName->identifier, exprBase);
$$ = new VarDecl(typeName, varName, exprBase);
}
;
What I'd like to do is enter a new scope before bison parses the if statement's block of statements, e.g. semanticAnalyzerParser->enterScope("if statement scope");, so that when the grammar for a variable declaration is recognised it registers the variable in the correct scope via semanticAnalyzerParser->getScope()->registerVariable(typeName->identifier, varName->identifier, exprBase);
However, since bison has to recognise the complete grammar of an if statement, the action only runs after the whole statement has been parsed, and the variables are therefore registered in the wrong scope.
How can I execute code before the stmts part of the ifstmt grammar is parsed, so that the correct scope is set? I know one option is to walk the AST afterwards, but I want to avoid that, since the ASTs created in bison are largely determined by information gathered during semantic analysis.
Usually you would do this with an "embedded" action:
ifstmt : TIF TLPAREN exprBase TRPAREN {
semanticAnalyzerParser->enterScope("if statement scope"); }
TOPENBLOCK stmts TCLOSEBLOCK {
semanticAnalyzerParser->leaveScope("if statement scope");
$$ = new IfStatement($3, $7); }
;
The embedded action will be executed after the first part of the ifstmt is recognized (up to the TRPAREN, with the TOPENBLOCK as lookahead), before the body (stmts) is parsed. Note that the embedded action occupies a positional value of its own ($5 here), which is why the body is referred to as $7 rather than $6.

How to define unrecognized rules in Ocamlyacc

I'm working on a company project where I have to create a compiler for a language using Ocamlyacc and Ocamllex. I want to know whether it is possible to define a rule in my Ocamlyacc parser that tells me that no rule of my grammar matches the syntax of an input.
I have to insist that I am a beginner with Ocamllex/Ocamlyacc.
Thanks a lot for your help.
If no rule in your grammar matches the input, then the Parsing.Parse_error exception is raised. Usually, this is what you want.
There is also a special token called error that allows you to resynchronize your parser state. You can use it in your rules as if it were a real token produced by the lexer, cf. the eof token.
Also, I would suggest using menhir instead of the more venerable ocamlyacc. It is easier to use and debug, and it also comes with a good library of predefined grammars.
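As a sketch of error-token recovery in ocamlyacc (reusing the arithmetic grammar from the other answer; the message is illustrative):

```
main:
    expr EOL        { $1 }
  | error EOL       { prerr_endline "syntax error on this line"; 0 }
;
```

When no ordinary rule applies, the parser discards input until it can shift error followed by EOL, so parsing can continue with the next line instead of aborting.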
When you write a compiler for a language, the first step is to run your lexer and check that your program is well-formed from a lexical point of view.
See the example below:
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
It's a lexer that recognizes some arithmetic expressions.
If your lexer accepts the input, you hand the sequence of lexemes to the parser, which tries to determine whether an AST can be built with the specified grammar. See:
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
This is a little program to parse arithmetic expressions. A program can be rejected at this step because there is no grammar rule that can be applied to produce an AST. There is no way to define "unrecognized rules"; instead you write a grammar that defines how a program is accepted, and everything else is rejected.
let _ =
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Parser.main Lexer.token lexbuf in
print_int result; print_newline(); flush stdout
done
with Lexer.Eof ->
exit 0
If you compile the lexer, the parser and the last program, you have:
1 + 2 is accepted because there are no lexical errors and an AST can be built for this expression.
1 ++ 2 is rejected: no lexical errors, but there is no rule to build such an AST.
You can find more documentation here: http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html
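To reject bad input gracefully instead of letting the exception escape, the same driver can also catch Parsing.Parse_error, a sketch:

```
let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parser.main Lexer.token lexbuf in
      print_int result; print_newline (); flush stdout
    done
  with
  | Lexer.Eof -> exit 0
  | Parsing.Parse_error -> prerr_endline "syntax error"; exit 1
```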

Yacc grammar producing incorrect terminal

I've been working on a hobby compiler for a while now, using lex and yacc for the parsing stage. This is all working fine for the majority of things, but when I added in if statements, the production rule for symbols is now giving the previous (or next?) item on the stack instead of the symbol value needed.
Grammar is given below with hopefully unrelated rules taken out:
%{
...
%}
%define parse.error verbose
%token ...
%%
Program:
Function { root->addChild($1);}
;
Function:
Type Identifier '|' ArgumentList '|' StatementList END
{ $$ = new FunctionDef($1, $2, $4, $6); }
/******************************************/
/* Statements and control flow ************/
/******************************************/
Statement:
Expression Delimiter
| VariableDeclaration Delimiter
| ControlFlowStatement Delimiter
| Delimiter
;
ControlFlowStatement:
IfStatement
;
IfStatement:
IF Expression StatementList END { $$ = new IfStatement($2, $3); }
| IF Expression StatementList ELSE StatementList END { $$ = new IfStatement($2, $3, $5);}
;
VariableDeclaration:
Type Identifier { $$ = new VariableDeclaration($1, $2);}
| Type Identifier EQUALS Expression { $$ = new VariableDeclaration($1, $2, $4);}
;
StatementList:
StatementList Statement { $1->addChild($2); }
| Statement { $$ = new GenericList($1); }
;
Delimiter:
';'
| NEWLINE
;
Type:
...
Expression:
...
PostfixExpression:
Value '[' Expression ']' { std::cout << "TODO: indexing operators ([ ])" << std::endl;}
| Value '.' SYMBOL { std::cout << "TODO: member access" << std::endl;}
| Value INCREMENT { $$ = new UnaryExpression(UNARY_POSTINC, $1); }
| Value DECREMENT { $$ = new UnaryExpression(UNARY_POSTDEC, $1); }
| Value '(' ')' { $$ = new FunctionCall($1, NULL); }
| Value '(' ExpressionList ')' { $$ = new FunctionCall($1, $3); }
| Value
;
Value:
BININT { $$ = new Integer(yytext, 2); }
| HEXINT { $$ = new Integer(yytext, 16); }
| DECINT { $$ = new Integer(yytext); }
| FLOAT { $$ = new Float(yytext); }
| SYMBOL { $$ = new Symbol(yytext); }
| STRING { $$ = new String(yytext); }
| LambdaFunction
| '(' Expression ')' { $$ = $2; }
| '[' ExpressionList ']' { $$ = $2;}
;
LambdaFunction:
...
%%
I cannot work out how adding the control-flow code can make the SYMBOL rule match something that isn't classed as a symbol by the lex definition:
symbol [a-zA-Z_]+(alpha|digit)*
...
{symbol} {return SYMBOL;}
Any help from somebody who knows about yacc and grammars in general would be very much appreciated. Also example files of the syntax it parses can be shown if necessary.
Thanks!
You cannot count on the value of yytext outside of a flex action.
Bison grammars typically read a lookahead token before deciding on how to proceed, so in a bison action, yytext has already been replaced with the token value of the lookahead token. (You can't count on that either, though: sometimes no lookahead token is needed.)
So you need to make a copy of yytext before the flex action returns and make that copy available to the bison grammar by placing it into the yylval semantic union.
See this bison FAQ entry
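A common fix, as a sketch (the str field name is an assumption, not from the question):

```
/* flex: hand bison a private copy of the lexeme */
{symbol}    { yylval.str = strdup(yytext); return SYMBOL; }

/* bison */
%union { char *str; /* ... */ }
%token <str> SYMBOL
%%
Value: SYMBOL { $$ = new Symbol($1); free($1); }
```

The copy made with strdup survives until the action for the rule runs, no matter how much lookahead the parser reads in between.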
By the way, the following snippet from your flex file is incorrect:
symbol [a-zA-Z_]+(alpha|digit)*
In that regular expression, alpha and digit are just ordinary strings, so it is the same as [a-zA-Z_]+("alpha"|"digit")*, which means that it will match, for example, a_digitdigitdigit but not a_123. (It would have matched a_digitdigitdigit without the part following the +, so I presume that wasn't your intention.)
On the whole, I think it's better to use Posix character classes than either hand-written character classes or defined symbols, so I would write that as
symbol [[:alpha:]_]([[:alnum:]_]*[[:alnum:]])?
assuming that your intention is that a symbol can start but not end with an underscore, and end but not start with a digit. Using Posix character classes requires you to execute flex with the correct locale -- almost certainly the C locale -- but so do character ranges, so there is nothing to be lost by using the self-documenting Posix classes.
(Of course, I have no idea what your definitions of {alpha} and {digit} are, but it seems to me that they are either the same as [[:alpha:]] and [[:digit:]], in which case they are redundant, or different from the Posix classes, in which case they are confusing to the reader.)

Bison parser calculator

this is my bison code:
%}
%union {
int int_val;
}
%left '+' '-'
%nonassoc '(' ')'
%token INTEGER PRINT
%type <int_val> expr_int INTEGER
%%
program: command '\n' { return 0; }
;
command: print_expr
;
print_expr: PRINT expr_int { cout<<$2<<endl; }
expr_int: expr_int '+' expr_int { $$ = $1 + $3; }
| expr_int '-' expr_int { $$ = $1 - $3; }
| '(' expr_int ')' { $$ = $2; }
| INTEGER
;
and this is the flex code:
%}
INTEGER [1-9][0-9]*|0
BINARY [-+]
WS [ \t]+
BRACKET [\(\)]
%%
print{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
{BINARY}|\n { return *yytext; }
{BRACKET} { return *yytext; }
{WS} {}
. { return *yytext; }
%%
//////////////////////////////////////////////////
int yywrap(void) { return 1; } // Callback at end of file
Valid inputs for the program are:
print 5
output:
5
input:
print (1+1)
output:
2
But for some reason, for the following inputs I do not get immediate error:
print (1+1))
output:
2
some error
input:
print 5!
output:
5
some error
I would like an error to be printed immediately, without committing the print command and then throwing an error afterwards.
How should I change the program so it will not print erroneous inputs?
Download the "flex & bison" book by John Levine or the "bison" manual from gnu. Both contain an infix calculator that you can reference.
The grammar you have written, '(' expr_int ')', reduces to expr_int before the grammatically incorrect ')' in "(1+1))" is detected. That is, the parser does:
(1 + 1)) => ( expr_int )) => expr_int)
and then sees the error. In order to capture the error you have to change the parser to see the error before the reduction, and you have to do it for all errors that you want treated. Therefore you would write (in this case):
expr_int: '(' expr_int ')' ')' { /* report the error here */ }
The short answer, after the long answer, is that it is impractical to generate a parser containing instances of all possible errors. What you have is fine for what you are doing. What you should explore is how to (gracefully) recover from an error rather than abandoning parsing.
Your "program" and "command" non-terminals can be combined as:
program: print_expr '\n' { return 0; }
On a separate note, your regular expressions can be rewritten to good effect as:
INTEGER [0-9]+
WS [ \t]+
%%
print/{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
"(" { return '('; }
")" { return ')'; }
"+" { return '+'; }
"-" { return '-'; }
{WS} {}
\n { return '\n'; }
. { return *yytext; } // do you really want to do this?
%%
Create an end-of-statement token (e.g. ;) for your language and only treat a line as a complete statement at the point where this token is seen.
Well that is because you are executing the code while parsing it. The good old bison calculator is meant to teach you how to write a grammar, not implement a full compiler/interpreter.
The normal way to build compiler/interpreter is the following:
lexer -> parser -> semantic analyser -> code generator -> interpreter
Granted, building a fully fledged compiler may be overkill in your case. What you need to do is store the result somewhere and only output it after yyparse has returned without an error.
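One way to defer the output, sketched on top of the grammar from the question (result_value is an assumed global, not part of the original code):

```
%{
int result_value;   /* filled in during the parse, printed afterwards */
%}
...
print_expr: PRINT expr_int { result_value = $2; }
;
%%
int main() {
    if (yyparse() == 0)  /* print only if the entire input was accepted */
        std::cout << result_value << std::endl;
    return 0;
}
```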
In yacc/bison, the code associated with a semantic action is executed as soon as the rule in question is reduced, which may happen immediately after the tokens for the rule have been shifted, before looking at any following context to see whether there's an error (a so-called "default" reduction, used to make the parse tables smaller).
If you want to avoid printing an answer until an entire line is read (and recognized), you need to include the newline in the rule that has the action that prints the message. In your case, you can move the newline from the program rule to the print_expr rule:
program: command { return 0; } ;
print_expr: PRINT expr_int '\n' { cout<<$2<<endl; }
Of course, this will still give you an error (after printing output) if you give it multiple lines of input.

Bison & re2c: Get current line number

I'm dealing with a PHP grammar and I want to pass the line number to my function.
I have something like:
internal_functions_in_yacc:
T_ISSET '(' isset_variables ')'
| T_EMPTY '(' variable ')'
| T_INCLUDE expr { observers.IncludeFound($2); }
| T_INCLUDE_ONCE expr { observers.IncludeFound($2); }
| T_EVAL '(' expr ')'
| T_REQUIRE expr { observers.IncludeFound($2); }
| T_REQUIRE_ONCE expr { observers.IncludeFound($2); }
;
Now I want to pass line number, something like
T_REQUIRE_ONCE expr { observers.IncludeFound($2,$line_number_here); }
Is there a way to know the line number of the token that bison is parsing? Or is it something that has to be done in lexing?
EDIT
I found that lexing is done using re2c, not lex.
If locations are enabled then they can be accessed using @n, with n being the token's position in the rule.
T_REQUIRE_ONCE expr { observers.IncludeFound($2, @2.first_line); }
Edit:
To expand on the answer: %locations is the directive in the link that enables locations in bison. The lexer is still responsible for incrementing the line numbers; with flex this requires %option yylineno.
Lex File:
\n { yylloc->lines(yyleng); yylloc->step(); }