Bison outputting string after the wrong line - c++

The input
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
is yielding the output
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
Narrowing Variable Initialization
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
The error message should appear under line 4, which is where the error actually occurs, not under line 5. I've looked at it for hours and can't figure it out.
%union
{
char* ident;
Types types;
}
%token <ident> IDENTIFIER
%token <types> INTEGER_LITERAL
%token <types> REAL_LITERAL
%token BEGIN_
%token FUNCTION
%token IS
%token <types> INTEGER
%token <types> REAL
%token RETURNS
%type <types> expression
%type <types> factor
%type <types> literal
%type <types> term
%type <types> statement
%type <types> type
%type <types> variable
%%
program:
/* empty */ |
functions ;
functions:
function_header_recovery body ; |
function_header_recovery body functions ;
function_header_recovery:
function_header ';' |
error ';' ;
function_header:
FUNCTION {locals = new Locals();} IDENTIFIER optional_parameters RETURNS type {globals->insert($3,locals->tList);} ;
optional_parameters:
/* empty */ |
parameters;
parameters:
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } |
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } "," parameters;
type:
INTEGER | REAL ;
body:
optional_variables BEGIN_ statement END ';' ;
optional_variables:
/* empty */ |
variables ;
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
variable:
IDENTIFIER ':' type {locals->insert($1, $3);} {$$ = $3;} ;
statement:
expression ';' |
...
Types checkTypes(Types left, Types right, int flag)
{
if (left == right)
{
return left;
}
if (flag == 1)
{
Listing::appendError("Conditional Expression Type Mismatch", Listing::SEMANTIC);
}
else if (flag == 2)
{
if (left < right)
{
Listing::appendError("Narrowing Variable Initialization", Listing::SEMANTIC);
}
}
return REAL_TYPE;
}
Printing is handled by:
void Listing::nextLine()
{
printf("\n");
if (error == "")
{
lineNo++;
printf("%4d%s",lineNo," ");
}
else
{
printf("%s", error.c_str());
error = "";
nextLine();
}
}
void Listing::appendError(const char* errText, int errEnum)
{
error = error + errText;
if (errEnum == 997)
{
lexErrCount++;
}
else if (errEnum == 998)
{
synErrCount++;
}
else if (errEnum == 999)
{
semErrCount++;
}
}
void Listing::display()
{
printf( "\b\b\b\b\b\b " );
if (lexErrCount + synErrCount + semErrCount > 0)
{
printf("\n\n%s%d","Lexical Errors ",lexErrCount);
printf("\n%s%d","Syntax Errors ",synErrCount);
printf("\n%s%d\n","Semantic Errors ",semErrCount);
}
else
{
printf("\nCompiled Successfully\n");
}
}

That's just the way bison works. It produces a one-token lookahead parser, so your production actions don't get triggered until it has read the token following the production. Consequently, begin must be read before the action associated with variables happens. (bison never tries to combine actions, even if they are textually identical. So it really cannot know which variables production applies and which action to execute until it sees the following token.)
There are various ways to associate a line number and/or column position with each token, and to use that information when an error message is to be produced. Interspersing the errors and/or warnings with the input text, in general, requires buffering the input. For syntax errors, you only need to buffer until the next token, but that is not a general solution: in some cases you may want to associate an error with an operator, but the error won't be detected until the operator's trailing argument has been parsed.
A simple technique to correctly intersperse errors/warnings with source is to write all the errors/warnings to a temporary file, putting the file offset at the front of each error. This file can then be sorted, and the input can then be reread, inserting the error messages at appropriate points. The nice thing about this strategy is that it avoids having to maintain line numbers for each error, which noticeably slows down lexical analysis. Of course, it won't work so easily if you allow constructs like C's #include.
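The sort-and-merge idea can be sketched in C++ as follows. This is only an illustration under stated assumptions: it keeps the diagnostics in memory rather than in a temporary file, and `Diagnostic` and `mergeListing` are hypothetical names that are not part of the asker's `Listing` class. It also assumes the lexer can supply a byte offset for each token.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: where in the input the problem is, and what to say.
struct Diagnostic {
    std::size_t offset;   // byte offset of the offending token in the source
    std::string message;
};

// Re-read the source and place each message right after the line that
// contains its offset, regardless of when the parser reported it.
std::string mergeListing(const std::string& source, std::vector<Diagnostic> diags) {
    std::stable_sort(diags.begin(), diags.end(),
                     [](const Diagnostic& a, const Diagnostic& b) {
                         return a.offset < b.offset;
                     });
    std::ostringstream out;
    std::istringstream in(source);
    std::string line;
    std::size_t lineStart = 0;  // offset of the first byte of the current line
    std::size_t next = 0;       // index of the next diagnostic to emit
    int lineNo = 0;
    while (std::getline(in, line)) {
        const std::size_t lineEnd = lineStart + line.size() + 1;  // one past '\n'
        out << ++lineNo << "  " << line << '\n';
        while (next < diags.size() && diags[next].offset < lineEnd) {
            out << diags[next].message << '\n';
            ++next;
        }
        lineStart = lineEnd;
    }
    // Diagnostics located past the end of the input go last.
    for (; next < diags.size(); ++next) {
        out << diags[next].message << '\n';
    }
    return out.str();
}
```

With this arrangement a semantic action can report the narrowing error whenever it happens to run; as long as the recorded offset points into line 4, the message lands under line 4 in the merged listing.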
Because generating good error messages is hard, and even tracking locations can slow parsing down quite a bit, I've sometimes used the strategy of parsing input twice if an error is detected. The first parse only detects errors and fails early if it can't do anything more reasonable; if an error is detected, the input is reparsed with a more elaborate parser which carefully tracks file locations and possibly even uses heuristics like indentation depth to try to produce better error messages.

As rici notes, bison produces an LALR(1) parser, so it uses one token of lookahead to know what action to take. However, it doesn't ALWAYS use a token of lookahead -- in some cases (where there's only one possibility regardless of lookahead), it uses default reductions which can reduce a rule (and run the associated action) WITHOUT lookahead.
In your case, you can take advantage of that to get the action to run without lookahead if you really need to. The particular rule in question (which triggers the requirement for lookahead) is:
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
in this case, after seeing a variable IS statement, it needs to see the next token to decide if there are more variable declarations in order to know which action (the first or the second) to run. But as the two actions are really the same, you could combine them into a single action:
variables: vardecl | vardecl variables ;
vardecl: variable IS statement {checkTypes($1, $3, 2);}
which would end up using a default reduction as it doesn't need the lookahead to decide between two reductions/actions.
Note that the above depends on being able to find the end of a statement without lookahead, which should be the case as long as all statements end unambiguously with a ;

Related

Expression occurrences in flex/bison

Suppose I have a Bison expression like this:
multiply: T_FIGURE { $$ = $1; }
| multiply T_MULTIPLY multiply { $$ = $1 * $3; }
;
It should return a result of multiplying some figures or give the input back if only one figure provided. If I wanted to limit the number of figures provided to at most 3, I would rewrite the expression like this:
multiply: T_FIGURE { $$ = $1; }
| T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3; }
| T_FIGURE T_MULTIPLY T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3 * $5; }
;
My question: is there a way to rewrite this expression so that I wouldn't have to manually specify the occurrences and instead use some kind of parameter to be able to easily change the upper limit to, for example, 30 occurrences?
In a word, "No". That is not a feature of bison (nor any yacc derivative I know of).
The easiest way to solve problems like this is to use a code generator. Either use an available macro processor like m4 or write your own script in whatever scripting language you feel comfortable with.
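As a rough sketch of the code-generator route, a small C++ program could print the alternatives instead of m4 or a script. `generateRule` is a hypothetical helper; the token names `T_FIGURE` and `T_MULTIPLY` are taken from the question, and the output is meant to be pasted or included into the grammar file by whatever build step you use.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Build the text of a bison rule that accepts between 1 and maxOperands
// figures separated by T_MULTIPLY.
std::string generateRule(int maxOperands) {
    assert(maxOperands >= 1);  // an empty rule would change the grammar
    std::ostringstream out;
    out << "multiply:";
    for (int n = 1; n <= maxOperands; ++n) {
        out << (n == 1 ? " " : "        | ");
        std::ostringstream action;
        for (int i = 0; i < n; ++i) {
            if (i > 0) {
                out << " T_MULTIPLY ";
                action << " * ";
            }
            out << "T_FIGURE";
            action << '$' << (2 * i + 1);  // operands sit at odd positions
        }
        out << " { $$ = " << action.str() << "; }\n";
    }
    out << "        ;\n";
    return out.str();
}
```

Calling `generateRule(3)` yields the three alternatives written by hand in the question; changing the argument to 30 is then a one-character edit, at the cost of a correspondingly larger parser table.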
You could also solve the problem dynamically by counting arguments in your semantic action (which means modifying your semantic type to include both a value and a count). You could then throw a syntax error if the count is exceeded (again, in your semantic action). The main advantage of this approach is that it avoids blowing up the parser's state table. If your limits are large and interact with each other, you might find you are producing a very large state machine.
As a very simple example (with only a single operator):
%{
typedef struct ExprNode {
int count;
double value;
} ExprNode;
%}
%union {
ExprNode expr;
double value;
}
%token <value> T_FIGURE
%type <expr> expr multiply
%%
expr: T_FIGURE { $$.count = 0; $$.value = $1; }
multiply: expr
| multiply '*' expr { if ($1.count >= LIMIT) {
yyerror("Too many products");
YYABORT;
}
else {
$$.count = $1.count + 1;
$$.value = $1.value * $3.value;
}
}

Conditional statement parsing in yacc

I am writing an llvm code generation demo for a certain language which includes if statement. Here are the rules and the actions corresponding to my question:
IfStatement : IF CondExpression THEN Statement {if_Stmt(string($2),string($4));} %prec LOWER_THAN_ELSE ;
| IF CondExpression THEN Statement ELSE Statement {if_else_Stmt(string($2),string($4),string($6));}
;
CondExpression : Expression Relop Expression { $$ = operation($2,string($1),string($3));printf("Relop value : %s \n",$2);}
| Expression {$$ = $1;}
;
Relop : EE {$$ = (char *)(string("icmp eq ").c_str());printf("%s\n",$$);}
| NE {$$ = (char *)(string("icmp ne ").c_str());}
| LT {$$ = (char *)(string("icmp slt ").c_str());}
| GT {$$ = (char *)(string("icmp sgt ").c_str());}
| LTE {$$ = (char *)(string("icmp sle ").c_str());}
| GTE {$$ = (char *)(string("icmp sge ").c_str());}
;
The CondExpression rule should parse the conditional expression. I am using printf to print the value of the Relop symbol, which is of type <char *>. Relop should carry the string built in its action, as shown in the code above. However, the print in CondExpression outputs 0:
Relop value : 0
and the result of the second print inside Relop is correct,
Relop value : icmp eq
Why is the Relop value in CondExpression 0, and how can I make it take the correct value returned from the Relop rule?
Not only is
(char *)(string("icmp ne ").c_str())
an absurdly obfuscated way of writing
"icmp ne"
it also introduces Undefined Behaviour not present in the simple and obvious alternative. The string constructor creates a temporary string, and c_str is then used to return a pointer to the internal storage of that temporary. You then store that pointer on the parser stack and let the temporary be destroyed at the end of the expression, orphaning the stored pointer. So when you attempt to print the string, you are passing a dangling pointer and anything might happen, such as the memory being reused for some other object, leading to a mysterious string being printed.
Of course, if your semantic type is char *, C++ will complain that $$ = "icmp eq"; is not const-safe. It's not immediately obvious to me why you wouldn't use const char * as the semantic type, unless some other part of your code either intends to modify the string or may need to free the memory (because in some cases the string was dynamically allocated). In that case, you could force a copy of the string using, for example, strdup. If your library doesn't provide strdup or you don't want to rely on that, it can easily be defined as something like
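/* Needs <cstdlib> and <cstring>. */
char* strdup(const char* s) {
    size_t len = strlen(s);  /* a parameter cannot be used in a default argument */
    char* r = static_cast<char*>(malloc(len + 1));  /* malloc returns void* */
    memcpy(r, s, len);
    r[len] = 0;
    return r;
}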
Although a more C++-like solution would be to use std::string* as the semantic type, allowing you to write:
$$ = new std::string("icmp eq");
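The lifetime difference can be seen outside of any parser. The following is a minimal sketch; `relopLiteral` and `relopOwned` are hypothetical helper names standing in for what the Relop actions would assign to `$$`.

```cpp
#include <string>

// Safe: a string literal has static storage duration, so the pointer
// stays valid for the whole program. Requires a const char* semantic type.
const char* relopLiteral() {
    return "icmp eq ";
}

// Also safe: the caller owns a heap-allocated std::string and must
// delete it once the value is no longer needed.
std::string* relopOwned() {
    return new std::string("icmp eq ");
}

// Unsafe (shown only as a comment): the temporary std::string is destroyed
// at the end of the full expression, so the stored pointer dangles.
//   const char* bad = std::string("icmp eq ").c_str();
```

With the `std::string*` variant, the action that finally consumes the value (here, the CondExpression action) would be the natural place to delete it.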

OCaml + Menhir: How to parse OCaml like tuple-pattern?

I'm quite new to Menhir.
I'm wondering how to parse OCaml-like tuple patterns in my own language, which is quite similar to OCaml.
For example, in the expression let a,b,c = ...,
a, b, c should be parsed as Tuple (Var "a", Var "b", Var "c").
But with the following parser definition, the above example is parsed as Tuple (Tuple (Var "a", Var "b"), Var "c").
How can I fix the following definition so that it parses patterns the way OCaml does?
I have checked OCaml's parser.mly, but I'm not sure how it achieves this.
I think my definition is similar to OCaml's.
What magic does it use?
%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%left COMMA
%token <string> LIDENT
%token UNDERBAR
%nonassoc below_COMMA
%start <Token.token> toplevel
%%
toplevel:
| p = pattern EOF { p }
pattern:
| p = simple_pattern { p }
| psec = pattern_tuple %prec below_COMMA
{ Ppat_tuple (List.rev psec) }
simple_pattern:
| UNDERBAR { Ppat_any }
| LPAREN RPAREN { Ppat_unit }
| v = lident { Ppat_var v }
| LPAREN p = pattern RPAREN { p }
pattern_tuple:
| seq = pattern_tuple; COMMA; p = pattern { p :: seq }
| p1 = pattern; COMMA; p2 = pattern { [p2; p1] }
lident:
| l = LIDENT { Pident l }
The result is the following:
[~/ocaml/error] menhir --interpret --interpret-show-cst ./parser.mly
File "./parser.mly", line 27, characters 2-42:
Warning: production pattern_tuple -> pattern_tuple COMMA pattern is never reduced.
Warning: in total, 1 productions are never reduced.
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
[pattern:
[pattern_tuple:
[pattern:
[pattern_tuple:
[pattern: [simple_pattern: [lident: LIDENT]]]
COMMA
[pattern: [simple_pattern: [lident: LIDENT]]]
]
]
COMMA
[pattern: [simple_pattern: [lident: LIDENT]]]
]
]
EOF
]
The grammar contains a typical shift-reduce conflict, and you made a mistake when resolving it with precedence declarations. Any book that covers parsing with Yacc explains shift-reduce conflicts and their resolution.
Let's see it using your rules. Suppose we have the following input and the parser is looking ahead at the second ,:
( p1 , p2 , ...
          ↑
          Yacc is looking at this second COMMA token
There are two possibilities:
Reduce: take p1 , p2 as a pattern. This uses the second rule of pattern.
Shift: consume the COMMA token and move the lookahead cursor right, trying to make the comma-separated list longer.
You can see the conflict easily by removing %prec below_COMMA from the pattern rule:
$ menhir z.mly # the %prec thing is removed
...
File "z.mly", line 4, characters 0-9:
Warning: the precedence level assigned to below_COMMA is never useful.
Warning: one state has shift/reduce conflicts.
Warning: one shift/reduce conflict was arbitrarily resolved.
Many Yacc documents say that in this case Yacc prefers shift, and this default usually matches human intention, including in your case. So one solution is simply to drop %prec below_COMMA and ignore the warning.
If you do not like having shift-reduce conflict warnings (that's the spirit!), you can explicitly state which action should be chosen in this case using precedences, just as OCaml's parser.mly does. (BTW, OCaml's parser.mly is a jewel box of shift-reduce resolutions. If you are not familiar with them, you should study one or two.)
Yacc chooses the action with the higher precedence at a shift-reduce conflict. For a shift, the precedence is that of the token at the lookahead cursor, which is COMMA in this case. The precedence of a reduce can be declared with a %prec TOKEN suffix on the corresponding rule. If you do not specify it, I guess the precedence of the rule is undefined, and this is why the shift-reduce conflict is reported if you remove %prec below_COMMA.
So now the question is: which has the higher precedence, COMMA or below_COMMA? This is declared in the preamble of the mly file. (And this is why I asked the questioner to show that part.)
...
%left COMMA
...
%nonassoc below_COMMA
I skip what %left and %nonassoc mean, since any Yacc book explains them. Here the below_COMMA pseudo-token is listed below COMMA, which means below_COMMA has a higher precedence than COMMA. Therefore, the above example chooses Reduce and produces ( (p1, p2), ..., against the intention.
The correct precedence declaration is the opposite. To let Shift happen, below_COMMA must be listed above COMMA so that it has a lower precedence:
...
%nonassoc below_COMMA
%left COMMA
...
See OCaml's parser.mly; it does exactly this. Putting the "below" thing above sounds totally crazy, but it is not Menhir's fault. It is Yacc's unfortunate tradition. Blame it. OCaml's parser.mly already has a comment about this:
The precedences must be listed from low to high.
This is how I would rewrite it:
%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%token LIDENT
%token UNDERBAR
%start toplevel
%%
toplevel:
| p = pattern EOF { p }
pattern:
| p = simple_pattern; tail = pattern_tuple_tail
{ match tail with
| [] -> p
| _ -> Ppat_tuple (p :: tail)
}
pattern_tuple_tail:
| COMMA; p = simple_pattern; seq = pattern_tuple_tail { p :: seq }
| { [] }
simple_pattern:
| UNDERBAR { Ppat_any }
| LPAREN RPAREN { Ppat_unit }
| v = lident { Ppat_var v }
| LPAREN p = pattern RPAREN { p }
lident:
| l = LIDENT { Pident l }
A major change is that a tuple element is no longer a pattern but a simple_pattern. A pattern itself is a comma-separated sequence of one or more elements.
$ menhir --interpret --interpret-show-cst parser.mly
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
[pattern:
[simple_pattern: [lident: LIDENT]]
[pattern_tuple_tail:
COMMA
[simple_pattern: [lident: LIDENT]]
[pattern_tuple_tail:
COMMA
[simple_pattern: [lident: LIDENT]]
[pattern_tuple_tail:]
]
]
]
EOF
]

How to define unrecognized rules in Ocamlyacc

I'm working on a company project where I have to write a compiler for a language using Ocamlyacc and Ocamllex. I would like to know whether it is possible to define a rule in my Ocamlyacc parser that tells me when no rule of my grammar matches the syntax of the input.
I should stress that I am a beginner with Ocamllex/Ocamlyacc.
Thank you very much for your help.
If no rule in your grammar matches the input, then the Parsing.Parse_error exception is raised. Usually, this is what you want.
There is also a special token called error that allows you to resynchronize your parser state. You can use it in your rules as if it were a real token produced by the lexer (compare the eof token).
Also, I would suggest using menhir instead of the more venerable ocamlyacc. It is easier to use and debug, and it also comes with a good library of predefined grammars.
When you write a compiler for a language, the first step is to run your lexer and check that your program is correct from a lexical point of view.
See the example below:
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
It's a lexer to recognize some arithmetic expressions.
If your lexer accepts the input, you then give the sequence of lexemes to the parser, which tries to determine whether an AST can be built with the specified grammar. See:
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
This is a small program that parses arithmetic expressions. A program can be rejected at this step because no rule of the grammar applies, so no AST can be built. There is no way to define "unrecognized" rules; instead, you write a grammar that defines which programs are accepted, and everything else is rejected.
let _ =
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Parser.main Lexer.token lexbuf in
print_int result; print_newline(); flush stdout
done
with Lexer.Eof ->
exit 0
If you compile the lexer, the parser, and the last program, you get:
1 + 2 is accepted because there are no lexical errors and an AST corresponding to this expression can be built.
1 ++ 2 is rejected: there are no lexical errors, but no rule allows such an AST to be built.
You can find more documentation here: http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html

Bison parser calculator

this is my bison code:
%}
%union {
int int_val;
}
%left '+' '-'
%nonassoc '(' ')'
%token INTEGER PRINT
%type <int_val> expr_int INTEGER
%%
program: command '\n' { return 0; }
;
command: print_expr
;
print_expr: PRINT expr_int { cout<<$2<<endl; }
expr_int: expr_int '+' expr_int { $$ = $1 + $3; }
| expr_int '-' expr_int { $$ = $1 - $3; }
| '(' expr_int ')' { $$ = $2; }
| INTEGER
;
and this is the flex code:
%}
INTEGER [1-9][0-9]*|0
BINARY [-+]
WS [ \t]+
BRACKET [\(\)]
%%
print{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
{BINARY}|\n { return *yytext; }
{BRACKET} { return *yytext; }
{WS} {}
. { return *yytext; }
%%
//////////////////////////////////////////////////
int yywrap(void) { return 1; } // Callback at end of file
Valid inputs for the program work as expected:
input:
print 5
output:
5
input:
print (1+1)
output:
2
But for some reason, for the following inputs I do not get an immediate error:
input:
print (1+1))
output:
2
some error
input:
print 5!
output:
5
some error
I would like the error to be printed immediately, without first committing the print command and only then reporting the error.
How should I change the program so that it does not print anything for erroneous inputs?
Download the "flex & bison" book by John Levine or the "bison" manual from gnu. Both contain an infix calculator that you can reference.
The rule you have written, '(' expr_int ')', is reduced to expr_int before the grammatically incorrect ')' in "(1 + 1))" is detected. That is, the parser does:
(1 + 1)) => ( expr_int )) => expr_int)
and then sees the error. In order to capture the error you have to change the parser to see the error before the reduction, and you have to do it for all errors that you want treated. Therefore you would write (in this case):
expr_int '(' expr_int ')' ')' { this is an error message }
The short answer, after the long answer, is that it is impractical to generate a parser containing instances of all possible errors. What you have is fine for what you are doing. What you should explore is how to (gracefully) recover from an error rather than abandoning parsing.
Your "program" and "command" non-terminals can be combined as:
program: print_expr '\n' { return 0; }
On a separate note, your regular expressions can be rewritten to good effect as:
%}
INTEGER [0-9]+
WS [ \t]+
%%
print/{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
"(" { return '('; }
")" { return ')'; }
"+" { return '+'; }
"-" { return '-'; }
{WS} {}
\n { return '\n'; }
. { return *yytext; } // do you really want to do this?
%%
Create an end-of-line token (e.g. ;) for your language and reduce each line to a statement only at the point where this end-of-line token is encountered.
Well that is because you are executing the code while parsing it. The good old bison calculator is meant to teach you how to write a grammar, not implement a full compiler/interpreter.
The normal way to build compiler/interpreter is the following:
lexer -> parser -> semantic analyser -> code generator -> interpreter
Granted, building a full-fledged compiler may be overkill in your case. What you need to do is store the result somewhere and only output it after yyparse has returned without an error.
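One way to sketch the "store the result and print later" idea in C++ is shown below. This is an assumption-laden illustration: `pending` and `runAndCollect` are hypothetical names, and the `parse` parameter stands in for yyparse, which returns 0 on success. The print_expr action would write to `pending` instead of cout.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Buffer that semantic actions write to instead of std::cout,
// e.g. `pending << $2 << '\n';` in the print_expr action.
std::ostringstream pending;

// Run a parser and release the buffered text only if it reports success.
std::string runAndCollect(const std::function<int()>& parse) {
    pending.str("");   // discard anything left over from a previous run
    pending.clear();
    if (parse() != 0) {
        return "";     // syntax error: suppress the partial output
    }
    return pending.str();
}
```

In the real program the call would presumably look like `std::cout << runAndCollect(yyparse);`, so that input such as print (1+1)) produces only the error message and no 2.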
In yacc/bison, the code associated with a semantic action is executed as soon as the rule in question is reduced, which may happen immediately after the tokens for the rule have been shifted, before looking at any following context to see whether there's an error or not (a so-called "default" reduction, used to make the parse tables smaller).
If you want to avoid printing an answer until an entire line is read (and recognized), you need to include the newline in the rule that has the action that prints the message. In your case, you can move the newline from the program rule to the print_expr rule:
program: command { return 0; } ;
print_expr: PRINT expr_int '\n' { cout<<$2<<endl; }
Of course, this will still give you an error (after printing output) if you give it multiple lines of input.