Conditional statement parsing in yacc - llvm

I am writing an llvm code generation demo for a certain language which includes if statement. Here are the rules and the actions corresponding to my question:
IfStatement : IF CondExpression THEN Statement {if_Stmt(string($2),string($4));} %prec LOWER_THAN_ELSE ;
| IF CondExpression THEN Statement ELSE Statement {if_else_Stmt(string($2),string($4),string($6));}
;
CondExpression : Expression Relop Expression { $$ = operation($2,string($1),string($3));printf("Relop value : %s \n",$2);}
| Expression {$$ = $1;}
;
Relop : EE {$$ = (char *)(string("icmp eq ").c_str());printf("%s\n",$$);}
| NE {$$ = (char *)(string("icmp ne ").c_str());}
| LT {$$ = (char *)(string("icmp slt ").c_str());}
| GT {$$ = (char *)(string("icmp sgt ").c_str());}
| LTE {$$ = (char *)(string("icmp sle ").c_str());}
| GTE {$$ = (char *)(string("icmp sge ").c_str());}
;
The CondExpression rule should parse the conditional expression. I am using print function to print the value of Relop token which is of type < char * >. The Relop should have the value of the conditional tokens inside the string function as shown above in the code. However, the result of the print function is 0
Relop value : 0
and the result of the second print inside Relop is correct,
Relop value : icmp eq
why the Relop value in the CondExpression is 0 and how to make it take the correct value returned from Relop rule.

Not only is
(char *)(string("icmp ne ").c_str()
an absurdly obfuscated way of writing
"icmp ne"
it also introduces Undefined Behaviour not present in the simple and obvious alternative. The string constructor creates and returns a temporary string, and c_str is then used to return a pointer to internal storage of that temporary. You then store that pointer into the parser stack and let the temporary be deconstructed, orphaning the stored pointer. So when you attempt to print the string, you are passing a dangling pointer and anything might happen, such as the memory being reused for some other object leading to a mysterious string being printed.
Of course, if your semantic type is char *, C++ will complain that $$ = "icmp eq"; is not const-safe. It's not immediately obvious to me why you wouldn't use char *const as the semantic type, unless some other part of your code either intends to modify the string or may need to free the memory (because in some cases the string was dynamically allocated). In that case, you could force a copy of the string using, for example, strdup. If your library doesn't provide strdup or you don't want to rely on that, it can easily be defined as something like
char* strdup(const char* s, size_t len=strlen(s)) {
char* r = malloc(len + 1);
memcpy(r, s, len);
r[len] = 0;
return r;
}
Although a more C++-like solution would be to use std::string* as the semantic type, allowing you to write:
$$ = new std::string("icmp eq");

Related

Expression occurrences in flex/bison

Suppose I have a Bison expression like this:
multiply: T_FIGURE { $$ = $1; }
| multiply T_ASTERISK multiply { $$ = $1 * $3; }
;
It should return a result of multiplying some figures or give the input back if only one figure provided. If I wanted to limit the number of figures provided to at most 3, I would rewrite the expression like this:
multiply: T_FIGURE { $$ = $1; }
| T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3; }
| T_FIGURE T_MULTIPLY T_FIGURE T_MULTIPLY T_FIGURE { $$ = $1 * $3 * $5; }
;
My question: is there a way to rewrite this expression so that I wouldn't have to manually specify the occurrences and instead use some kind of parameter to be able to easily change the upper limit to, for example, 30 occurrences?
In a word, "No". That is not a feature of bison (nor any yacc derivative I know of).
The easiest way to solve problems like this is to use a code generator. Either use an available macro processor like m4 or write your own script in whatever scripting language you feel comfortable with.
You could also solve the problem dynamically by counting arguments in your semantic action (which means modifying your semantic type to include both a value and a count.) You could then throw a syntax error if the count is exceeded. (Again, in your semantic action.) The main advantage of this approach is that avoids blowing up the parser's state table. If your limits are large and interact with each other, you might find you are producing a very large state machine.
As a very simple example (with only a single operator):
%{
typedef struct ExprNode {
int count;
double value;
} ExprNode;
%}
%union {
ExprNode expr;
double value;
}
%token <value> T_FIGURE
%type <expr> expr multiply
%%
expr: T_FIGURE { $$.count = 0; $$.value = $1; }
multiply: expr
| multiply '*' expr { if ($1.count >= LIMIT) {
yyerror("Too many products");
YYABORT;
}
else {
$$.count = $1.count + 1;
$$.value = $1.value * $3.value;
}
}

how to elegantly handle rule with multiple components in bison

I use to program in ocaml and use ocalmyacc to generate parser. One very useful feather of ocaml is its variant type like this:
type exp = Number of int
| Addexp of exp*exp
with such a type, I can construct an AST data structure very elegantly in the parser to represent an exp like this:
exp :
number {Number($1)}
| exp1 + exp2 {Addexp($1,$3)}
So if there exist similar mechanism in C++ and bison?
Yes, just match against exp + exp. Notice that for a given rule all its actions must have the same declared %type assigned to $$. In your case it would look something like this:
exp: number { $$ = PrimaryExp($1); }
| exp '+' exp { $$ = AddExp($1, $2); }

Yacc grammar producing incorrect terminal

I've been working on a hobby compiler for a while now, using lex and yacc for the parsing stage. This is all working fine for the majority of things, but when I added in if statements, the production rule for symbols is now giving the previous (or next?) item on the stack instead of the symbol value needed.
Grammar is given below with hopefully unrelated rules taken out:
%{
...
%}
%define parse.error verbose
%token ...
%%
Program:
Function { root->addChild($1);}
;
Function:
Type Identifier '|' ArgumentList '|' StatementList END
{ $$ = new FunctionDef($1, $2, $4, $6); }
/******************************************/
/* Statements and control flow ************/
/******************************************/
Statement:
Expression Delimiter
| VariableDeclaration Delimiter
| ControlFlowStatement Delimiter
| Delimiter
;
ControlFlowStatement:
IfStatement
;
IfStatement:
IF Expression StatementList END { $$ = new IfStatement($2, $3); }
| IF Expression StatementList ELSE StatementList END { $$ = new IfStatement($2, $3, $5);}
;
VariableDeclaration:
Type Identifier { $$ = new VariableDeclaration($1, $2);}
| Type Identifier EQUALS Expression { $$ = new VariableDeclaration($1, $2, $4);}
;
StatementList:
StatementList Statement { $1->addChild($2); }
| Statement { $$ = new GenericList($1); }
;
Delimiter:
';'
| NEWLINE
;
Type:
...
Expression:
...
PostfixExpression:
Value '[' Expression ']' { std::cout << "TODO: indexing operators ([ ])" << std::endl;}
| Value '.' SYMBOL { std::cout << "TODO: member access" << std::endl;}
| Value INCREMENT { $$ = new UnaryExpression(UNARY_POSTINC, $1); }
| Value DECREMENT { $$ = new UnaryExpression(UNARY_POSTDEC, $1); }
| Value '(' ')' { $$ = new FunctionCall($1, NULL); }
| Value '(' ExpressionList ')' { $$ = new FunctionCall($1, $3); }
| Value
;
Value:
BININT { $$ = new Integer(yytext, 2); }
| HEXINT { $$ = new Integer(yytext, 16); }
| DECINT { $$ = new Integer(yytext); }
| FLOAT { $$ = new Float(yytext); }
| SYMBOL { $$ = new Symbol(yytext); }
| STRING { $$ = new String(yytext); }
| LambdaFunction
| '(' Expression ')' { $$ = $2; }
| '[' ExpressionList ']' { $$ = $2;}
;
LambdaFunction:
...
%%
I cannot work out what about the control flow code can make the Symbol:
rule match something that isn't classed as a symbol from the lex definition:
symbol [a-zA-Z_]+(alpha|digit)*
...
{symbol} {return SYMBOL;}
Any help from somebody who knows about yacc and grammars in general would be very much appreciated. Also example files of the syntax it parses can be shown if necessary.
Thanks!
You cannot count on the value of yytext outside of a flex action.
Bison grammars typically read a lookahead token before deciding on how to proceed, so in a bison action, yytext has already been replaced with the token value of the lookahead token. (You can't count on that either, though: sometimes no lookahead token is needed.)
So you need to make a copy of yytext before the flex action returns and make that copy available to the bison grammar by placing it into the yylval semantic union.
See this bison FAQ entry
By the way, the following snippet from your flex file is incorrect:
symbol [a-zA-Z_]+(alpha|digit)*
In that regular expression, alpha and digit are just ordinary strings, so it is the same as [a-zA-Z_]+("alpha"|"digit")*, which means that it will match, for example, a_digitdigitdigit but not a_123. (It would have matched a_digitdigitdigit without the part following the +, so I presume that wasn't your intention.)
On the whole, I think it's better to use Posix character classes than either hand-written character classes or defined symbols, so I would write that as
symbol [[:alpha:]_]([[:alnum:]_]*[[:alnum:]])?
assuming that your intention is that a symbol can start but not end with an underscore, and end but not start with a digit. Using Posix character classes requires you to execute flex with the correct locale -- almost certainly the C locale -- but so do character ranges, so there is nothing to be lost by using the self-documenting Posix classes.
(Of course, I have no idea what your definitions of {alpha} and {digit} are, but it seems to me that they are either the same as [[:alpha:]] and [[:digit:]], in which case they are redundant, or different from the Posix classes, in which case they are confusing to the reader.)

Bison outputting string after the wrong line

The input
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
is yielding the output
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
Narrowing Variable Initialization
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
Instead of placing that error message under line 4, which is where the error actually occurs. I've looked at it for hours and can't figure it out.
%union
{
char* ident;
Types types;
}
%token <ident> IDENTIFIER
%token <types> INTEGER_LITERAL
%token <types> REAL_LITERAL
%token BEGIN_
%token FUNCTION
%token IS
%token <types> INTEGER
%token <types> REAL
%token RETURNS
%type <types> expression
%type <types> factor
%type <types> literal
%type <types> term
%type <types> statement
%type <types> type
%type <types> variable
%%
program:
/* empty */ |
functions ;
functions:
function_header_recovery body ; |
function_header_recovery body functions ;
function_header_recovery:
function_header ';' |
error ';' ;
function_header:
FUNCTION {locals = new Locals();} IDENTIFIER optional_parameters RETURNS type {globals->insert($3,locals->tList);} ;
optional_parameters:
/* empty */ |
parameters;
parameters:
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } |
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } "," parameters;
type:
INTEGER | REAL ;
body:
optional_variables BEGIN_ statement END ';' ;
optional_variables:
/* empty */ |
variables ;
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
variable:
IDENTIFIER ':' type {locals->insert($1, $3);} {$$ = $3;} ;
statement:
expression ';' |
...
Types checkTypes(Types left, Types right, int flag)
{
if (left == right)
{
return left;
}
if (flag == 1)
{
Listing::appendError("Conditional Expression Type Mismatch", Listing::SEMANTIC);
}
else if (flag == 2)
{
if (left < right)
{
Listing::appendError("Narrowing Variable Initialization", Listing::SEMANTIC);
}
}
return REAL_TYPE;
}
printing being handled by:
void Listing::nextLine()
{
printf("\n");
if (error == "")
{
lineNo++;
printf("%4d%s",lineNo," ");
}
else
{
printf("%s", error.c_str());
error = "";
nextLine();
}
}
void Listing::appendError(const char* errText, int errEnum)
{
error = error + errText;
if (errEnum == 997)
{
lexErrCount++;
}
else if (errEnum == 998)
{
synErrCount++;
}
else if (errEnum == 999)
{
semErrCount++;
}
}
void Listing::display()
{
printf( "\b\b\b\b\b\b " );
if (lexErrCount + synErrCount + semErrCount > 0)
{
printf("\n\n%s%d","Lexical Errors ",lexErrCount);
printf("\n%s%d","Syntax Errors ",synErrCount);
printf("\n%s%d\n","Semantic Errors ",semErrCount);
}
else
{
printf("\nCompiled Successfully\n");
}
}
That's just the way bison works. It produces a one-token lookahead parser, so your production actions don't get triggered until it has read the token following the production. Consequently, begin must be read before the action associated with variables happens. (bison never tries to combine actions, even if they are textually identical. So it really cannot know which variables production applies and which action to execute until it sees the following token.)
There are various ways to associate a line number and/or column position with each token, and to use that information when an error message is to be produced. Interspersing the errors and/or warnings with the input text, in general, requires buffering the input; for syntax errors, you only need to buffer until the next token but that is not a general solution; in some cases, for example, you may want to associate an error with an operator, for example, but the error won't be detected until the operator's trailing argument has been parsed.
A simple technique to correctly intersperse errors/warnings with source is to write all the errors/warnings to a temporary file, putting the file offset at the front of each error. This file can then be sorted, and the input can then be reread, inserting the error messages at appropriate points. The nice thing about this strategy is that it avoids having to maintain line numbers for each error, which noticeably slows down lexical analysis. Of course, it won't work so easily if you allow constructs like C's #include.
Because generating good error messages is hard, and even tracking locations can slow parsing down quite a bit, I've sometimes used the strategy of parsing input twice if an error is detected. The first parse only detects errors and fails early if it can't do anything more reasonable; if an error is detected, the input is reparsed with a more elaborate parser which carefully tracks file locations and possibly even uses heuristics like indentation depth to try to produce better error messages.
As rici notes, bison produces an LALR(1) parser, so it uses one token of lookahead to know what action to take. However, it doesn't ALWAYS use a token of lookahead -- in some cases (where there's only one possibility regardless of lookahead), it uses default reductions which can reduce a rule (and run the associated action) WITHOUT lookahead.
In your case, you can take advantage of that to get the action to run without lookahead if you really need to. The particular rule in question (which triggers the requirement for lookahead) is:
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
in this case, after seeing a variable IS statement, it needs to see the next token to decide if there are more variable declarations in order to know which action (the first or the second) to run. But as the two actions are really the same, you could combine them into a single action:
variables: vardecl | vardecl variables ;
vardecl: variable IS statement {checkTypes($1, $3, 2);}
which would end up using a default reduction as it doesn't need the lookahead to decide between two reductions/actions.
Note that the above depends on being able to find the end of a statement without lookahead, which should be the case as long as all statements end unambiguously with a ;

What is $$ in bison?

In the bison manual in section 2.1.2 Grammar Rules for rpcalc, it is written that:
In each action, the pseudo-variable $$ stands for the semantic value
for the grouping that the rule is going to construct. Assigning a
value to $$ is the main job of most actions
Does that mean $$ is used for holding the result from a rule? like:
exp exp '+' { $$ = $1 + $2; }
And what's the typical usage of $$ after begin assigned to?
Yes, $$ is used to hold the result of the rule. After being assigned to, it typically becomes a $x in some higher-level (or lower precedence) rule.
Consider (for example) input like 2 * 3 + 4. Assuming you follow the normal precedence rules, you'd have an action something like: { $$ = $1 * $3; }. In this case, that would be used for the 2 * 3 part and, obviously enough, assign 6 to $$. Then you'd have your { $$ = $1 + $3; } to handle the addition. For this action, $1 would be given the value 6 that you assigned to $$ in the multiplication rule.
Does that mean $$ is used for holding the result from a rule? like:
Yes.
And what's the typical usage of $$ after begin assigned to?
Typically you won’t need that value again. Bison uses it internally to propagate the value. In your example, $1 and $2 are the respective semantic values of the two exp productions, that is, their values were set somewhere in the semantic rule for exp by setting its $$ variable.
Try this. Create a YACC file with:
%token NUMBER
%%
exp: exp '+' NUMBER { $$ = $1 + $3; }
| exp '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
Then process it using Bison or YACC. I am using Bison but I assume YACC is the same. Then just find the "#line" directives. Let us find the "#line 3" directive; it and the relevant code will look like:
#line 3 "DollarDollar.y"
{ (yyval) = (yyvsp[(1) - (3)]) + (yyvsp[(3) - (3)]); }
break;
And then we can quickly see that "$$" expands to "yyval". That other stuff, such as "yyvsp", is not so obvious but at least "yyval" is.
$$ represents the result reference of the current expression's evaluation. In other word, its result.Therefore, there's no particular usage after its assignation.
Bye !