I'm writing an ansi-C parser in C++ with flex and bison; it's pretty complex.
The issue I'm having is a compilation error, shown below. It occurs because yyterminate() returns YY_NULL, which is defined as (an int) 0, while yylex has the return type yy::AnsiCParser::symbol_type. yyterminate(); is the automatic action for the <<EOF>> rule in scanners generated by flex, so obviously this causes a type mismatch.
My scanner doesn't produce any special token for the EOF, because EOF has no purpose in a C grammar. I could create a token rule for <<EOF>>, but if I simply ignore the EOF the scanner hangs in an infinite loop inside yylex on the YY_STATE_EOF(INITIAL) case.
The compilation error,
ansi-c.yy.cc: In function ‘yy::AnsiCParser::symbol_type yylex(AnsiCDriver&)’:
ansi-c.yy.cc:145:17: error: could not convert ‘0’ from ‘int’ to ‘yy::AnsiCParser::symbol_type {aka yy::AnsiCParser::basic_symbol<yy::AnsiCParser::by_type>}’
ansi-c.yy.cc:938:30: note: in expansion of macro ‘YY_NULL’
ansi-c.yy.cc:1583:2: note: in expansion of macro ‘yyterminate’
Also, Bison generates this rule for my start-rule (translation_unit) and the EOF ($end).
$accept: translation_unit $end
So yylex has to return something for the EOF or the parser will never stop waiting for input, but my grammar cannot support an EOF token. Is there a way to make Bison recognize something other than 0 for the $end condition without modifying my grammar?
Alternatively, is there simply something I can return from the <<EOF>> token in the scanner to satisfy the Bison $end condition?
Normally, you would not include an explicit EOF rule in a lexical analyzer, not because it serves no purpose, but rather because the default is precisely what you want to do. (The purpose it serves is to indicate that the input is complete; otherwise, the parser would accept the valid prefix of certain invalid programs.)
Unfortunately, the C++ interfaces can defeat the simple convenience of the default EOF action, which is to return 0 (or NULL). I assume from your problem description that you have asked bison to generate a parser using complete symbols. In that case, you cannot simply return 0 from yylex, since the parser expects a complete symbol, which is a more complex type than an int. (Although the token which reports EOF does not normally have a semantic value, it does have a location, if you are using locations.) For the other token types, bison automatically generates a function that builds the complete symbol, named something like make_FOO_TOKEN, which you call in your scanner action for a FOO_TOKEN.
While the C bison parser does automatically define the end of file token (called END), it appears that the C++ interface does not. So you need to manually define it in your %token declaration in your bison input file:
%token END 0 "end of file"
(That defines the token type END with an integer value of 0 and the human-readable label "end of file". The value 0 is obligatory.)
Once you've done that, you can add an explicit EOF rule in your flex input file:
<<EOF>> return make_END();
If you are using locations, you'll have to give make_END a location argument as well.
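For example, assuming the parser class from the question (yy::AnsiCParser) and a scanner-maintained location object named loc (both names are whatever your own project uses), the EOF rule might look roughly like:
<<EOF>> return yy::AnsiCParser::make_END(loc);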
Here's another way to prevent the compiler error could not convert '0' from 'int' to ...symbol_type: place this redefinition of the yyterminate macro just below where you redefine YY_DECL:
// change curLocation to the name of the location object used in yylex
// qualify symbol_type with the bison namespace used
#define yyterminate() return symbol_type(YY_NULL, curLocation)
The compiler error shows up when bison locations are enabled (e.g. with %locations). Locations make bison add a location parameter to its symbol_type constructors, so the constructor without locations,
symbol_type(int tok)
turns into this with locations
symbol_type(int tok, location_type l)
making it no longer possible to convert an int to a symbol_type, which is exactly what the default definition of yyterminate in flex relies on when bison locations are not enabled:
#define yyterminate() return YY_NULL
With this workaround there's no need to handle EOF explicitly in flex, and no need for a superfluous END token in bison if your grammar doesn't need one.
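That said, for reference, here is a minimal sketch of how the two redefinitions might sit together in the definitions section of the .ll file; the driver type (AnsiCDriver) and the location variable (curLocation) are placeholders for whatever your scanner actually declares:
%{
/* sketch only: adjust the driver type and the location variable to your project */
#define YY_DECL yy::AnsiCParser::symbol_type yylex(AnsiCDriver& driver)
#define yyterminate() return yy::AnsiCParser::symbol_type(YY_NULL, curLocation)
%}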
Related
I'm using the calc++ example found in the bison documentation as a starting point for a more complex grammar. One thing I haven't been able to figure out is how to return a character (literal) token from flex to bison.
In pure C examples, I've seen flex simply returning the token as:
"+" { count(); return('+'); }
The calc++ example simply uses token symbols:
"+" return yy::parser::make_PLUS (loc);
But this forces me to use PLUS instead of '+' in the grammar file.
How can I get flex to return a literal value as in the C example when generating C++ code?
Do not define a named token for it at all. Return the character literal itself from the scanner and you will be able to use it in the parser as '+'.
If you use "complete symbols" (that is, %define api.token.constructor), you should be able to use the appropriate parser::symbol_type constructor, as shown in the bison manual section on "complete symbols":
":" return yy::parser::symbol_type (':', loc);
Suppose I have written a C program and I have written print instead of printf.
Now my question is: which part of the compiler will detect this?
I'm assuming OP means which part of the compiler internally, such as the lexer, parser, type analyzer, name analyzer, code generator, etc.
Without knowing specifically about gcc/llvm, I would assume that it's the Name Analyzer (more specifically, this is generally part of the "Semantic Analyzer", which also does Type Analysis), as that is the phase that fails to match "print" to anything that exists name-wise. This is the same check that prevents things such as:
x = 5;
When x does not exist previously.
Strictly speaking, assume that print will be represented by a token of the form:
{ token type = Identifier, token value = 'print' }
This transformation from source characters into tokens is done by the lexical analyzer. Let's say you have a function get_token that reads characters from the source file and returns a token (in the form of the structure above). The source file can then be viewed as a sequence of such tokens.
To do the higher-level job we call lower-level routines, so assume now that you have a function parse_declaration that uses get_token. parse_declaration is responsible for recognizing a declaration in your program (this is done with a parsing algorithm, e.g. recursive descent). If a declaration is recognized, it saves the token value in the symbol table, along with type information and attributes.
Now, assume you have a function parse_expression. It calls get_token, and if the token type is Identifier it performs a name lookup, i.e. it searches for the token value in the symbol table. If the search is unsuccessful it prints an error message (something like "token value : undeclared identifier").
Of course this picture is simplified. In practice there is rather sophisticated logic for lexical analysis, parsing, and semantics (how the language 'behaves'; name lookup is part of language semantics), and these parts should be kept as independent of one another as possible.
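To make that flow concrete, here is a small sketch of the lookup step described above; get_token, the token types, and the symbol table are all assumed helpers, not part of any real compiler:
#include <iostream>
#include <set>
#include <string>

enum TokenType { Identifier, Number /* , ... */ };
struct Token { TokenType type; std::string value; };

Token get_token();                    // assumed lexer routine: returns the next token
std::set<std::string> symbol_table;   // names recorded earlier by parse_declaration

void parse_expression() {
    Token tok = get_token();
    if (tok.type == Identifier && symbol_table.count(tok.value) == 0)
        std::cerr << tok.value << " : undeclared identifier\n";   // e.g. "print"
    // ... continue parsing the rest of the expression ...
}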
I have a complicated Yacc file with a bunch of rules, some of them complicated, for example:
start: program
program: extern_list class
class: T_CLASS T_ID T_LCB field_dec_list method_dec_list T_RCB
The exact rules and the actions I take on them are not important, because what I want to do seems fairly simple: just print out the program as it appears in the source file, using the rules I define for other purposes. But I'm surprised at how difficult doing so is.
First I tried adding printf("%s%s", $1, $2) to the second rule above. This produced "��#P�#". From what I understand, the parsed text is also available as a variable, yytext. I added printf("%s", yytext) to every rule in the file and added extern char* yytext; to the top of the file. This produced (null){void)1133331122222210101010--552222202020202222;;;;||||&&&&;;;;;;;;;;}}}}}}}} from a valid file according to the language's syntax. Finally, I changed extern char* yytext; to extern char yytext[], thinking it would not make a difference. The difference in output it made is best shown as a screenshot
I am using Bison 3.0.2 on Xubuntu 14.04.
If you just want to echo the source to some output while parsing it, it is easiest to do that in the lexer. You don't say what you are using as a lexer, but you mention yytext, which is used by lex/flex, so I will assume that.
When you use flex to recognize tokens, the variable yytext refers to the internal buffer flex uses to recognize tokens. Within the action of a token, it can be used to get the text of the token, but only temporarily -- once the action completes and the next token is read, it will no longer be valid.
So if you have a flex rule like:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = yytext; return T_ID; }
that likely won't work at all, as you'll have dangling pointers running around in your program; probably the source of the random-looking outputs you're seeing. Instead you need to make a copy. If you also want to output the input unchanged, you can do that here too:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = strdup(yytext); ECHO; return T_ID; }
This uses the flex macro ECHO, which is roughly equivalent to fputs(yytext, yyout) -- copying the input to a FILE * called yyout (which defaults to stdout).
If the first symbol in the corresponding right-hand side is a terminal, $1 in a bison action means "the value of yylval produced by the scanner when it returned the token corresponding to that terminal". If the symbol is a non-terminal, then it refers to the value assigned to $$ during the evaluation of the action which reduced that non-terminal. If there was no such action, then the default $$ = $1 will have been performed, so it passes through the semantic value of the first symbol in the reduction of that non-terminal.
I apologize if all that was obvious, but your snippet is not sufficient to show:
what the semantic types are for each non-terminal;
what the semantic types are for each terminal;
what values, if any, are assigned to yylval in the scanner actions;
what values, if any, are assigned to $$ in the bison actions.
If any of those semantic types are not, in fact, character strings, then the printf will obviously produce garbage. (gcc might be able to warn you about this, if you compile the generated code with -Wall. Despite the possibility of spurious warnings if you are using old versions of flex/bison, I think it is always worthwhile compiling with -Wall and carefully reading the resulting warnings.)
Using yytext in a bison action is problematic, since it will refer to the text of the last token scanned, typically the look-ahead token. In particular, at the end of the input, yytext will be NULL, and that is what you will pick up in any reductions which occur at the end of input. glibc's printf implementation is nice enough to print (null) instead of segfaulting when you provide (char*)0 for an argument formatted as %s, but I don't think it's a great idea to depend on that.
Finally, if you do have a char* semantic value, and you assign yylval = yytext (or yylval.sval = yytext; if you are using unions), then you will run into another problem, which is that yytext points into a temporary buffer owned by the scanner, and that buffer may have completely different contents by the time you get around to using the address. So you always need to make a copy of yytext if you want to pass it through to the parser.
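As a hedged sketch of how the pieces fit together when the semantic value really is a C string (the token and rule names are illustrative, and the %{ %} sections with the usual #include lines are omitted):
/* bison: declare a string-valued token */
%union { char *str; }
%token <str> T_ID

/* flex: hand the parser its own copy of the lexeme */
[a-zA-Z_][a-zA-Z_0-9]*   { yylval.str = strdup(yytext); return T_ID; }

/* bison action: $1 is now a char* the parser owns */
name: T_ID   { printf("%s\n", $1); free($1); } ;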
If what you really want to do is see what the parser is doing, I suggest you enable bison's yydebug parser-trace feature. It will give you a lot of useful information, without requiring you to insert printf's into your bison actions at all.
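If you go that route, a rough sketch of enabling the trace with the C interface of bison 3.x looks like this (put the directive in your .y file and flip yydebug before parsing):
/* in the .y file */
%define parse.trace

/* in the file containing main() */
int yyparse(void);      /* generated by bison */
extern int yydebug;     /* defined by the generated parser when tracing is enabled */

int main(void) {
    yydebug = 1;        /* print every shift, reduce and error-recovery step */
    return yyparse();
}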
imagine this grammar:
declaration
: declaration_specifiers ';' { /* allocate AST Node and return (1) */}
| declaration_specifiers init_declarator_list ';' { /* allocate AST Node and return (2)*/}
;
init_declarator_list
: init_declarator { /* alloc AST Node and return (3) */}
| init_declarator_list ',' init_declarator { /* allocate AST Node and return (4) */}
;
Now imagine there is an error at the ',' token. So far we have:
declaration -> declaration_specifiers init_declarator_list -> init_declarator_list ',' /*error*/
What happens here?
Does bison execute the code in (4)? And in (2)? If bison does not execute (4) but does execute (2), what is the value of $3? How can I set a default value for the $ variables?
How can I properly delete the AST nodes I have already built when an error occurs?
bison only executes an action when the action's production is reduced, which means that it must have exactly matched the input, unless it is an error production in which case a relaxed matching form is used. (See below.) So you can be assured that if an action is performed, then the various semantic values associated with its terminals and non-terminals are the result of the lexer or their respective actions.
During error recovery, however, bison will automatically discard semantic values from the stack. With reasonably recent bison versions, you can specify an action to be performed when a value is discarded using the %destructor declaration. (See the bison manual for details.) You can specify a destructor either by type or by symbol (or both, but the per-symbol destructor takes precedence.)
The %destructor action will be run whenever bison discards a semantic value. Roughly speaking, discarding a semantic value means that your program never had a chance to deal with the semantic value. It does not apply to values popped off the stack when a production is reduced, even if there is no explicit action associated with the reduction. A complete definition of "discarded" is at the end of the bison manual section cited earlier.
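A minimal sketch, assuming your AST nodes are released by a function of your own (free_ast here is just a placeholder name):
/* run whenever a semantic value of these nonterminals is discarded during error recovery */
%destructor { free_ast($$); } declaration init_declarator_list init_declarator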
Without error productions, there is really not much possible in the way of error recovery other than discarding the entire stack and any lookahead symbols (which bison will do automatically) and then terminating the parse. You can do a bit better by adding error productions to your grammar. An error production includes the special token error; this token matches an empty sequence precisely in the case that there is no other possible match. Unlike normal productions, error productions do not need to be immediately visible; bison will discard states (and corresponding values) from the stack until it finds a state with an error transition, or it reaches the end of the stack. Also, the terminal following error in the error production does not need to be the lookahead token; bison will discard lookahead tokens (and corresponding values) until it is able to continue with the error production (or it reaches the end of the input). See the handy manual for a longer description of the process (or read about it in the Dragon book, if you have a copy nearby).
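As a hedged sketch, an error production for the grammar in the question might look like this, letting the parser resynchronize at the next ';' instead of giving up:
declaration
    : declaration_specifiers ';'                        { /* (1) */ }
    | declaration_specifiers init_declarator_list ';'   { /* (2) */ }
    | error ';'   { /* bad declaration: discard it and continue after the ';' */ }
    ;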
There are several questions here.
Bison detects an error by being in a parse state in which there is no action (shift or reduce) for the current lookahead token. In your example that would be in the state after shifting the ',' in init_declarator_list. In that state, only tokens in FIRST(init_declarator) will be valid, so any other token will cause an error.
Actions in the bison code are executed when the corresponding rule is reduced, so action (4) will never be called -- the parse never got far enough to reduce that rule. Action (3) ran when its rule was reduced, which happened before the ',' was shifted to reach the state where the error was detected.
After hitting an error (and calling yyerror with an error message), the parser will attempt to recover by popping states off the stack, looking for one in which the special error token can be shifted. As it pops and discards states, it will call the %destructor action for the symbols corresponding to those states, so you can use that to clean things up (free memory) if needed.
In your case, it looks like there are no error rules, so there are no states in which an error token can be shifted. The parser will therefore pop all states and then return failure from yyparse. If it does find a state that can shift an error, it stops popping there, shifts the error token, and attempts to continue parsing in error-recovery mode. While in error-recovery mode, it counts how many tokens (other than the error token) it has shifted since the last error. If it has shifted fewer than 3 tokens before hitting another error, it will not call yyerror for the new error. In addition, if it has shifted 0 tokens, it will try to recover by reading and throwing away input tokens (instead of popping states) until it finds one that can be handled by the current state. As it discards tokens, it calls the %destructor for those tokens, so again you can clean up anything that needs cleaning.
So, to answer your last question, you can use a %destructor declaration to delete things when an error occurs. The %destructor is called exactly once for each item that is discarded without being passed to a bison action. Items that are passed to actions (as $1, $2, ... in the action) will never have the %destructor called for them, so if you don't need them after the action, you should delete them there.
I am getting the following error message
error: '0' cannot be used as a function
when trying to compile the following line:
NOOP(0 != width);
NOOP is defined as follows:
#define NOOP (void)0
The source code is part of an SDK, so it should be okay. And I have found out that (void)0 actually is a valid way to describe "no operation" in C++. But why would you want to pass a boolean parameter to a function which does nothing? And how do you get rid of the error message?
The macro is not defined with any parameters, so after the preprocessor substitutes it, that statement ends up looking like this:
(void)0(0 != width);
This confuses the compiler into thinking you are trying to apply the "()" operator to 0 (i.e. use 0 as a function).
I recommend that you drop the "(0 != width)" (it is misleading) and just write NOOP;
"(void)0(0!=width);" is not valid C++, so it's not OK. (void)0; by itself doesn't do anything in C++, so can be used as a noop. Instead of your current define, I would use:
#define NOOP(X) (void)0
This tells the C++ preprocessor that there is a function-like macro called NOOP that takes one argument, and that each invocation should be replaced with (void)0. So if you have a line of code that says NOOP("HELLO WORLD"), the preprocessor replaces the entire thing with (void)0, which the C++ compiler proceeds to ignore.
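For illustration, here is what the original call site becomes with that definition; note that the argument expression is discarded entirely and never evaluated:
#define NOOP(X) (void)0

int width = 10;
NOOP(0 != width);   /* expands to (void)0; -- compiles cleanly and does nothing */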