What happens on error - Bison - c++

Imagine this grammar:
declaration
: declaration_specifiers ';' { /* allocate AST Node and return (1) */}
| declaration_specifiers init_declarator_list ';' { /* allocate AST Node and return (2)*/}
;
init_declarator_list
: init_declarator { /* alloc AST Node and return (3) */}
| init_declarator_list ',' init_declarator { /* allocate AST Node and return (4) */}
;
Now imagine there is an error at the ',' token. So far we have:
declaration -> declaration_specifiers init_declarator_list -> init_declarator_list ',' /*error*/
What happens here?
What happens here?
Does bison execute the code in (4)? And in (2)? If bison does not execute (4) but does execute (2), what is the value of $3? How can I set a default value for the $ variables?
How can I delete my AST generated on error properly?

bison only executes an action when the action's production is reduced, which means that it must have exactly matched the input, unless it is an error production in which case a relaxed matching form is used. (See below.) So you can be assured that if an action is performed, then the various semantic values associated with its terminals and non-terminals are the result of the lexer or their respective actions.
During error recovery, however, bison will automatically discard semantic values from the stack. With reasonably recent bison versions, you can specify an action to be performed when a value is discarded using the %destructor declaration. (See the bison manual for details.) You can specify a destructor either by type or by symbol (or both, but the per-symbol destructor takes precedence.)
The %destructor action will be run whenever bison discards a semantic value. Roughly speaking, discarding a semantic value means that your program never had a chance to deal with the semantic value. It does not apply to values popped off the stack when a production is reduced, even if there is no explicit action associated with the reduction. A complete definition of "discarded" is at the end of the bison manual section cited earlier.
Without error productions, there is really not much possible in the way of error recovery other than discarding the entire stack and any lookahead symbols (which bison will do automatically) and then terminating the parse. You can do a bit better by adding error productions to your grammar. An error production includes the special token error; this token matches an empty sequence precisely in the case that there is no other possible match. Unlike normal productions, error productions do not need to be immediately visible; bison will discard states (and corresponding values) from the stack until it finds a state with an error transition, or it reaches the end of the stack. Also, the terminal following error in the error production does not need to be the lookahead token; bison will discard lookahead tokens (and corresponding values) until it is able to continue with the error production (or it reaches the end of the input). See the handy manual for a longer description of the process (or read about it in the Dragon book, if you have a copy nearby).
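As an illustration (a hypothetical rule, not taken from the question's grammar), an error production that resynchronizes on the next semicolon might look like:

```yacc
statement
    : expression ';'
    | error ';'     { yyerrok; /* resume normal parsing after the next ';' */ }
    ;
```

Here the `error` token matches the bad input up to the next ';', and `yyerrok` tells bison to leave error-recovery mode immediately.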

There are several questions here.
Bison detects an error by being in a parse state in which there is no action (shift or reduce) for the current lookahead token. In your example that would be in the state after shifting the ',' in init_declarator_list. In that state, only tokens in FIRST(init_declarator) will be valid, so any other token will cause an error.
Actions in the bison code are executed when the corresponding rule is reduced, so action (4) will never be called -- the parse never got far enough to reduce that rule. Action (3) ran when its rule was reduced, which happened before the ',' was shifted into the state where the error was detected.
After having an error (and calling yyerror with an error message), the parser will attempt to recover by popping states off the stack, looking for one in which the special error token can be shifted. As it pops and discards states, it will call the %destructor action for the symbols corresponding to those states, so you can use that to clean things up (free memory) if needed.
In your case, it looks like there are no error rules, so there are no states in which an error token can be shifted. So it will pop all states and then return failure from yyparse. If it does find a state that can shift an error token, it stops popping there, shifts the error token, and attempts to continue parsing in error-recovery mode. While in error-recovery mode, it counts how many tokens (other than the error token) it has shifted since it last had an error. If it has shifted fewer than 3 tokens before hitting another error, it will not call yyerror for the new error. In addition, if it has shifted 0 tokens, it will try to recover by reading and throwing away input tokens (instead of popping states) until it finds one that can be handled by the current state. As it discards tokens, it calls the %destructor for those tokens, so again you can clean up anything that needs cleaning.
So to answer your last question, you can use a %destructor declaration to delete stuff when an error occurs. The %destructor is called exactly once for each item that is discarded without being passed to a bison action. Items that are passed to actions (as $1, $2, ... in the action) will never have the %destructor called for them, so if you don't need them after the action, you should delete them there.
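For instance (a sketch; it assumes your semantic values are heap-allocated AST nodes and that free_node is a cleanup helper you provide):

```yacc
/* per symbol */
%destructor { free_node($$); } declaration init_declarator_list

/* or per type, if your %union has a member named 'node' */
%destructor { free_node($$); } <node>
```

With these in place, any node that bison discards during error recovery is passed to free_node instead of leaking.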

Related

Make Bison accept an alternative EOF token

I'm writing an ansi-C parser in C++ with flex and bison; it's pretty complex.
The issue I'm having is a compilation error. The error is below; it occurs because yyterminate() returns YY_NULL, which is defined as (an int) 0, while yylex has the return type yy::AnsiCParser::symbol_type. yyterminate(); is the automatic action for the <<EOF>> condition in scanners generated by flex. Obviously this causes a type mismatch.
My scanner doesn't produce any special token for the EOF, because EOF has no purpose in a C grammar. I could create a token-rule for the <<EOF>> but if I ignore it then the scanner hangs in an infinite loop in yylex on the YY_STATE_EOF(INITIAL) case.
The compilation error,
ansi-c.yy.cc: In function ‘yy::AnsiCParser::symbol_type yylex(AnsiCDriver&)’:
ansi-c.yy.cc:145:17: error: could not convert ‘0’ from ‘int’ to ‘yy::AnsiCParser::symbol_type {aka yy::AnsiCParser::basic_symbol<yy::AnsiCParser::by_type>}’
ansi-c.yy.cc:938:30: note: in expansion of macro ‘YY_NULL’
ansi-c.yy.cc:1583:2: note: in expansion of macro ‘yyterminate’
Also, Bison generates this rule for my start-rule (translation_unit) and the EOF ($end).
$accept: translation_unit $end
So yylex has to return something for the EOF or the parser will never stop waiting for input, but my grammar cannot support an EOF token. Is there a way to make Bison recognize something other than 0 for the $end condition without modifying my grammar?
Alternatively, is there simply something I can return from the <<EOF>> token in the scanner to satisfy the Bison $end condition?
Normally, you would not include an explicit EOF rule in a lexical analyzer, not because it serves no purpose, but rather because the default is precisely what you want to do. (The purpose it serves is to indicate that the input is complete; otherwise, the parser would accept the valid prefix of certain invalid programs.)
Unfortunately, the C++ interfaces can defeat the simple convenience of the default EOF action, which is to return 0 (or NULL). I assume from your problem description that you have asked bison to generate a parser using complete symbols. In that case, you cannot simply return a 0 from yylex, since the parser is expecting a complete symbol, which is a more complex type than int. (Although the token which reports EOF does not normally have a semantic value, it does have a location, if you are using locations.) For other token types, bison will have automatically generated a function which makes a token, named something like make_FOO_TOKEN, which you will call in your scanner action for a FOO_TOKEN.
While the C bison parser does automatically define the end of file token (called END), it appears that the C++ interface does not. So you need to manually define it in your %token declaration in your bison input file:
%token END 0 "end of file"
(That defines the token type END with an integer value of 0 and the human readable label "end of file". The value 0 is obligatory.)
Once you've done that, you can add an explicit EOF rule in your flex input file:
<<EOF>> return make_END();
If you are using locations, you'll have to give make_END a location argument as well.
Here's another way to prevent the compiler error could not convert 0 from int to ...symbol_type: place this redefinition of the yyterminate macro just below where you redefine YY_DECL.
// change curLocation to the name of the location object used in yylex
// qualify symbol_type with the bison namespace used
#define yyterminate() return symbol_type(YY_NULL, curLocation)
The compiler error shows up when bison locations are enabled, e.g. with %locations. This makes bison add a location parameter to its symbol_type constructors, so the constructor without locations
symbol_type(int tok)
turns into this with locations
symbol_type(int tok, location_type l)
rendering it no longer possible to convert an int to a symbol_type, which is what the default definition of yyterminate in flex relies on when bison locations are not enabled:
#define yyterminate() return YY_NULL
With this workaround there's no need to handle EOF in flex if you don't want to, and no need for a superfluous END token in bison if you don't need one.

print (instead of printf()) in a C/C++ program is detected by which part of the compiler?

Suppose I have written a C program and I have written print instead of printf.
Now my question is: which part of the compiler will detect this?
I'm assuming OP means which part of the compiler internally, such as the lexer, parser, type analyzer, name analyzer, code generator, etc.
Without knowing specifically about gcc/llvm, I would assume it's the name analyzer (more specifically, part of the "semantic analyzer", which generally also does type analysis), as it wouldn't be able to match "print" to anything that exists name-wise. This is the same check that rejects things such as:
x = 5;
When x does not exist previously.
Strictly speaking, assume that print will be represented by a token of the form:
{ token type = Identifier, token value = 'print' }
This transformation from source characters into tokens is done by the lexical analyzer. Let's say you have a function get_token that reads source-file characters and returns a token (in the form of the structure above). We can say that the source file is viewed as a sequence of such tokens.
To do the higher-level job we call lower-level routines, so assume now that you have a function parse_declaration that uses get_token. parse_declaration is responsible for recognizing declarations in your program (this is done using a parsing algorithm, e.g. recursive descent). If a declaration is recognized, its token value is saved in the symbol table, along with type information and attributes.
Now, assume you have a function parse_expression; it will call get_token, and if the token type is Identifier it will perform name lookup. This means it will search for the token value in the symbol table. If the search is unsuccessful, it will print an error message (something like "token value: undeclared identifier").
Of course this description is simplified. In practice there is rather sophisticated logic for lexical analysis, parsing, and semantics (how the language 'behaves'; name lookup is part of language semantics), and these pieces should be kept as independent of one another as possible.
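The lookup step can be sketched in a few lines of C (a toy, flat symbol table; the names and layout are invented for illustration):

```c
#include <stdio.h>
#include <string.h>

/* A toy symbol table: names the compiler has seen declared so far.
   (Purely illustrative; real compilers use scoped hash tables.) */
static const char *symtab[] = { "printf", "puts", "main" };
static const int symtab_len = sizeof symtab / sizeof symtab[0];

/* Name lookup, as the semantic analyzer would do it for an
   Identifier token: returns 1 if declared, 0 otherwise. */
int lookup(const char *name)
{
    for (int i = 0; i < symtab_len; i++)
        if (strcmp(symtab[i], name) == 0)
            return 1;
    return 0;
}
```

With this table, lookup("printf") succeeds while lookup("print") fails, and the failure is exactly the point at which the "undeclared identifier" diagnostic would be issued.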

How can I print whatever I see in Yacc/Bison?

I have a complicated Yacc file with a bunch of rules, some of them complicated, for example:
start: program
program: extern_list class
class: T_CLASS T_ID T_LCB field_dec_list method_dec_list T_RCB
The exact rules and the actions I take on them are not important, because what I want to do seems fairly simple: just print out the program as it appears in the source file, using the rules I define for other purposes. But I'm surprised at how difficult doing so is.
First I tried adding printf("%s%s", $1, $2) to the second rule above. This produced "��#P�#". From what I understand, the parsed text is also available as a variable, yytext. I added printf("%s", yytext) to every rule in the file and added extern char* yytext; to the top of the file. This produced (null){void)1133331122222210101010--552222202020202222;;;;||||&&&&;;;;;;;;;;}}}}}}}} from a valid file according to the language's syntax. Finally, I changed extern char* yytext; to extern char yytext[], thinking it would not make a difference. The difference in output it made is best shown as a screenshot
I am using Bison 3.0.2 on Xubuntu 14.04.
If you just want to echo the source to some output while parsing it, it is easiest to do that in the lexer. You don't say what you are using for a lexer, but you mention yytext, which is used by lex/flex, so I will assume that.
When you use flex to recognize tokens, the variable yytext refers to the internal buffer flex uses to recognize tokens. Within the action of a token, it can be used to get the text of the token, but only temporarily -- once the action completes and the next token is read, it will no longer be valid.
So if you have a flex rule like:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = yytext; return T_ID; }
that likely won't work at all, as you'll have dangling pointers running around in your program; probably the source of the random-looking outputs you're seeing. Instead you need to make a copy. If you also want to output the input unchanged, you can do that here too:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = strdup(yytext); ECHO; return T_ID; }
This uses the flex macro ECHO, which is roughly equivalent to fputs(yytext, yyout) -- copying the input to a FILE * called yyout (which defaults to stdout).
If the first symbol in the corresponding right-hand side is a terminal, $1 in a bison action means "the value of yylval produced by the scanner when it returned the token corresponding to that terminal". If the symbol is a non-terminal, then it refers to the value assigned to $$ during the evaluation of the action which reduced that non-terminal. If there was no such action, then the default $$ = $1 will have been performed, so it will pass through the semantic value of the first symbol in the reduction of that non-terminal.
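A minimal illustration of those rules (a hypothetical grammar fragment, assuming integer semantic values):

```yacc
expr
    : term              /* no action: the default $$ = $1 applies */
    | expr '+' term     { $$ = $1 + $3; /* $2 is the '+' token itself */ }
    ;
```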
I apologize if all that was obvious, but your snippet is not sufficient to show:
what the semantic types are for each non-terminal;
what the semantic types are for each terminal;
what values, if any, are assigned to yylval in the scanner actions;
what values, if any, are assigned to $$ in the bison actions.
If any of those semantic types are not, in fact, character strings, then the printf will obviously produce garbage. (gcc might be able to warn you about this, if you compile the generated code with -Wall. Despite the possibility of spurious warnings if you are using old versions of flex/bison, I think it is always worthwhile compiling with -Wall and carefully reading the resulting warnings.)
Using yytext in a bison action is problematic, since it will refer to the text of the last token scanned, typically the look-ahead token. In particular, at the end of the input, yytext will be NULL, and that is what you will pick up in any reductions which occur at the end of input. glibc's printf implementation is nice enough to print (null) instead of segfaulting when you provide (char*)0 for an argument formatted as %s, but I don't think it's a great idea to depend on that.
Finally, if you do have a char* semantic value, and you assign yylval = yytext (or yylval.sval = yytext; if you are using unions), then you will run into another problem, which is that yytext points into a temporary buffer owned by the scanner, and that buffer may have completely different contents by the time you get around to using the address. So you always need to make a copy of yytext if you want to pass it through to the parser.
If what you really want to do is see what the parser is doing, I suggest you enable bison's yydebug parser-trace feature. It will give you a lot of useful information, without requiring you to insert printf's into your bison actions at all.
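As a sketch of enabling that trace (directive spelling per the bison manual; yydebug is the global flag the generated parser checks at runtime):

```yacc
/* in the .y file */
%define parse.trace

/* in main(), before calling yyparse() */
extern int yydebug;
yydebug = 1;
```

With the flag set, the parser prints every shift, reduce, and state transition to stderr as it runs.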

Boost Semantic Actions causing parsing issues

I've been working with the Boost mini compiler example. Here is the root of the source code: http://www.boost.org/doc/libs/1_59_0/libs/spirit/example/qi/compiler_tutorial/mini_c/
The snippet that interests me is in statement_def.hpp
The problem I am having is that if you attach semantic actions, for example like such,
statement_ =
        variable_declaration[print_this_declaration]
    |   assignment
    |   compound_statement
    |   if_statement
    |   while_statement
    |   return_statement
    ;
And subsequent run the mini_c compiler on a sample program like:
int foo(n) {
    if (n == 3) { }
    return a;
}

int main() {
    return foo(10);
}
It triggers the "Duplicate Function Error" found within the "compile.cpp" file (found using the above link). Here is that snippet for quick reference:
if (functions.find(x.function_name.name) != functions.end())
{
    error_handler(x.function_name.id, "Duplicate function: " + x.function_name.name);
    return false;
}
For the life of me, I can't figure out why.
I'm not really sure how to characterize this problem, but it seems that somehow whatever is sent to standard out is being picked up by the parser as valid code to parse (but that seems impossible in this scenario).
Another possibility is the semantic action is somehow binding external data to a symbol table, where it is again considered to be part of the originally-parsed input file (when it shouldn't be).
The last and likely option is that probably I don't fully understand the minutiae of this example (or Boost for that matter), and that somewhere a pointer/reference/Iterator is being shifted to another memory location when it shouldn't (as a result of the SA), throwing the whole mini-compiler into disarray.
[...] it seems that somehow whatever is sent to standard out is being picked up by the parser as valid code to parse
As unlikely as it seemed... it is indeed :) No magic occurs.
Another possibility is the semantic action is somehow binding external data to a symbol table, where it is again considered to be part of the originally-parsed input file (when it shouldn't be).
You're not actually far off here. It's not so much "external" data, though. It's binding uninitialized data to the symbol table. And it actually tries to do that twice.
Step by step:
Qi rules that have semantic actions don't do automatic attribute propagation by default. It is assumed that the semantic action will be in charge of assigning a value to the attribute exposed.
This is the root cause. See documentation: Rule/Expression Semantics
Also: How Do Rules Propagate Attributes
Because of this, the actual attribute exposed by the statement_ rule will be a default-constructed object of type ast::statement:
qi::rule<Iterator, ast::statement(), skipper<Iterator> > statement_;
This type ast::statement is a variant, and a default constructed variant holds a default-constructed object of the first element type:
typedef boost::variant<
variable_declaration
, assignment
, boost::recursive_wrapper<if_statement>
, boost::recursive_wrapper<while_statement>
, boost::recursive_wrapper<return_statement>
, boost::recursive_wrapper<statement_list>
>
statement;
Lo and behold, that object is of type variable_declaration!
struct variable_declaration {
    identifier lhs;
    boost::optional<expression> rhs;
};
So, each time the statement_ rule matched, the AST was interpreted as a "declaration of a variable whose identifier name is the empty string". (Needless to say, the initializer (rhs) is also empty.)
Encountering this declaration a second time violates the rule that duplicate names cannot exist in the "symbol table".
HOW TO FIX?
You can explicitly indicate that you want automatic attribute propagation even in the presence of Semantic Actions.
Use operator%= instead of operator= to assign the rule definition:
statement_ %=
        variable_declaration[print_this_declaration]
    |   assignment
    |   compound_statement
    |   if_statement
    |   while_statement
    |   return_statement
    ;
Now, everything will work again.

In Windbg, can I skip breaking when specific C++ exceptions are thrown?

In Visual Studio, through the dialog at Debug > Exceptions..., you can set specific C++ exceptions types to break on or skip past. In Windbg, turning on breaking for C++ exceptions with sxe eh is all or nothing.
Is there any way to skip breaking on specific C++ exception types? Conversely, is there a way to break on only specific types?
Note: This answer is 32-bit specific, as I haven't yet done much 64-bit debugging. I don't know how much applies to 64-bit.
Assume the following code:
class foo_exception : public std::exception {};
void throw_foo()
{
throw foo_exception();
}
And let's assume you've turned on breaking on first chance exceptions for C++ exceptions: sxe eh
Now when the debugger breaks, your exception record will be on the top of the stack. So if you just want to see what the type is, you can display the exception record info:
0:000> .exr #esp
ExceptionAddress: 751dc42d (KERNELBASE!RaiseException+0x00000058)
ExceptionCode: e06d7363 (C++ EH exception)
ExceptionFlags: 00000001
NumberParameters: 3
Parameter[0]: 19930520
Parameter[1]: 0027f770
Parameter[2]: 0122ada0
pExceptionObject: 0027f770
_s_ThrowInfo : 0122ada0
Type : class foo_exception
Type : class std::exception
Take a look at the current stack, and you can see where this stuff is sitting:
0027f6c4 e06d7363
0027f6c8 00000001
0027f6cc 00000000
0027f6d0 751dc42d KERNELBASE!RaiseException+0x58
0027f6d4 00000003
0027f6d8 19930520
0027f6dc 0027f770
0027f6e0 0122ada0 langD!_TI2?AVfoo_exception
...
So the exception itself is sitting at 0027f770 in this example, as you can see from the .exr output next to pExceptionObject. And you can see that value on the stack at 0027f6dc, or offset from the top of the stack by 0x18, so #esp+18. Let's see what the debugger tells us about that location.
0:000> dpp #esp+18 L1
0027f6dc 0027f770 01225ffc langD!foo_exception::`vftable'
This command says: starting at #esp+18, dump one pointer-sized value, then deref the value found there as a pointer, too, and write the name of any symbol matching that second address. And in this case it found the vtable for the foo_exception class. That tells us that the object at address 0027f770 is a foo_exception. And we can use that information to create an expression for a conditional breakpoint.
We need a way to get the address of the vtable directly, and that looks like this:
#!"langD!foo_exception::`vftable'"
We have to quote it because of the backtick and apostrophe. We also need to pull the desired stack value:
poi(poi(#esp+18))
The poi operator takes an address and returns a pointer-sized value stored there. The first evaluation turns the stack address into the object address, and the second evaluation turns the object address into the vtable address, which we need for comparison. The whole condition looks like this:
#!"langD!foo_exception::`vftable'" == poi(poi(#esp+18))
Now that we can tell if it's a foo_exception, we can skip breaking on them by setting a command to run automatically when the debugger breaks on C++ exceptions:
sxe -c".if ( #!\"langD!foo_exception::`vftable'\" == poi(poi(#esp+18)) ) {gc}" eh
Translation:
break on first chance for C++ exceptions and run a command that:
compares the foo_exception vtable address to the vtable address of the object at #esp+18
if they are the same, issue the gc command, which continues running if the debugger was running when this command was reached
(don't forget to escape the inner quotes)
And if you want to break only for a foo_exception, change the condition from == to !=.
Something to keep in mind is that sometimes exceptions are thrown as a pointer instead of by value, which means you'll need one more poi() around the #esp part of the expression. You'll be able to tell because when you dump the exception record with .exr, the Type will be class foo_exception *. This is completely dependent on the code that throws the exception and not the exception type itself, so you may need to tailor your .if-condition for the situation.
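For the thrown-by-pointer case, the condition sketched above would need one extra dereference (same hypothetical langD module and stack offset):

```
#!"langD!foo_exception::`vftable'" == poi(poi(poi(#esp+18)))
```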
Lastly, if you want to break on or skip several exception types, it is doable. I would suggest writing a script with the chained .if, .elsif commands and setting the sxe automatic command to $$><path\to\script. Doing a ton of if-condition chaining on one line can be very difficult to read and get right, especially with the extra escaping. A script won't need the extra escaping. Here's a small example:
.if ( #!"langD!foo_exception::`vftable'" == poi(poi(#esp+0x18)) )
{
$$ skip foo_exceptions
gc
}
.elsif ( #!"langD!bar_exception::`vftable'" == poi(poi(#esp+0x18)) )
{
$$ dump the exception to see the error message, then continue running
dt poi(#esp+18) langD!bar_exception
gc
}
.elsif ( #!"langD!baz_exception::`vftable'" == poi(poi(#esp+0x18)) )
{
$$ show the top 10 frames of the stack and then break (because we don't `gc`)
kc 10
}
(Note: Windbg will complain about a script error whenever this runs because it doesn't like a gc command followed by anything else. But it still runs fine)