I have a large Yacc file with many rules, some of them complicated, for example:
start: program
program: extern_list class
class: T_CLASS T_ID T_LCB field_dec_list method_dec_list T_RCB
The exact rules and the actions I take on them are not important, because what I want to do seems fairly simple: just print out the program as it appears in the source file, using the rules I define for other purposes. But I'm surprised at how difficult doing so is.
First I tried adding printf("%s%s", $1, $2) to the second rule above. This produced "��#P�#". From what I understand, the parsed text is also available as a variable, yytext. I added printf("%s", yytext) to every rule in the file and added extern char* yytext; to the top of the file. This produced (null){void)1133331122222210101010--552222202020202222;;;;||||&&&&;;;;;;;;;;}}}}}}}} from a valid file according to the language's syntax. Finally, I changed extern char* yytext; to extern char yytext[], thinking it would not make a difference. The difference in output it made is best shown as a screenshot
I am using Bison 3.0.2 on Xubuntu 14.04.
If you just want to echo the source to some output while parsing it, it is easiest to do that in the lexer. You don't say what you are using for a lexer, but you mention yytext, which is used by lex/flex, so I will assume that.
When you use flex to recognize tokens, the variable yytext refers to the internal buffer flex uses to recognize tokens. Within the action of a token, it can be used to get the text of the token, but only temporarily -- once the action completes and the next token is read, it will no longer be valid.
So if you have a flex rule like:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = yytext; return T_ID; }
that likely won't work at all, as you'll have dangling pointers running around in your program; probably the source of the random-looking outputs you're seeing. Instead you need to make a copy. If you also want to output the input unchanged, you can do that here too:
[a-zA-Z_][a-zA-Z_0-9]* { yylval.str = strdup(yytext); ECHO; return T_ID; }
This uses the flex macro ECHO, which is roughly equivalent to fputs(yytext, yyout): it copies the matched input to a FILE * called yyout (which defaults to stdout).
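The dangling-pointer problem described above can be reproduced without flex. The following is only a sketch: scan_buffer and fake_scan are hypothetical stand-ins for the scanner's internal buffer and for reading one token, not anything flex actually generates. It shows that a saved pointer ends up showing the next token's text, while a copy keeps the original text.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for the scanner's internal buffer (what yytext points into).
static char scan_buffer[32];

// Pretend to scan one token: overwrite the shared buffer and return a pointer
// into it, much as yytext is only valid until the next token is read.
const char* fake_scan(const char* token_text) {
    assert(std::strlen(token_text) < sizeof scan_buffer);
    std::strcpy(scan_buffer, token_text);
    return scan_buffer;
}

struct SavedValues {
    const char* aliased;  // like yylval.str = yytext
    std::string copied;   // like yylval.str = strdup(yytext)
};

SavedValues scan_two_tokens() {
    SavedValues v;
    const char* text = fake_scan("class");
    v.aliased = text;  // keeps only the address of the shared buffer
    v.copied = text;   // copies the characters
    fake_scan("Foo");  // the next token overwrites the shared buffer
    return v;
}
```

After scan_two_tokens() returns, copied still holds the first token, but aliased now reads as the second one. With a real scanner the stale pointer can show anything at all.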
If the first symbol in the corresponding right-hand side is a terminal, $1 in a bison action means "the value of yylval produced by the scanner when it returned the token corresponding to that terminal". If the symbol is a non-terminal, then it refers to the value assigned to $$ during the evaluation of the action which reduced that non-terminal. If there was no such action, then the default $$ = $1 will have been performed, so it will pass through the semantic value of the first symbol in the reduction of that non-terminal.
I apologize if all that was obvious, but your snippet is not sufficient to show:
what the semantic types are for each non-terminal;
what the semantic types are for each terminal;
what values, if any, are assigned to yylval in the scanner actions;
what values, if any, are assigned to $$ in the bison actions.
If any of those semantic types are not, in fact, character strings, then the printf will obviously produce garbage. (gcc might be able to warn you about this, if you compile the generated code with -Wall. Despite the possibility of spurious warnings if you are using old versions of flex/bison, I think it is always worthwhile compiling with -Wall and carefully reading the resulting warnings.)
Using yytext in a bison action is problematic, since it will refer to the text of the last token scanned, typically the look-ahead token. In particular, at the end of the input, yytext will be NULL, and that is what you will pick up in any reductions which occur at the end of input. glibc's printf implementation is nice enough to print (null) instead of segfaulting when you provide (char*)0 to an argument formatted as %s, but I don't think it's a great idea to depend on that.
Finally, if you do have a char* semantic value, and you assign yylval = yytext (or yylval.sval = yytext; if you are using unions), then you will run into another problem, which is that yytext points into a temporary buffer owned by the scanner, and that buffer may have completely different contents by the time you get around to using the address. So you always need to make a copy of yytext if you want to pass it through to the parser.
If what you really want to do is see what the parser is doing, I suggest you enable bison's yydebug parser-trace feature. It will give you a lot of useful information, without requiring you to insert printf's into your bison actions at all.
I'm writing an ansi-C parser in C++ with flex and bison; it's pretty complex.
The issue I'm having is a compilation error, shown below. It occurs because yyterminate returns YY_NULL, which is defined as (an int) 0, while yylex has the return type yy::AnsiCParser::symbol_type. yyterminate(); is the automatic action for the <<EOF>> token in scanners generated by flex. Obviously this causes a type issue.
My scanner doesn't produce any special token for the EOF, because EOF has no purpose in a C grammar. I could create a token-rule for the <<EOF>> but if I ignore it then the scanner hangs in an infinite loop in yylex on the YY_STATE_EOF(INITIAL) case.
The compilation error,
ansi-c.yy.cc: In function ‘yy::AnsiCParser::symbol_type yylex(AnsiCDriver&)’:
ansi-c.yy.cc:145:17: error: could not convert ‘0’ from ‘int’ to ‘yy::AnsiCParser::symbol_type {aka yy::AnsiCParser::basic_symbol<yy::AnsiCParser::by_type>}’
ansi-c.yy.cc:938:30: note: in expansion of macro ‘YY_NULL’
ansi-c.yy.cc:1583:2: note: in expansion of macro ‘yyterminate’
Also, Bison generates this rule for my start-rule (translation_unit) and the EOF ($end).
$accept: translation_unit $end
So yylex has to return something for the EOF or the parser will never stop waiting for input, but my grammar cannot support an EOF token. Is there a way to make Bison recognize something other than 0 for the $end condition without modifying my grammar?
Alternatively, is there simply something I can return from the <<EOF>> token in the scanner to satisfy the Bison $end condition?
Normally, you would not include an explicit EOF rule in a lexical analyzer, not because it serves no purpose, but rather because the default is precisely what you want to do. (The purpose it serves is to indicate that the input is complete; otherwise, the parser would accept the valid prefix of certain invalid programs.)
Unfortunately, the C++ interfaces can defeat the simple convenience of the default EOF action, which is to return 0 (or NULL). I assume from your problem description that you have asked bison to generate a parser using complete symbols. In that case, you cannot simply return a 0 from yylex, since the parser is expecting a complete symbol, which is a more complex type than int. (Although the token which reports EOF does not normally have a semantic value, it does have a location, if you are using locations.) For other token types, bison will have automatically generated a function which makes a token, named something like make_FOO_TOKEN, which you will call in your scanner action for a FOO_TOKEN.
While the C bison parser does automatically define the end of file token (called END), it appears that the C++ interface does not. So you need to manually define it in your %token declaration in your bison input file:
%token END 0 "end of file"
(That defines the token type END with an integer value of 0 and the human readable label "end of file". The value 0 is obligatory.)
Once you've done that, you can add an explicit EOF rule in your flex input file:
<<EOF>> return make_END();
If you are using locations, you'll have to give make_END a location argument as well.
Here's another way to prevent the compiler error "could not convert 0 from int to ...symbol_type": place this redefinition of the yyterminate macro just below where you redefine YY_DECL:
// change curLocation to the name of the location object used in yylex
// qualify symbol_type with the bison namespace used
#define yyterminate() return symbol_type(YY_NULL, curLocation)
The compiler error shows up when bison locations are enabled, e.g. with %define locations - this makes bison add a location parameter to its symbol_type constructors so the constructor without locations
symbol_type(int tok)
turns into this with locations
symbol_type(int tok, location_type l)
This makes it no longer possible to convert an int to a symbol_type, which is what the default definition of yyterminate in flex relies on when bison locations are not enabled:
#define yyterminate() return YY_NULL
With this workaround there's no need to handle EOF in flex, and no need for a superfluous END token in bison, if you don't otherwise need them.
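The conversion problem can be shown in isolation. This is a sketch with hypothetical stand-in types (Location, SymbolNoLoc, SymbolWithLoc); they are not the classes bison generates, and only mimic the two constructor shapes described above.

```cpp
#include <type_traits>

// Hypothetical stand-ins; NOT the classes bison generates.
struct Location {
    int line = 1;
    int column = 1;
};

// Shape of the constructor when locations are disabled.
struct SymbolNoLoc {
    SymbolNoLoc(int tok) : token(tok) {}
    int token;
};

// Shape of the constructor when locations are enabled.
struct SymbolWithLoc {
    SymbolWithLoc(int tok, Location l) : token(tok), loc(l) {}
    int token;
    Location loc;
};

// "return 0;" only compiles when an int converts implicitly to the return type.
static_assert(std::is_convertible<int, SymbolNoLoc>::value,
              "a one-argument constructor allows the implicit conversion");
static_assert(!std::is_convertible<int, SymbolWithLoc>::value,
              "a two-argument constructor does not");

// What a redefined yyterminate() effectively has to do: build the symbol explicitly.
SymbolWithLoc end_of_input(Location where) {
    return SymbolWithLoc(0, where);
}
```

The two static_assert lines capture why the default flex macro compiles in one configuration and fails in the other.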
Suppose I have written a C program and I have written print instead of printf.
Now my question is: which part of the compiler will detect this?
I'm assuming OP means which part of the compiler internally, such as the lexer, parser, type analyzer, name analyzer, code generator, etc.
Without knowing specifically about gcc/llvm, I would assume that it's the Name Analyzer (more specifically, this is a part of the "Semantic Analyzer" generally, which also does Type Analysis), as that wouldn't be able to match "print" to anything that exists name wise. This is the same thing that prevents things such as:
x = 5;
When x does not exist previously.
Strictly speaking, assume that print will be represented by a token of the form:
{ token type = Identifier, token value = 'print' }
This transformation from source characters into tokens is done by the lexical analyzer. Let's say you have a function get_token that reads source file characters and returns a token (in the form of the structure above). We can say that the source file is viewed as a sequence of such tokens.
To do a higher-level job we call lower-level routines, so assume now that you have a function parse_declaration that uses get_token. parse_declaration is responsible for recognizing a declaration in your program (this is done using a parsing algorithm, e.g. recursive descent). If a declaration is recognized, it will save the token value in the symbol table, with type information and attributes.
Now, assume you have a function parse_expression. It will call get_token, and if the token type is Identifier it will perform name lookup. This means that it will search for the token value in the symbol table. If the search is unsuccessful it will print an error message (something like "token value : undeclared identifier").
Of course this concept is simplified. In practice there is rather sophisticated logic for lexical analysis, parsing, and semantics (how the language 'behaves'; name lookup is part of language semantics), and these pieces of logic should be as independent of one another as possible.
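The lookup described above can be sketched in a few lines. Everything here is illustrative: get_token follows the description above, check_names is a hypothetical helper that stands in for parse_expression, and the symbol table is just a set of declared names. The lexer is deliberately naive: it treats the contents of string literals as identifiers and does not skip keywords.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum class TokenType { Identifier, Other, End };

struct Token {
    TokenType type;
    std::string value;
};

// Lexical analysis: turn the next run of characters into a token.
Token get_token(const std::string& src, std::size_t& pos) {
    while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    if (pos >= src.size()) return {TokenType::End, ""};
    unsigned char c = static_cast<unsigned char>(src[pos]);
    if (std::isalpha(c) || c == '_') {
        std::size_t start = pos;
        while (pos < src.size() &&
               (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_')) {
            ++pos;
        }
        return {TokenType::Identifier, src.substr(start, pos - start)};
    }
    return {TokenType::Other, std::string(1, src[pos++])};
}

// Name lookup: report every identifier that is missing from the symbol table.
std::vector<std::string> check_names(const std::string& src,
                                     const std::set<std::string>& symbol_table) {
    std::vector<std::string> errors;
    std::size_t pos = 0;
    for (Token t = get_token(src, pos); t.type != TokenType::End; t = get_token(src, pos)) {
        if (t.type == TokenType::Identifier && symbol_table.count(t.value) == 0) {
            errors.push_back(t.value + " : undeclared identifier");
        }
    }
    return errors;
}
```

With a table containing printf and x, the input print(x); yields one error for print, while printf(x); yields none. Note that the lexer accepts both inputs; only the lookup step distinguishes them.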
I am currently writing a program that sits on top of a C++ interpreter. The user inputs C++ commands at runtime, which are then passed into the interpreter. For certain patterns, I want to replace the command given with a modified form, so that I can provide additional functionality.
I want to replace anything of the form
A->Draw(B1, B2)
with
MyFunc(A, B1, B2).
My first thought was regular expressions, but that would be rather error-prone, as any of A, B1, or B2 could be arbitrary C++ expressions. As these expressions could themselves contain quoted strings or parentheses, it would be quite difficult to match all cases with a regular expression. In addition, there may be multiple, nested forms of this expression.
My next thought was to call clang as a subprocess, use "-dump-ast" to get the abstract syntax tree, modify that, then rebuild it into a command to be passed to the C++ interpreter. However, this would require keeping track of any environment changes, such as include files and forward declarations, in order to give clang enough information to parse the expression. As the interpreter does not expose this information, this seems infeasible as well.
The third thought was to use the C++ interpreter's own internal parsing to convert to an abstract syntax tree, then build from there. However, this interpreter does not expose the ast in any way that I was able to find.
Are there any suggestions as to how to proceed, either along one of the stated routes, or along a different route entirely?
What you want is a Program Transformation System.
These are tools that generally let you express changes to source code, written in source level patterns that essentially say:
if you see *this*, replace it by *that*
but operating on Abstract Syntax Trees so the matching and replacement process is
far more trustworthy than what you get with string hacking.
Such tools have to have parsers for the source language of interest.
The source language being C++ makes this fairly difficult.
Clang sort of qualifies; after all it can parse C++. OP objects
it cannot do so without all the environment context. To the extent
that OP is typing (well-formed) program fragments (statements, etc.)
into the interpreter, Clang may [I don't have much experience with it
myself] have trouble getting focused on what the fragment is (statement? expression? declaration? ...). Finally, Clang isn't really a PTS; its tree modification procedures are not source-to-source transforms. That matters for convenience but might not stop OP from using it; surface syntax rewrite rules are convenient, but you can always substitute procedural tree hacking with more effort. When there are more than a few rules, this starts to matter a lot.
GCC with Melt sort of qualifies in the same way that Clang does.
I'm under the impression that Melt makes GCC at best a bit less
intolerable for this kind of work. YMMV.
Our DMS Software Reengineering Toolkit with its full C++14 [EDIT July 2018: C++17] front end absolutely qualifies. DMS has been used to carry out massive transformations
on large scale C++ code bases.
DMS can parse arbitrary (well-formed) fragments of C++ without being told in advance what the syntax category is, and return an AST of the proper grammar nonterminal type, using its pattern-parsing machinery. [You may end up with multiple parses, e.g. ambiguities, that you'll have to decide how to resolve; see Why can't C++ be parsed with a LR(1) parser? for more discussion.] It can do this without resorting to "the environment" if you are willing to live without macro expansion while parsing, and insist the preprocessor directives (they get parsed too) are nicely structured with respect to the code fragment (#if foo{#endif not allowed), but that's unlikely to be a real problem for interactively entered code fragments.
DMS then offers a complete procedural AST library for manipulating the parsed trees (search, inspect, modify, build, replace) and can then regenerate surface source code from the modified tree, giving OP text
to feed to the interpreter.
Where it shines in this case is OP can likely write most of his modifications directly as source-to-source syntax rules. For his
example, he can provide DMS with a rewrite rule (untested but pretty close to right):
rule replace_Draw(A:primary,B1:expression,B2:expression):
primary->primary
"\A->Draw(\B1, \B2)" -- pattern
rewrites to
"MyFunc(\A, \B1, \B2)"; -- replacement
and DMS will take any parsed AST containing the left hand side "...Draw..." pattern and replace that subtree with the right hand side, after substituting the matches for A, B1 and B2. The quote marks are metaquotes and are used to distinguish C++ text from rule-syntax text; the backslash is a metaescape used inside metaquotes to name metavariables. For more details of what you can say in the rule syntax, see DMS Rewrite Rules.
If OP provides a set of such rules, DMS can be asked to apply the entire set.
So I think this would work just fine for OP. It is a rather heavyweight mechanism to "add" to the package he wants to provide to a 3rd party; DMS and its C++ front end are hardly "small" programs. But then modern machines have lots of resources, so I think it's a question of how badly OP needs to do this.
Try modifying the headers to suppress the method; then, when you compile, the errors will show you every place where the code needs to be replaced.
Since you have a C++ interpreter (such as CERN's ROOT), I guess you must use the compiler to intercept all the Draw calls. An easy and clean way to do that is to declare the Draw method as private in the headers, using some defines:
class ItemWithDrawMethod
{
....
public:
#ifdef CATCHTHEMETHOD
private:
#endif
void Draw(A,B);
#ifdef CATCHTHEMETHOD
public:
#endif
....
};
Then compile as:
gcc -DCATCHTHEMETHOD=1 yourfilein.cpp
If the user wants to input complex algorithms into the application, what I suggest is integrating a scripting language into the app, so that the user can write code [a function/algorithm in a defined way] and the app can execute it in the interpreter and get the final results. Examples: Python, Perl, JS, etc.
Since you need C++ in the interpreter http://chaiscript.com/ would be a suggestion.
What happens when someone gets ahold of the Draw member function (auto draw = &A::Draw;) and then starts using draw? Presumably you'd want the same improved Draw-functionality to be called in this case too. Thus I think we can conclude that what you really want is to replace the Draw member function with a function of your own.
Since it seems you are not in a position to modify the class containing Draw directly, a solution could be to derive your own class from A and override Draw in there. Then your problem reduces to having your users use your new improved class.
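A minimal sketch of the derive-and-override idea follows. It assumes Draw is declared virtual in the original class, which the question does not confirm; if it is not virtual, the derived version only hides the original and is bypassed whenever the object is used through a pointer or reference to A. The class bodies and the int parameters are hypothetical.

```cpp
#include <string>

// Hypothetical stand-in for the library class; assumes Draw is virtual.
class A {
public:
    virtual ~A() = default;
    virtual std::string Draw(int b1, int b2) {
        return "A::Draw(" + std::to_string(b1) + ", " + std::to_string(b2) + ")";
    }
};

// Replacement class: adds behaviour, then delegates to the original Draw.
class MyA : public A {
public:
    std::string Draw(int b1, int b2) override {
        return "MyFunc -> " + A::Draw(b1, b2);
    }
};

// Code that only knows about A still reaches the override.
std::string draw_via_base(A& obj) {
    return obj.Draw(1, 2);
}
```

Because dispatch happens at run time, taking the address of the member function (&A::Draw) and calling through it on a MyA object also ends up in the override.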
You may again consider the problem of automatically translating uses of class A to your new derived class, but this still seems pretty difficult without the help of a full C++ implementation. Perhaps there is a way to hide the old definition of A and present your replacement under that name instead, via clever use of header files, but I cannot determine whether that's the case from what you've told us.
Another possibility might be to use some dynamic linker hackery using LD_PRELOAD to replace the function Draw that gets called at runtime.
There may be a way to accomplish this mostly with regular expressions.
Since anything that appears after Draw( is already formatted correctly as parameters, you don't need to fully parse them for the purpose you have outlined.
Fundamentally, the part that matters is the "SYMBOL->Draw("
SYMBOL could be any expression that resolves to an object that overloads -> or to a pointer of a type that implements Draw(...). If you reduce this to two cases, you can short-cut the parsing.
For the first case, use a simple regular expression that searches for any valid C++ symbol, something similar to "[A-Za-z_][A-Za-z0-9_\.]*", followed by the literal expression "->Draw(". This will give you the portion that must be rewritten, since the code following this part is already formatted as valid C++ parameters.
The second case is for complex expressions that return an overloaded object or pointer. This requires a bit more effort, but a short parsing routine to walk backward through just a complex expression can be written surprisingly easily, since you don't have to support blocks (blocks in C++ cannot return objects, since lambda definitions do not call the lambda themselves, and actual nested code blocks {...} can't return anything directly inline that would apply here). Note that if the expression doesn't end in ) then it has to be a valid symbol in this context, so if you find a ) just match nested ) with ( and extract the symbol preceding the nested SYMBOL(...(...)...)->Draw() pattern. This may be possible with regular expressions, but should be fairly easy in normal code as well.
As soon as you have the symbol or expression, the replacement is trivial, going from
SYMBOL->Draw(...
to
YourFunction(SYMBOL, ...
without having to deal with the additional parameters to Draw().
As an added benefit, chained function calls are parsed for free with this model, since you can recursively iterate over the code such as
A->Draw(B...)->Draw(C...)
The first iteration identifies the first A->Draw( and rewrites the whole statement as
YourFunction(A, B...)->Draw(C...)
which then identifies the second ->Draw with an expression "YourFunction(A, ...)->" preceding it, and rewrites it as
YourFunction(YourFunction(A, B...), C...)
where B... and C... are well-formed C++ parameters, including nested calls.
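The procedure above might be sketched as follows. This is one possible reading, not a tested production rewriter: rewrite_draw_calls and receiver_start are hypothetical names, the receiver is taken to be a (possibly dotted or ->-chained) name optionally followed by one balanced (...) group, and string literals or comments containing parentheses are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

static bool is_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '.';
}

// Walk backward from `end` (exclusive) over one receiver expression and
// return the index where it starts.
static std::size_t receiver_start(const std::string& s, std::size_t end) {
    std::size_t i = end;
    while (true) {
        // Optional trailing call: match the final ')' back to its '('.
        if (i > 0 && s[i - 1] == ')') {
            int depth = 0;
            while (i > 0) {
                --i;
                if (s[i] == ')') ++depth;
                else if (s[i] == '(' && --depth == 0) break;
            }
        }
        // The symbol (or dotted name) in front of it.
        while (i > 0 && is_name_char(s[i - 1])) --i;
        // Step over a preceding "->" so that a->b is taken as a whole.
        if (i >= 2 && s[i - 2] == '-' && s[i - 1] == '>') i -= 2;
        else return i;
    }
}

// Rewrite every RECEIVER->Draw(args) into func(RECEIVER, args), leftmost first.
std::string rewrite_draw_calls(std::string code, const std::string& func = "YourFunction") {
    const std::string needle = "->Draw(";
    std::size_t at;
    while ((at = code.find(needle)) != std::string::npos) {
        std::size_t start = receiver_start(code, at);
        std::string receiver = code.substr(start, at - start);
        std::size_t args = at + needle.size();
        bool no_args = args < code.size() && code[args] == ')';
        code.replace(start, args - start, func + "(" + receiver + (no_args ? "" : ", "));
    }
    return code;
}
```

Because each pass rewrites the leftmost remaining ->Draw(, a chained call is handled in successive passes: the result of the first rewrite becomes the receiver of the second.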
Without knowing the C++ version that your interpreter supports, or the kind of code you will be rewriting, I really can't provide any sample code that is likely to be worthwhile.
One way is to load the user code as a DLL (something like plugins).
This way, you don't need to recompile your actual application; only the user code is compiled, and your application loads it dynamically.
I have a file that contains an arbitrary number of lines of c++ code, each line of which is self-contained (meaning it is valid by itself in the main function). However, I do not know how many, if any, of the lines will have valid c++ syntax. An example file might be
int length, width; // This one is fine
template <class T className {}; // Throws a syntax error
What I want to do is write to a second file all the lines that have valid syntax. Currently, I've written a program in python that reads each line, places it into the following form
int main() {
// Line goes here
return 0;
}
and attempts to compile it, returning True if the compilation succeeds and False if it doesn't, which I then use to determine which lines to write to the output file. For example, the first line would generate a file containing
int main() {
int length, width;
return 0;
}
which would compile fine and return True to the python program. However, I'm curious if there is any sort of try-catch syntax that works with the compiler so I could put each line of the file in a try-catch block and write it to the output if no exception is thrown, or if there's a way I can tell the compiler to ignore syntax errors.
Edit: I've been asked for details about why I would need to do this, and I'll be the first to admit it's a strange question. The reason I'm doing this is because I have another program (of which I don't know all the implementation details) that writes a large number of lines to a file, each of which should be able to stand alone. I also know that this program will almost certainly write lines that have syntax errors. What I'm trying to do is write a program that will remove any invalid lines so that the resulting file can compile without error. What I have in my python program right now works, but I'm trying to figure out if there is a simpler way to do it.
Edit 2: Though I think I've got my answer - that I can't really play try-catch with the compiler, and that's good enough. Thanks everyone!
Individual lines of code that are syntactically correct in the context of a C++ source file are not necessarily syntactically correct by themselves.
For example this:
int length, width;
happens to be valid either as part of a main function or by itself -- but it has a different meaning (by itself it defines length and width as static objects).
This:
}
is valid in context, but not by itself.
There is typically no way for a compiler to ignore syntax errors. Once a syntax error has been encountered, the compiler has no way to interpret the rest of the code.
When you're reading English text, adfasff iyufoyur; ^^$(( -- but you can usually recover and recognize valid syntax after an error. Compilers for programming languages aren't designed to perform that kind of recovery; probably the nature of C++'s syntax would make it more difficult, and there's just not enough demand to make it worth doing.
I'm not sure what your criterion for a single line of code being "correct" is. One possibility might be to write the line of code to a file, contained in a definition of main:
int main() {
// insert arbitrary line here
}
and then compile the resulting source file. I'm not sure that I can see how that would be particularly useful, but it's about the closest I can come to what you're asking for.
What do you mean by "each line is self-contained"? Whether the syntax of a line of C++ code is valid may depend largely on the code before or after that line. A given line of code might be valid within a function, but not outside a function body. So, as long as you can't define what you mean by "self-contained", it is hard to solve your problem.
I created an .exe file, generated by lex and yacc, which can parse an expression. But it only reads its input from the screen and only prints the parse result to the screen. I saw some suggestions about using YY_BUFFER_STATE yy_scan_buffer(char *base, yy_size_t size), but I still could not find a good way to do it.
Is it possible to include some headers (generated by lex and yacc) in my main C++ program, so that I can call yylex(), give it a string as input, and get the return value in the main program? I am confused about how to do this. Thanks for your help.
yy_scan_string is how you give flex a string as input. You call that first, and then call yylex and it will use that string as the input to get tokens from rather than stdin. When you get an EOF from yylex, it has scanned the entire string. You can then call yy_delete_buffer on the YY_BUFFER_STATE returned by yy_scan_string (to free up memory) and call yy_scan_string again if you want to scan a new string.
You can use yy_scan_buffer instead to save a bit of copying, but then you have to set up the buffer properly yourself (basically, it needs to end with two NUL bytes instead of just one).
Unfortunately, there's no standard header file from flex declaring these. So you need to either declare them yourself somewhere (copy the declarations from the flex documentation), or call them in the 3rd section of the .l file, which is copied verbatim to the end of the lex.yy.c file.
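As a rough illustration of the buffer layout yy_scan_buffer expects, the helper below (make_scan_buffer is a hypothetical name) appends the two terminating NUL bytes. The flex calls are shown only as comments because they exist only in a program linked with a flex-generated scanner; the declarations are paraphrased from the flex manual and may need extern "C" if the scanner is compiled as C.

```cpp
#include <string>
#include <vector>

// Build a buffer of the shape yy_scan_buffer expects:
// the text to scan followed by two NUL bytes.
std::vector<char> make_scan_buffer(const std::string& text) {
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');
    buf.push_back('\0');
    return buf;
}

// In a file linked with the flex-generated scanner, usage would look roughly
// like this (not compiled here; the exact declarations are an assumption):
//
//   typedef struct yy_buffer_state* YY_BUFFER_STATE;
//   YY_BUFFER_STATE yy_scan_buffer(char* base, yy_size_t size);
//   void yy_delete_buffer(YY_BUFFER_STATE b);
//   int yylex(void);
//
//   std::vector<char> buf = make_scan_buffer("1 + 2");
//   YY_BUFFER_STATE state = yy_scan_buffer(buf.data(), buf.size());
//   while (yylex() != 0) { /* consume tokens until EOF */ }
//   yy_delete_buffer(state);
```

The size passed to yy_scan_buffer includes the two trailing NUL bytes, and the buffer must stay alive until scanning is finished, since flex scans it in place rather than copying it.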