I am working with LEX and YACC, and I have a question regarding how to define tokens: I have two regular expressions which share some characters. See the example below:
SHARED "+"|"-"|"/"|"="|"%"|"?"|"!"|"."|"$"|"_"|"~"|"&"|"^"|"<"|">"|"("|")"|","
REXP_1 {SHARED}|[a-zA-Z]|[ \t]+|[\\][\\\"]
REXP_2 {SHARED}|[a-zA-Z]|[ \t]+|"*"
Now my question is how to identify whether a character from the shared regular expression corresponds to REXP_1 or to REXP_2 when I define the tokens in the rules section of the .lex file.
I think I am misunderstanding something; I guess the way I write the regular expressions is wrong, but I cannot find a better way to express them. Could you please give me some hints?
Moreover, I would appreciate it if someone could advise me on some criteria to determine when to define a token (file.lex) and when to define a symbol in the grammar (file.y). For some symbols it is easy to figure out whether they are tokens or grammar symbols, but for some others I find it difficult to decide where to put them.
By the way, I am working with this grammar.
(Answered in a question edit)
The OP wrote:
Just in case someone finds it interesting, I am going to write out the lessons I learned. I think the most important lesson is that common sense is a great tool for figuring out what is an internal token in the .lex file and what is a suitable token to share with the .y file.
Since the term 'common sense' may be a bit ambiguous, I post the following example:
ALPHA_NUMERIC [a-zA-Z0-9]
SQ_CHAR {SHARED}|{ALPHA_NUMERIC}
SINGLE_QUOTED {SINGLE_QUOTE}{SQ_CHAR}{SQ_CHAR}*{SINGLE_QUOTE}
where ALPHA_NUMERIC is a good internal token (file.lex) but a bad token to share with the grammar file, whereas SINGLE_QUOTED may be a good token to share with the grammar (file.y). I wrote 'may be' because it is very dependent on the specific grammar we are working on; in my concrete case it is a good token to share with the YACC file.
What I did was define, in file.lex, a token with a regexp similar to the one @OGHaza advised, and then use that token in the grammar itself (file.y).
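For illustration, the split might look roughly like this (a sketch only; the token name T_SINGLE_QUOTED and the value rule are invented for this example). In file.lex, SQ_CHAR and ALPHA_NUMERIC stay internal definitions, and only the complete literal is returned as a token:

%%
{SINGLE_QUOTED}    { return T_SINGLE_QUOTED; }

In file.y, the shared token is declared and then used like any other grammar symbol:

%token T_SINGLE_QUOTED
%%
value: T_SINGLE_QUOTED ;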
Related
I have been trying to solve an issue related to a current school assignment, and I would highly appreciate it if anyone could explain to me why I am getting compiler warnings such as decafast.y:201.13-16: warning: rule useless in grammar [-Wother] | Type.
I have provided my code in the two following pastebin files:
decafast.lex: https://pastebin.com/2qzG2cwW
decafast.y: https://pastebin.com/Akg5ehW1
I have also been provided with a file 'decafast.cc' that contains classes and methods that enable me to create lists (I believe that is the purpose), found at:
https://pastebin.com/M7XRJunL
and the specification that I am supposed to follow (grammar found at bottom of page):
http://anoopsarkar.github.io/compilers-class/decafspec.html
My main concern is why I am getting these warnings, which I assume are causing my code to fail. Nearly every rule of my grammar (if not all of them) is deemed useless, and despite my online searching (or my failure to understand what has already been said), I am still unsuccessful.
I also had a secondary question if anyone is able to enlighten me. Regarding the .cc file above, I have been given a few classes that implement the decafAST class. In my parser generator file (decafast.y), I attempt to create lists by doing something along the lines of
decafStmtList *s = new decafStmtList();
I assumed that this would allow me to use the push_back() and push_front() methods. For example, in the case of ident_list (line 94): if I see one T_ID, I create the list for T_ID (identifiers) and push the current T_ID into it. If I see a situation of ident_list T_COMMA T_ID (which I assumed to be the recurring case of a comma-separated list of identifiers), I recognize that as an ident_list pattern and thus push that T_ID into the list as well; a sketch of what I mean is below. Is this the correct way to use the lists that I have been provided?
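In bison terms, what I am attempting looks something like this (a sketch only; I am assuming here that T_ID's semantic value is already a decafAST* node, which may not match the actual decafast.cc types):

ident_list: T_ID
        {
            decafStmtList *s = new decafStmtList();
            s->push_back($1);      /* assumption: $1 is a decafAST* for the identifier */
            $$ = s;
        }
    | ident_list T_COMMA T_ID
        {
            decafStmtList *s = $1; /* the list built so far */
            s->push_back($3);
            $$ = s;
        }
    ;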
I would like to stress that as this is an assignment question, I am pleading for any help that you can provide that will allow me to learn on my own terms. I am certain that the users on this website would easily be able to solve this assignment, therefore I would very highly appreciate any insight that you may be able to offer without giving me explicit answers. Thank you all for your time!
The grammar you were provided starts with:
Program = Externs package identifier "{" FieldDecls MethodDecls "}" .
That is, a Program consists of:
A possibly empty list of external declarations (library functions used)
The keyword "package"
An identifier
An open brace
A possibly empty list of field declarations
A possibly empty list of method declarations
A close brace.
Most of the rest of the grammar defines what field and method declarations look like, although a couple of productions define external declarations.
But your grammar is quite different (I removed the actions because they aren't relevant to the grammar):
start: program
program: extern_list decafpackage
extern_list:
| ExternDefn
decafpackage: T_PACKAGE T_ID T_LCB T_RCB
Your decafpackage consists only of package ID { }, with nothing between the braces.
So most of the rest of the grammar productions, which detail field and method declarations, can never be used, making them useless.
(Also, your extern_list does not define a list of ExternDefn. It defines a single optional ExternDefn. I think you made that same mistake in other list productions.)
The syntax for a bison rule is:
result: components…;
As far as I can see, none of your rules has the semicolon.
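Putting those fixes together, the top of the grammar might look like this (a sketch; field_decl_list and method_decl_list are placeholder names for list rules you would write following the same left-recursive pattern):

start: program ;

program: extern_list decafpackage ;

extern_list: /* empty */
    | extern_list ExternDefn
    ;

decafpackage: T_PACKAGE T_ID T_LCB field_decl_list method_decl_list T_RCB ;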
I am currently writing a program that sits on top of a C++ interpreter. The user inputs C++ commands at runtime, which are then passed into the interpreter. For certain patterns, I want to replace the command given with a modified form, so that I can provide additional functionality.
I want to replace anything of the form
A->Draw(B1, B2)
with
MyFunc(A, B1, B2).
My first thought was regular expressions, but that would be rather error-prone, as any of A, B1, or B2 could be arbitrary C++ expressions. As these expressions could themselves contain quoted strings or parentheses, it would be quite difficult to match all cases with a regular expression. In addition, there may be multiple, nested forms of this expression.
My next thought was to call clang as a subprocess, use "-ast-dump" to get the abstract syntax tree, modify that, then rebuild it into a command to be passed to the C++ interpreter. However, this would require keeping track of any environment changes, such as include files and forward declarations, in order to give clang enough information to parse the expression. As the interpreter does not expose this information, this seems infeasible as well.
My third thought was to use the C++ interpreter's own internal parsing to convert the input to an abstract syntax tree, then build from there. However, this interpreter does not expose the AST in any way that I was able to find.
Are there any suggestions as to how to proceed, either along one of the stated routes, or along a different route entirely?
What you want is a Program Transformation System (PTS).
These are tools that generally let you express changes to source code, written as source-level patterns that essentially say:
if you see *this*, replace it by *that*
but operating on Abstract Syntax Trees, so the matching and replacement process is far more trustworthy than what you get with string hacking.
Such tools have to have parsers for the source language of interest, and the source language being C++ makes this fairly difficult.
Clang sort of qualifies; after all, it can parse C++. OP objects that it cannot do so without all the environment context. To the extent that OP is typing (well-formed) program fragments (statements, etc.) into the interpreter, Clang may [I don't have much experience with it myself] have trouble getting focused on what the fragment is (statement? expression? declaration? ...). Finally, Clang isn't really a PTS; its tree modification procedures are not source-to-source transforms. That matters for convenience but might not stop OP from using it; surface-syntax rewrite rules are convenient, but you can always substitute procedural tree hacking with more effort. When there are more than a few rules, this starts to matter a lot.
GCC with MELT sort of qualifies in the same way that Clang does. I'm under the impression that MELT makes GCC at best a bit less intolerable for this kind of work. YMMV.
Our DMS Software Reengineering Toolkit, with its full C++14 [EDIT July 2018: C++17] front end, absolutely qualifies. DMS has been used to carry out massive transformations on large-scale C++ code bases.
DMS can parse arbitrary (well-formed) fragments of C++ without being told in advance what the syntax category is, and return an AST of the proper grammar nonterminal type, using its pattern-parsing machinery. [You may end up with multiple parses, e.g. ambiguities, that you'll have to decide how to resolve; see Why can't C++ be parsed with a LR(1) parser? for more discussion.] It can do this without resorting to "the environment" if you are willing to live without macro expansion while parsing, and insist that preprocessor directives (they get parsed too) are nicely structured with respect to the code fragment (#if foo{#endif not allowed), but that's unlikely to be a real problem for interactively entered code fragments.
DMS then offers a complete procedural AST library for manipulating the parsed trees (search, inspect, modify, build, replace) and can then regenerate surface source code from the modified tree, giving OP text to feed to the interpreter.
Where it shines in this case is that OP can likely write most of his modifications directly as source-to-source syntax rules. For his example, he can provide DMS with a rewrite rule (untested, but pretty close to right):
rule replace_Draw(A:primary,B1:expression,B2:expression):
primary->primary
"\A->Draw(\B1, \B2)" -- pattern
rewrites to
"MyFunc(\A, \B1, \B2)"; -- replacement
and DMS will take any parsed AST containing the left hand side "...Draw..." pattern and replace that subtree with the right hand side, after substituting the matches for A, B1 and B2. The quote marks are metaquotes and are used to distinguish C++ text from rule-syntax text; the backslash is a metaescape used inside metaquotes to name metavariables. For more details of what you can say in the rule syntax, see DMS Rewrite Rules.
If OP provides a set of such rules, DMS can be asked to apply the entire set.
So I think this would work just fine for OP. It is a rather heavyweight mechanism to "add" to the package he wants to provide to a third party; DMS and its C++ front end are hardly "small" programs. But then modern machines have lots of resources, so I think it's a question of how badly OP needs to do this.
Try modifying the headers to suppress the method; when you then compile, the errors will show you all the call sites, and you will be able to replace them all.
Since you have a C++ interpreter (such as CERN's ROOT), I guess you must use the compiler to intercept all the Draw calls. An easy and clean way to do that is to declare the Draw method as private in the headers, using some defines:
class ItemWithDrawMethod
{
....
public:
#ifdef CATCHTHEMETHOD
private:
#endif
void Draw(A,B);
#ifdef CATCHTHEMETHOD
public:
#endif
....
};
Then compile as:
g++ -DCATCHTHEMETHOD=1 yourfilein.cpp
In case the user wants to input complex algorithms to the application, what I suggest is integrating a scripting language into the app, so that the user can write code (a function or algorithm, in a defined way) that the app can execute in the interpreter to get the final results. Examples: Python, Perl, JS, etc.
Since you need C++ in the interpreter, http://chaiscript.com/ would be a suggestion.
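A minimal embedding sketch (untested; myFunc is a placeholder, and chai.add/chai.eval are ChaiScript's basic registration and evaluation calls):

#include <chaiscript/chaiscript.hpp>
#include <iostream>

// The replacement function we want user scripts to be able to call.
double myFunc(double a, double b) { return a + b; }

int main() {
    chaiscript::ChaiScript chai;                   // the embedded interpreter
    chai.add(chaiscript::fun(&myFunc), "myFunc");  // expose a C++ function to scripts
    std::cout << chai.eval<double>("myFunc(2.0, 3.0)") << "\n";  // prints 5
}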
What happens when someone gets ahold of the Draw member function (auto draw = &A::Draw;) and then starts using draw? Presumably you'd want the same improved Draw-functionality to be called in this case too. Thus I think we can conclude that what you really want is to replace the Draw member function with a function of your own.
Since it seems you are not in a position to modify the class containing Draw directly, a solution could be to derive your own class from A and override Draw in there. Then your problem reduces to having your users use your new improved class.
You may again consider the problem of automatically translating uses of class A to your new derived class, but this still seems pretty difficult without the help of a full C++ implementation. Perhaps there is a way to hide the old definition of A and present your replacement under that name instead, via clever use of header files, but I cannot determine whether that's the case from what you've told us.
Another possibility might be to use some dynamic linker hackery using LD_PRELOAD to replace the function Draw that gets called at runtime.
There may be a way to accomplish this mostly with regular expressions.
Since anything that appears after Draw( is already formatted correctly as parameters, you don't need to fully parse them for the purpose you have outlined.
Fundamentally, the part that matters is the "SYMBOL->Draw("
SYMBOL could be any expression that resolves to an object that overloads -> or to a pointer of a type that implements Draw(...). If you reduce this to two cases, you can short-cut the parsing.
For the first case, use a simple regular expression that searches for any valid C++ symbol, something similar to "[A-Za-z_][A-Za-z0-9_.]*", along with the literal expression "->Draw(". This will give you the portion that must be rewritten, since the code following this part is already formatted as valid C++ parameters.
The second case is for complex expressions that return an overloaded object or a pointer. This requires a bit more effort, but a short parsing routine that walks backward through just a complex expression can be written surprisingly easily, since you don't have to support blocks (blocks in C++ cannot return objects here: lambda definitions do not call the lambda themselves, and actual nested code blocks {...} can't return anything directly inline that would apply). Note that if the expression doesn't end in ) then it has to be a plain symbol in this context; if you do find a ), match nested ) with ( and extract the expression preceding the nested SYMBOL(...(...)...)->Draw() pattern. This may be possible with regular expressions, but it should be fairly easy in normal code as well.
As soon as you have the symbol or expression, the replacement is trivial, going from
SYMBOL->Draw(...
to
YourFunction(SYMBOL, ...
without having to deal with the additional parameters to Draw().
As an added benefit, chained function calls are parsed for free with this model, since you can recursively iterate over the code such as
A->Draw(B...)->Draw(C...)
The first iteration identifies the first A->Draw( and rewrites the whole statement as
YourFunction(A, B...)->Draw(C...)
which then identifies the second ->Draw with an expression "YourFunction(A, ...)->" preceding it, and rewrites it as
YourFunction(YourFunction(A, B...), C...)
where B... and C... are well-formed C++ parameters, including nested calls.
Without knowing the C++ version that your interpreter supports, or the kind of code you will be rewriting, I really can't provide any sample code that is likely to be worthwhile.
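Still, to illustrate the backward-scan idea described above, here is a rough sketch (it deliberately ignores string literals, comments, templates, and chained -> accesses, and uses MyFunc/Draw from the question):

#include <cctype>
#include <iostream>
#include <string>

// Find the start of the expression immediately preceding position `end`:
// first walk backward over one balanced (...) group if present, then over
// identifier / member-access characters.
size_t exprStart(const std::string& s, size_t end) {
    size_t i = end;
    if (i > 0 && s[i - 1] == ')') {            // complex case: f(...)->Draw
        int depth = 0;
        while (i > 0) {
            --i;
            if (s[i] == ')') ++depth;
            else if (s[i] == '(' && --depth == 0) break;
        }
    }
    while (i > 0 && (std::isalnum((unsigned char)s[i - 1])
                     || s[i - 1] == '_' || s[i - 1] == '.'))
        --i;                                   // simple case: a plain symbol
    return i;
}

// Rewrite every A->Draw(args) into MyFunc(A, args). Each pass removes one
// "->Draw(" and inserts none, so the loop terminates, and rescanning from
// the start handles chained ->Draw calls for free.
std::string rewriteDraw(std::string s) {
    const std::string needle = "->Draw(";
    size_t pos;
    while ((pos = s.find(needle)) != std::string::npos) {
        size_t start = exprStart(s, pos);
        s = s.substr(0, start) + "MyFunc(" + s.substr(start, pos - start)
          + ", " + s.substr(pos + needle.size());
    }
    return s;
}

int main() {
    std::cout << rewriteDraw("A->Draw(B1, B2)") << "\n";
    std::cout << rewriteDraw("canvas()->Draw(x+1)->Draw(y)") << "\n";
}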
One way is to load the user code as a DLL (something like plugins).
This way, you don't need to recompile your actual application; just the user code will be compiled, and your application will load it dynamically.
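On POSIX systems, the loading side might look like this (a sketch; the plugin path and the plugin_main entry point are made-up names for the example; link with -ldl on Linux):

#include <dlfcn.h>
#include <iostream>

int main() {
    // Load the user's separately compiled plugin.
    void* handle = dlopen("./user_plugin.so", RTLD_NOW);
    if (!handle) { std::cerr << dlerror() << "\n"; return 1; }

    // Look up a C-linkage entry point (declared extern "C" on the plugin
    // side to avoid C++ name mangling).
    auto run = reinterpret_cast<int (*)()>(dlsym(handle, "plugin_main"));
    if (!run) { std::cerr << dlerror() << "\n"; return 1; }

    int result = run();
    dlclose(handle);
    return result;
}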
A little something that could be borrowed from IDEs: the idea would be to highlight function arguments (and maybe scoped variable names) inside function bodies. This is the default behaviour in some IDEs for C:
Well, if I were to place the cursor inside func, I would like to see the arguments foo and bar highlighted, to follow the algorithm's logic better. Notice that the similarly named foo in func2 wouldn't get highlighted. This luxury could be omitted, though...
For locally scoped variables, I would also like to have locally initialized variables highlighted:
Finally, to redemonstrate the luxury:
This is not so trivial to write. I used C to give the general idea; really, I could use this even better for Scheme/Clojure programming:
This should recognize let, loop, for, and doseq bindings, for instance.
My vimscript-fu isn't that strong; I suspect we would need to:
parse the arguments (probably not with regexps?) from the function definition under the cursor; this would be language-specific, of course, and my priority would be Clojure;
define a syntax region to cover the given function/scope only;
give the required syntax matches.
As a function, this could be mapped to a key (if very resource-intensive) or to CursorMoved if not too slow.
Okay, now. Has anyone written/found something like this? Do the vimscript gurus have an idea on how to actually start writing such a script?
Sorry about slight offtopicness and bad formatting. Feel free to edit/format. Or vote to close.
This is much harder than it sounds, and borderline-impossible with the vimscript API as it stands, because you don't just need to parse the file; if you want it to work well, you need to parse the file incrementally. That's why regular syntax files are limited to what you can do with regexes - when you change a few characters, vim can figure out what's changed in the syntax highlighting, without redoing the whole file.
The vim syntax highlighter is limited to dealing with regexes, but if you're hellbent on doing this, you can roll your own parser in vimscript, and have it generate a buffer-local syntax that refers to tokens in the file by line and column, using the \%l and \%c atoms in a regex. This would have to be rerun after every change. Unfortunately there's no autocmd for "file changed", but there is the CursorHold autocmd, which runs when you've been idle for a configurable duration.
One possible solution can be found here. It is not the best way, because it highlights every occurrence in the whole file and you have to give the command every time (the second problem can probably be avoided; I don't know about the first). Give it a look, though.
How do I create (and is this even possible) a regexp that parses a Pascal-like function declaration together with its body?
I've created this regexp:
function\s+(\w+)(\(((((var\s*)?(\w+)(\s*\,+\s*)?)+?\s*\:\s*(\w+)\s*\;?\s*?)\s*)+\))?\s*\:\s*(\w+)
which can pull out only function prototypes (it works only if there are no comments, so I strip comments before parsing), and I have no idea how to change it to make it pull out functions with their bodies. The problem is that there can be many "begin - end" blocks, so it is hard to find where a function ends.
Sorry, but you are using the wrong tool. Programming languages have a context-free structure that regular expressions simply cannot recognize reliably. Properly nested parentheses like { () [] } { } are an example of such a context-free structure for which no regular expression can check the proper nesting.
To solve the problem, you could use regular expression to break down program code into a stream of tokens and then use a (manually coded) top-down parser to check the structure of this token stream. To learn about this, consult any book about compiler design. Scanning (breaking into tokens) and parsing (checking structure) are always the first chapters. The Wikipedia entry for a top-down parser provides an example.
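For just the begin/end problem from the question, the two phases can be illustrated like this (a sketch; a real Pascal scanner also needs strings, comments, and many more token kinds):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Scanning: break the input into lowercase word tokens, ignoring the rest.
std::vector<std::string> scan(const std::string& src) {
    std::vector<std::string> tokens;
    std::string word;
    for (char c : src) {
        if (std::isalpha((unsigned char)c)) {
            word += (char)std::tolower((unsigned char)c);
        } else if (!word.empty()) {
            tokens.push_back(word);
            word.clear();
        }
    }
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}

// Parsing (reduced to the nesting check): count begin/end pairs -- the
// part that a regular expression cannot do for arbitrary depth.
bool checkNesting(const std::vector<std::string>& tokens) {
    int depth = 0;
    for (const std::string& t : tokens) {
        if (t == "begin") ++depth;
        else if (t == "end" && --depth < 0) return false;
    }
    return depth == 0;
}

int main() {
    std::string src = "function f(x: integer): integer; begin begin end; end;";
    std::cout << (checkNesting(scan(src)) ? "balanced" : "unbalanced") << "\n";
}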
I'm writing a C/C++/... build system (I understand this is madness ;)), and I'm having trouble designing my parser.
My "recipes" look like this:
global
{
SOURCE_DIRS src
HEADER_DIRS include
SOURCES bitwise.c \
framing.c
HEADERS \
ogg/os_types.h \
ogg/ogg.h
}
lib static ogg_static
{
NAME ogg
}
lib shared ogg_shared
{
NAME ogg
}
(This being based on the super simple libogg source tree)
# are comments, and \ is a "newline escape", meaning the line continues on the next line (see the QMake syntax). {} are scopes, like in C++, and global holds settings that apply to every "target". This is all background, and not that relevant... What I really don't know is how to work with my scopes. I will need to be able to have multiple scopes, and also a form of conditional processing, along the lines of:
win32:DEFINES NO_CRT_SECURE_DEPRECATE
The parsing function will need to know on what level of scope it's at, and call itself whenever the scope is increased. There is also the problem with the location of the braces ( global { or global{ or as in the example).
How could I go about this, using Standard C++ and STL? I understand this is a whole lot of work, and that's exactly why I need a good starting point. Thanks!
What I have already is the whole ifstream and internal string/stringstream storage, so I can read word by word.
I would suggest (and this is more or less right out of the compiler textbooks) that you approach the problem in phases. This breaks things down so that the problem is much more manageable in each phase.
Focus first on the lexer phase. Your lexing phase should take the raw text and give you a sequence of tokens, such as words and special characters. The lexer phase can take care of line continuations, and handle whitespace or comments as appropriate. By handling whitespace, the lexer can simplify your parser's task: you can write the lexer so that global{, global {, and even
global
{
will all yield two tokens: one representing global and one representing {.
Also note that the lexer can tack line and column numbers onto the tokens for use later if you hit errors.
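To make that concrete, a token type and lexer loop for this recipe format might look like the following (a sketch under my own naming; error handling omitted):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// One lexical token: its kind, its text, and where it came from
// (line/column make later error messages much more helpful).
struct Token {
    enum Kind { Word, LeftBrace, RightBrace, Newline } kind;
    std::string text;
    int line, column;
};

// Sketch of the lexer pass: "global{" yields the two tokens `global` and
// `{`, # comments are skipped, and "\<newline>" is swallowed so the
// statement continues on the next line.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    int line = 1, col = 1;
    for (size_t i = 0; i < src.size();) {
        char c = src[i];
        if (c == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++line; col = 1; i += 2;                    // line continuation
        } else if (c == '\n') {
            out.push_back({Token::Newline, "\n", line, col});
            ++line; col = 1; ++i;
        } else if (std::isspace((unsigned char)c)) {
            ++col; ++i;
        } else if (c == '#') {                          // comment to end of line
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (c == '{' || c == '}') {
            out.push_back({c == '{' ? Token::LeftBrace : Token::RightBrace,
                           std::string(1, c), line, col});
            ++col; ++i;
        } else {                                        // a bare word
            int startCol = col;
            std::string w;
            while (i < src.size() && !std::isspace((unsigned char)src[i])
                   && src[i] != '{' && src[i] != '}' && src[i] != '#') {
                w += src[i]; ++i; ++col;
            }
            out.push_back({Token::Word, w, line, startCol});
        }
    }
    return out;
}

int main() {
    for (const Token& t : lex("SOURCES bitwise.c \\\n framing.c\n"))
        std::cout << t.line << ":" << t.column << " '" << t.text << "'\n";
}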
Once you've got a nice stream of tokens flowing, work on your parsing phase. The parser should take that sequence of tokens and build an abstract syntax tree, which models the syntactic structures of your document. At this point, you shouldn't be worrying about ifstream and operator>>, since the lexer should have done all that reading for you.
You've indicated an interest in calling the parsing function recursively once you see a scope. That's certainly one way to go. The design decision you'll have to make repeatedly is whether you literally want to call the same parse function recursively (allowing for constructions like global { global { ... } }, which you may want to disallow syntactically), or whether you want to define a slightly (or even significantly) different set of syntax rules that apply inside a scope.
Once you find yourself having to vary the rules, the key is to reuse, by refactoring into functions, as much as you can between the different variants of the syntax. If you keep heading in this direction (using separate functions that represent the different chunks of syntax you want to deal with, and having them call each other, possibly recursively, where needed), you'll ultimately end up with what we call a recursive descent parser; a minimal sketch in this style follows below. The Wikipedia entry has a good simple example of one; see http://en.wikipedia.org/wiki/Recursive_descent_parser .
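For a flavor of the shape this takes, here is a minimal recursive-descent skeleton over a token stream like the one sketched earlier (the Tok type and all names are my own; settings handling and error recovery are elided):

#include <iostream>
#include <string>
#include <vector>

// Minimal token form for this sketch (see the lexer sketch above for how
// such tokens would be produced).
struct Tok {
    enum Kind { Word, LBrace, RBrace, End } kind;
    std::string text;
};

struct Parser {
    const std::vector<Tok>& toks;
    size_t pos = 0;

    Tok::Kind peek() const { return pos < toks.size() ? toks[pos].kind : Tok::End; }
    Tok take() { return toks[pos++]; }

    // recipe := ( header-words block )*
    void parseRecipe() {
        while (peek() == Tok::Word) {
            std::vector<std::string> header;        // e.g. "lib static ogg_static"
            while (peek() == Tok::Word) header.push_back(take().text);
            parseBlock(header);
        }
    }

    // block := '{' setting-words '}' -- the function that would call itself
    // (or a variant of itself) if nested scopes were allowed.
    void parseBlock(const std::vector<std::string>& header) {
        if (peek() != Tok::LBrace) { std::cerr << "expected {\n"; return; }
        take();
        while (peek() == Tok::Word) take();         // settings would be handled here
        if (peek() == Tok::RBrace) take();
        else std::cerr << "expected }\n";
        std::cout << "parsed block: " << header.front() << "\n";
    }
};

int main() {
    std::vector<Tok> toks = {
        {Tok::Word, "global"}, {Tok::LBrace, "{"},
        {Tok::Word, "SOURCE_DIRS"}, {Tok::Word, "src"}, {Tok::RBrace, "}"},
    };
    Parser p{toks};
    p.parseRecipe();
}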
If you find yourself really wanting to delve deeper into the theory and practice of lexers and parsers, I do recommend you get a good solid compiler textbook to help you out. The Stack Overflow topic mentioned in the comments above will get you started: Learning to write a compiler
boost::spirit is a good recursive descent parser generator that uses C++ templates as a language extension to describe parser and lexer. It works well for native C++ compilers, but won't compile under Managed C++.
Codeproject has a tutorial article that may help.
ANTLR (use ANTLRWorks); after that you can look at flex/bison and others like Lemon. There are many tools out there, but ANTLR and flex/bison should be enough. I personally like ANTLRWorks too much to recommend anything else.
LATER: With ANTLR you can generate parser/lexer code for a variety of languages.
Unless the point of the project is specifically learning how to write a lexer and a shift-reduce parser, I'd recommend using Flex and Bison, which will handle much of the parsing grunt-work for you. Writing the grammar and the semantic analysis will still be a whole lot of work, don't worry ;)