To keep compile times manageable and to allow grammar reuse, I've split my grammar into several sub-grammars that are invoked in sequence. One of them (call it the SETUP grammar) offers some configuration of the parser (via a symbols parser), so later sub-grammars logically depend on it (again via different symbols parsers). So, after SETUP is parsed, the symbols parsers of the following sub-grammars need to be altered.
My question is: how do I approach this efficiently while preserving loose coupling between the sub-grammars?
Currently I see only two possibilities:
The on_success handler of the SETUP grammar, which could do the work, but this would introduce quite some coupling.
After SETUP, parse everything else into a string, build up a new parser (from the altered symbols) and parse that string in a second step. This would add quite some overhead.
What I would like to have is an on_before_parse handler, which could be implemented by any grammar that needs to do some work before each parse. From my point of view, this would introduce less coupling, and such a setup hook for the parser could come in handy in other situations, too. Is something like this possible?
Update:
Sorry for being sketchy, that wasn't my intention.
The task is to parse an input I with some keywords like #task1 and #task2. But there will be cases where these keywords need to be different, say $$task1 and $$task2.
So the parsed file will start with
setup {
#task1=$$task1
#task2=$$task2
}
realwork {
...
}
Some code sketches: Given is a main parser, consisting of several (at least two) parsers.
template<typename Iterator>
struct MainParser : qi::grammar<Iterator, Skipper<Iterator>> {
    MainParser() : MainParser::base_type(start) {
        start = setup >> realwork;  // SETUP runs first, then the real work
    }

    Setup<Iterator>    setup;
    RealWork<Iterator> realwork;
    qi::rule<Iterator, Skipper<Iterator>> start;
};
Setup and RealWork are themselves parsers (my sub-parsers from above). During the setup part, some keywords of the grammar may be altered, so the setup part has a qi::symbols<char, keywords> rule. In the beginning these symbols will contain #task1 and #task2. After parsing the first part of the file, they contain $$task1 and $$task2.
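For reference, a minimal standalone sketch (the int value type is a placeholder for the real semantic value) of how qi::symbols entries can be swapped at runtime, which is what makes replacing #task1 with $$task1 mid-parse possible at all:

#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

int main() {
    qi::symbols<char, int> keywords;
    keywords.add("#task1", 1)("#task2", 2);    // initial spellings

    // After the setup block has been parsed, install the new spellings:
    keywords.remove("#task1")("#task2");
    keywords.add("$$task1", 1)("$$task2", 2);
}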
Since the keywords have changed, and since RealWork needs to parse I, it needs to know about the new keywords. So I have to transfer the symbols from Setup to RealWork during the parsing of the file.
The two approaches I see are:
Make Setup aware of RealWork and transfer the symbols from Setup to RealWork in the qi::on_success handler of Setup (bad: tight coupling).
Switch to two parsing steps. start of MainParser will look like
start = setup >> unparsed_rest
and there will be a second parser after MainParser. Schematically:
SymbolTable Table;
string Unparsed_Rest;
MainParser.parse(Input, (Unparsed_Rest, Table));
RealWorkParser.setupFromAlteredSymbolTable(Table);
RealWorkParser.parse(Unparsed_Rest);
Overhead of several parsing steps.
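For concreteness, a hedged sketch of the schematic above using qi::phrase_parse; it assumes MainParser exposes the altered table and the raw tail as members named table and unparsed_rest (these names are made up for illustration, and qi::space stands in for the question's Skipper):

#include <boost/spirit/include/qi.hpp>
#include <string>

namespace qi = boost::spirit::qi;

void run(std::string const& input) {
    using It = std::string::const_iterator;
    It f = input.begin(), l = input.end();

    MainParser<It> main_parser;                       // start = setup >> unparsed_rest
    qi::phrase_parse(f, l, main_parser, qi::space);   // phase 1: setup, capture rest

    RealWorkParser<It> real_work(main_parser.table);  // built from the altered symbols
    It rf = main_parser.unparsed_rest.begin(), rl = main_parser.unparsed_rest.end();
    qi::phrase_parse(rf, rl, real_work, qi::space);   // phase 2: re-parse the tail
}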
So, up to now, attributes do not come into play; it's just about changing the parser at parse time to handle several kinds of input files.
My hope is for a handler qi::on_before_parse, analogous to qi::on_success. The idea is that this handler would be triggered each time the parser starts parsing an input: just an interception at the beginning of parsing, like the existing interceptions on_success and on_error.
Sadly, you showed no code, and your description is a bit... sketchy. So here's a fairly generic answer that addresses some of the points I was able to distill from your question:
Separation of concerns
It sounds very much like you need to separate AST building from transformation/processing steps.
Parser composition
Of course you can compose grammars. Simply compose grammars as you would rules and hide the implementation of these grammars in any traditional way you would (pImpl idiom, const static internal rules, whatever fits the bill).
However, the composition usually doesn't require an 'event'-driven element: if you feel the need to parse in two phases, it sounds to me like you're just struggling to keep the overview, but recursive descent or PEG grammars are naturally well-suited to describing grammars like that in one swoop (or one pass, if you will).
However, if you find that
(a) your grammar gets complicated
(b) or you want to be able to selectively plug in sub-grammars depending on runtime features
You could consider
The Nabialek trick (I've shown/mentioned this on several occasions in my [tag:boost-spirit] answers on this site); see the sketch after this list
You could build rules dynamically (this is not readily recommended because you'll run into deadly traps having to do with copying Proto expression trees, which leads to dangling references). I have also shown some answers doing this on occasion:
Generating Spirit parser expressions from a variadic list of alternative parser expressions
C++ Boost qi recursive rule construction
Boost.Spirit.Qi: dynamically create "difference" parser at parse time
REPEAT: don't try this unless you know how to detect UB and fix things with Proto
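For the first option, here is a minimal sketch of the Nabialek trick with placeholder rule bodies: the symbol table yields a pointer to a rule, the pointer is stored in a local, and qi::lazy invokes whichever rule matched, so the dispatch table can be extended or altered at parse time.

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace qi = boost::spirit::qi;

template <typename It>
struct Commands : qi::grammar<It, qi::space_type> {
    Commands() : Commands::base_type(start) {
        task1 = qi::int_;                    // placeholder handler bodies
        task2 = qi::double_ >> qi::double_;
        handlers.add("#task1", &task1)("#task2", &task2);

        // The table's attribute (a rule pointer) is stored in the local _a
        // and dereferenced lazily; which rule runs is decided at parse time.
        start = *(handlers[qi::_a = qi::_1] >> qi::lazy(*qi::_a));
    }
    qi::rule<It, qi::space_type> task1, task2;
    qi::symbols<char, qi::rule<It, qi::space_type>*> handlers;
    qi::rule<It, qi::space_type, qi::locals<qi::rule<It, qi::space_type>*>> start;
};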
Hope these things help get you on track. If not, I suggest you come back with a concrete question. I'm much more at home with code than with 'ideas', because ideas often mean something different to you than they do to me.
Related
What is the proper way of handling C++ code blocks in Xtext/ANTLR?
We are writing an Xtext-based eclipse plugin for a DSL that supports adding C++ function-level code within well-defined scopes (mostly serial { /* ... */ } blocks) such as this:
module m {
chare c {
entry void foo() {
serial {
// C++ code block
}
}
}
}
See here for a more comprehensive example. This is then handed over to an external tool to handle further compilation/linking steps, so we don't generate any code from eclipse.
The issue here is how to handle these C++ code blocks, especially given that they may contain braces of their own. This is very similar to How to include Java Code Block in Xtext DSL? but for now we are content with just ignoring that block (i.e. not having content assist or syntax highlighting is not ideal but acceptable.)
In our bison/flex-based tool this is done by sharing a variable between the parser and lexer that toggles a "C++ parsing mode" within certain grammar rules that makes the lexer return a CPROGRAM token for everything except the relevant delimiters (e.g. braces.) The natural translation seems to have a custom ANTLR lexer that uses semantic predicates to the same effect, e.g.
RULE_NON_BRACES: {in_braces}? ~('{' | '}')+;
as the first lexing rule, but I cannot find how to access that global variable from the Xtext grammar since there doesn't seem to be a concept of "rule action" as in bison. There are other non-"serial" grammar contexts where C++ code is expected, so there needs to be some coordination between the parser and lexer.
Your question seems more focused on how the DSL lexer avoids getting lost in C++ code. The basic answer is that you need to match braces (i.e., ensure that they nest properly).
I don't know how you define an Xtext/ANTLR lexical rule to do that; I presume there is an ugly way to drop down into procedural code and start reading characters one by one. This may have some complications; your brace-matching logic may have to worry about various types of quoting in the C++ code. For instance,
{ // this } isn't a match
and
{ char x[]="} this isnt a match { either" }
Other C++ string quotes may make this even more difficult to see. What will you do about a C++ macro used like this?
{
#define rcb }
{ rcb
}
You'll probably have to make some special rules about how } is processed in the embedded C++ code, and your character-by-character scanner will have to know them.
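To illustrate (in plain C++ rather than Xtext/ANTLR) what such a character-by-character scanner has to handle, here is a hedged sketch that tracks brace depth while skipping string literals, character literals and comments; macros and raw strings are deliberately ignored, as discussed above:

#include <cstddef>
#include <string>

// Returns the index of the '}' matching the '{' at src[open],
// or std::string::npos if the braces are unbalanced.
std::size_t find_block_end(std::string const& src, std::size_t open) {
    int depth = 0;
    for (std::size_t i = open; i < src.size(); ++i) {
        char c = src[i];
        if (c == '"' || c == '\'') {                      // skip string/char literal
            for (++i; i < src.size() && src[i] != c; ++i)
                if (src[i] == '\\') ++i;                  // honour escapes
        } else if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i; // line comment
        } else if (c == '/' && i + 1 < src.size() && src[i + 1] == '*') {
            i += 2;                                       // block comment
            while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            ++i;
        } else if (c == '{') {
            ++depth;
        } else if (c == '}' && --depth == 0) {
            return i;                                     // matching close brace
        }
    }
    return std::string::npos;
}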
Rather than make this complicated, I think what you should do is pick a really unlikely sequence of characters in C++ as your termination, e.g.,
][[
which I believe cannot occur in C++ except in a string or comment, or
}}}
and simply use that. There is no need to match braces at all. In almost all cases, the C++ to drop in won't have to be touched; in the rare, rare case where it happens to contain that sequence, a trivial edit (inserting a space or line break) fixes it. Now your lexer rule is simple and can be expressed (I think) using your standard lexer.
If you go this way, I'd suggest you choose a corresponding opening sequence to introduce the C++ code, just to remind the reader that a funny sequence is required, e.g.,
serial {{{ <C++code> }}}
or
serial ]][ <C++code> ][[
With this convention, even my ugly macro example is easy:
serial {{{
{
#define rcb }
{ rcb
}
}}}
PS: This funny notational trick is called a "domain (notation) escape". The problem occurs in every system that allows one to mix multiple notations (granted, there are not that many in the wild, but I have one :). The sequence varies across languages/systems according to taste.
If you really cannot change the syntax and need to rely on matching curly braces, then you need to reimplement your flex-based solution in Java (e.g. using jflex) and make Xtext use that lexer.
I have covered that briefly in this blog post. It also contains a pointer to example code where I have used a jflex based lexer in Xtext.
I am currently writing a program that sits on top of a C++ interpreter. The user inputs C++ commands at runtime, which are then passed into the interpreter. For certain patterns, I want to replace the command given with a modified form, so that I can provide additional functionality.
I want to replace anything of the form
A->Draw(B1, B2)
with
MyFunc(A, B1, B2).
My first thought was regular expressions, but that would be rather error-prone, as any of A, B1, or B2 could be arbitrary C++ expressions. As these expressions could themselves contain quoted strings or parentheses, it would be quite difficult to match all cases with a regular expression. In addition, there may be multiple, nested forms of this expression.
My next thought was to call clang as a subprocess, use "-ast-dump" to get the abstract syntax tree, modify that, then rebuild it into a command to be passed to the C++ interpreter. However, this would require keeping track of any environment changes, such as include files and forward declarations, in order to give clang enough information to parse the expression. As the interpreter does not expose this information, this seems infeasible as well.
The third thought was to use the C++ interpreter's own internal parsing to convert the input to an abstract syntax tree, then build from there. However, this interpreter does not expose the AST in any way that I was able to find.
Are there any suggestions as to how to proceed, either along one of the stated routes, or along a different route entirely?
What you want is a Program Transformation System.
These are tools that generally let you express changes to source code, written in source level patterns that essentially say:
if you see *this*, replace it by *that*
but operating on Abstract Syntax Trees, so the matching and replacement process is far more trustworthy than what you get with string hacking.
Such tools have to have parsers for the source language of interest.
The source language being C++ makes this fairly difficult.
Clang sort of qualifies; after all, it can parse C++. OP objects that it cannot do so without all the environment context. To the extent that OP is typing (well-formed) program fragments (statements, etc.) into the interpreter, Clang may [I don't have much experience with it myself] have trouble figuring out what the fragment is (statement? expression? declaration? ...). Finally, Clang isn't really a PTS; its tree modification procedures are not source-to-source transforms. That matters for convenience but might not stop OP from using it; surface syntax rewrite rules are convenient, but you can always substitute procedural tree hacking with more effort. When there are more than a few rules, this starts to matter a lot.
GCC with Melt sort of qualifies in the same way that Clang does. I'm under the impression that Melt makes GCC at best a bit less intolerable for this kind of work. YMMV.
Our DMS Software Reengineering Toolkit with its full C++14 [EDIT July 2018: C++17] front end absolutely qualifies. DMS has been used to carry out massive transformations on large-scale C++ code bases.
DMS can parse arbitrary (well-formed) fragments of C++ without being told in advance what the syntax category is, and return an AST of the proper grammar nonterminal type, using its pattern-parsing machinery. [You may end up with multiple parses, e.g. ambiguities, that you'll have decide how to resolve, see Why can't C++ be parsed with a LR(1) parser? for more discussion] It can do this without resorting to "the environment" if you are willing to live without macro expansion while parsing, and insist the preprocessor directives (they get parsed too) are nicely structured with respect to the code fragment (#if foo{#endif not allowed) but that's unlikely a real problem for interactively entered code fragments.
DMS then offers a complete procedural AST library for manipulating the parsed trees (search, inspect, modify, build, replace) and can then regenerate surface source code from the modified tree, giving OP text to feed to the interpreter.
Where it shines in this case is that OP can likely write most of his modifications directly as source-to-source syntax rules. For his example, he can provide DMS with a rewrite rule (untested but pretty close to right):
rule replace_Draw(A:primary, B1:expression, B2:expression):
  primary -> primary
  "\A->Draw(\B1, \B2)"       -- pattern
  rewrites to
  "MyFunc(\A, \B1, \B2)";    -- replacement
and DMS will take any parsed AST containing the left hand side "...Draw..." pattern and replace that subtree with the right hand side, after substituting the matches for A, B1 and B2. The quote marks are metaquotes and are used to distinguish C++ text from rule-syntax text; the backslash is a metaescape used inside metaquotes to name metavariables. For more details of what you can say in the rule syntax, see DMS Rewrite Rules.
If OP provides a set of such rules, DMS can be asked to apply the entire set.
So I think this would work just fine for OP. It is a rather heavyweight mechanism to "add" to the package he wants to provide to a 3rd party; DMS and its C++ front end are hardly "small" programs. But then modern machines have lots of resources, so I think it's a question of how badly OP needs to do this.
Try modifying the headers to suppress the method; then, when compiling, you'll find the errors and will be able to replace all the call sites.
Since you have a C++ interpreter (such as CERN's ROOT), I guess you must use the compiler to intercept all the Draw calls. An easy and clean way to do that is to declare the Draw method as private in the headers, using some defines:
class ItemWithDrawMethod
{
....
public:
#ifdef CATCHTHEMETHOD
private:
#endif
void Draw(A,B);
#ifdef CATCHTHEMETHOD
public:
#endif
....
};
Then compile as:
gcc -DCATCHTHEMETHOD=1 yourfilein.cpp
In case the user wants to input complex algorithms into the application, what I suggest is to integrate a scripting language into the app, so that the user can write code (a function/algorithm in a defined way) that the app can execute in the interpreter to get the final results. Examples: Python, Perl, JS, etc.
Since you need C++ in the interpreter, http://chaiscript.com/ would be a suggestion.
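A hedged sketch of what embedding ChaiScript looks like; myFunc is a hypothetical C++ function exposed to user scripts:

#include <chaiscript/chaiscript.hpp>
#include <iostream>

int myFunc(int a, int b) { return a + b; }

int main() {
    chaiscript::ChaiScript chai;
    chai.add(chaiscript::fun(&myFunc), "myFunc");          // expose C++ to scripts
    std::cout << chai.eval<int>("myFunc(2, 3)") << '\n';   // run user-supplied code
}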
What happens when someone gets ahold of the Draw member function (auto draw = &A::Draw;) and then starts using draw? Presumably you'd want the same improved Draw-functionality to be called in this case too. Thus I think we can conclude that what you really want is to replace the Draw member function with a function of your own.
Since it seems you are not in a position to modify the class containing Draw directly, a solution could be to derive your own class from A and override Draw in there. Then your problem reduces to having your users use your new improved class.
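A minimal sketch of the derive-and-override idea, with hypothetical stand-ins for A, B1, B2 and the improved functionality:

#include <iostream>

struct B1 {}; struct B2 {};

struct A {
    virtual void Draw(B1, B2) { std::cout << "original Draw\n"; }
    virtual ~A() = default;
};

struct ImprovedA : A {
    void Draw(B1 b1, B2 b2) override {
        std::cout << "extra functionality\n";  // whatever the improved Draw adds
        A::Draw(b1, b2);                       // then the base implementation
    }
};

int main() {
    ImprovedA a;
    static_cast<A&>(a).Draw({}, {});  // virtual dispatch picks the override
}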
You may again consider the problem of automatically translating uses of class A to your new derived class, but this still seems pretty difficult without the help of a full C++ implementation. Perhaps there is a way to hide the old definition of A and present your replacement under that name instead, via clever use of header files, but I cannot determine whether that's the case from what you've told us.
Another possibility might be to use some dynamic linker hackery using LD_PRELOAD to replace the function Draw that gets called at runtime.
There may be a way to accomplish this mostly with regular expressions.
Since anything that appears after Draw( is already formatted correctly as parameters, you don't need to fully parse them for the purpose you have outlined.
Fundamentally, the part that matters is the "SYMBOL->Draw("
SYMBOL could be any expression that resolves to an object that overloads -> or to a pointer of a type that implements Draw(...). If you reduce this to two cases, you can short-cut the parsing.
For the first case, a simple regular expression that searches for any valid C++ symbol, something similar to "[A-Za-z_][A-Za-z0-9_.]*", followed by the literal expression "->Draw(". This will give you the portion that must be rewritten, since the code following this part is already formatted as valid C++ parameters.
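A hedged illustration of this first case with std::regex; the pattern is a simplification, and YourFunction stands in for the replacement described below:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string in = "A->Draw(B1, B2);";
    std::regex  re(R"(([A-Za-z_][A-Za-z0-9_.]*)\s*->\s*Draw\s*\()");
    std::cout << std::regex_replace(in, re, "YourFunction($1, ") << '\n';
    // prints: YourFunction(A, B1, B2);
}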
The second case is for complex expressions that return an overloaded object or pointer. This requires a bit more effort, but a short parsing routine to walk backward through just a complex expression can be written surprisingly easily, since you don't have to support blocks (blocks in C++ cannot return objects, since lambda definitions do not call the lambda themselves, and actual nested code blocks {...} can't return anything directly inline that would apply here). Note that if the expression doesn't end in ) then it has to be a valid symbol in this context, so if you find a ) just match nested ) with ( and extract the symbol preceding the nested SYMBOL(...(...)...)->Draw() pattern. This may be possible with regular expressions, but should be fairly easy in normal code as well.
As soon as you have the symbol or expression, the replacement is trivial, going from
SYMBOL->Draw(...
to
YourFunction(SYMBOL, ...
without having to deal with the additional parameters to Draw().
As an added benefit, chained function calls are parsed for free with this model, since you can recursively iterate over the code such as
A->Draw(B...)->Draw(C...)
The first iteration identifies the first A->Draw( and rewrites the whole statement as
YourFunction(A, B...)->Draw(C...)
which then identifies the second ->Draw with an expression "YourFunction(A, ...)->" preceding it, and rewrites it as
YourFunction(YourFunction(A, B...), C...)
where B... and C... are well-formed C++ parameters, including nested calls.
Without knowing the C++ version that your interpreter supports, or the kind of code you will be rewriting, I really can't provide any sample code that is likely to be worthwhile.
One way is to load user code as a DLL (something like plugins). This way, you don't need to recompile your actual application; just the user code will be compiled, and your application will load it dynamically.
I need to parse a C++ class file (.h) and extract the following information:
Function names
Return types
List of parameter types of each function
Assume that there is a special tag by which I can recognize whether a function needs to be parsed or not.
For eg.
#include <someHeader>
class Test
{
public:
Test();
void fun1();
// *Expose* //
void fun2();
};
So I need to parse only fun2().
I read the basic grammar here, but found it too complex to comprehend.
Q1. I can't make out how complex this task is. Can someone provide a simpler grammar for a function declaration to perform this parsing?
Q2. Is my approach right or should I consider using some library rather than reinventing?
Edit: Just to clarify, I don't have problem parsing, problem is more of understanding the grammar I need to parse.
A C++ header may include arbitrary C++ code. Hence, parsing the header might be as hard as parsing all kinds of C++ code.
Your task becomes easier, if you can make certain assumptions about your header file. For instance, if you always have an EXPOSE-tag in front of your function and the functions are always on a single line, you could first grep for those lines:
grep -A1 EXPOSE <files>
And then you could apply a regular expression to filter out the information you need.
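A hedged illustration of that regex step in C++; the pattern is an assumption that only works for simple one-line declarations like the ones in the question:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "void fun2(int a, float b);";
    // group 1: return type, group 2: function name, group 3: parameter list
    std::regex decl(R"(^\s*([\w:<>,\s*&]+?)\s+(\w+)\s*\(([^)]*)\)\s*;)");
    std::smatch m;
    if (std::regex_search(line, m, decl))
        std::cout << "return: " << m[1] << "\nname: " << m[2]
                  << "\nparams: " << m[3] << '\n';
}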
Nevertheless, I'd recommend using existing tools. This seems to be a tutorial on how to do it with clang and Python.
GCC XML is an open source tool that emits the AST (Abstract Syntax Tree). See this other answer where I posted about the usage I made of it.
You should consider using it only if you are proficient with (or keen to learn) an XML analyzer for inspecting the AST. It's a fairly complex structure...
You will in any case need to 'grep' for the comments identifying your required snippets, as comments are lost in the output XML.
If you are doing this just for documentation, Doxygen could be a good bet.
Either way it may give you some pointers as to how to do this.
I'm writing a C/C++/... build system (I understand this is madness ;)), and I'm having trouble designing my parser.
My "recipes" look like this:
global
{
SOURCE_DIRS src
HEADER_DIRS include
SOURCES bitwise.c \
framing.c
HEADERS \
ogg/os_types.h \
ogg/ogg.h
}
lib static ogg_static
{
NAME ogg
}
lib shared ogg_shared
{
NAME ogg
}
(This being based on the super simple libogg source tree)
# starts a comment, \ is a "newline escape", meaning the line continues on the next line (see QMake syntax). {} are scopes, like in C++, and global holds settings that apply to every "target". This is all background, and not that relevant... I really don't know how to work with my scopes. I will need to be able to have multiple scopes, and also a form of conditional processing, along the lines of:
win32:DEFINES NO_CRT_SECURE_DEPRECATE
The parsing function will need to know what level of scope it's at, and call itself whenever the scope is increased. There is also the problem of the location of the braces (global { or global{, or as in the example).
How could I go about this, using Standard C++ and STL? I understand this is a whole lot of work, and that's exactly why I need a good starting point. Thanks!
What I have already is the whole ifstream and internal string/stringstream storage, so I can read word by word.
I would suggest (and this is more or less right out of the compiler textbooks) that you approach the problem in phases. This breaks things down so that the problem is much more manageable in each phase.
Focus first on the lexer phase. Your lexing phase should take the raw text and give you a sequence of tokens, such as words and special characters. The lexer phase can take care of line continuations, and handle whitespace or comments as appropriate. By handling whitespace, the lexer can simplify your parser's task: you can write the lexer so that global{, global {, and even
global
{
will all yield two tokens: one representing global and one representing {.
Also note that the lexer can tack line and column numbers onto the tokens for use later if you hit errors.
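A minimal sketch of such a lexer (the token set and details are invented for illustration), showing how global{, global { and the two-line form all yield the same tokens:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token { enum Kind { Word, LBrace, RBrace } kind; std::string text; };

std::vector<Token> lex(std::string const& src) {
    std::vector<Token> out;
    for (std::size_t i = 0; i < src.size(); ) {
        char c = src[i];
        if (std::isspace((unsigned char)c)) { ++i; }                  // whitespace/newlines
        else if (c == '#')  { while (i < src.size() && src[i] != '\n') ++i; } // comment
        else if (c == '\\') { ++i; }                                  // line continuation
        else if (c == '{')  { out.push_back({Token::LBrace, "{"}); ++i; }
        else if (c == '}')  { out.push_back({Token::RBrace, "}"}); ++i; }
        else {                                                        // a word
            std::size_t start = i;
            while (i < src.size() && !std::isspace((unsigned char)src[i])
                   && src[i] != '{' && src[i] != '}') ++i;
            out.push_back({Token::Word, src.substr(start, i - start)});
        }
    }
    return out;
}

int main() {
    for (auto const& t : lex("global{ SOURCE_DIRS src }"))
        std::cout << t.text << '\n';   // prints: global { SOURCE_DIRS src }
}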
Once you've got a nice stream of tokens flowing, work on your parsing phase. The parser should take that sequence of tokens and build an abstract syntax tree, which models the syntactic structures of your document. At this point, you shouldn't be worrying about ifstream and operator>>, since the lexer should have done all that reading for you.
You've indicated an interest in calling the parsing function recursively once you see a scope. That's certainly one way to go. As you'll see, the design decision you'll have to repeatedly make is whether you literally want to call the same parse function recursively (allowing for constructions like global { global { ... } }, which you may want to disallow syntactically), or whether you want to define a slightly (or even significantly) different set of syntax rules that apply inside a scope.
Once you find yourself having to vary the rules: the key is to reuse, by refactoring into functions, as much stuff as you can reuse between the different variants of syntax. If you keep heading in this direction – using separate functions that represent the different chunks of syntax you want to deal with and having them call each other (possibly recursively) where needed – you'll ultimately end up with what we call a recursive descent parser. The Wikipedia entry has got a good simple example of one; see http://en.wikipedia.org/wiki/Recursive_descent_parser .
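And a hedged skeleton of the recursive descent step itself, reusing the Token type from the lexer sketch above (the Node shape and error handling are invented):

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Token { enum Kind { Word, LBrace, RBrace } kind; std::string text; };

struct Node {
    std::string name;                 // scope name, e.g. "global"
    std::vector<Node> children;       // nested scopes
    std::vector<std::string> words;   // plain setting tokens
};

Node parse_scope(std::vector<Token> const& t, std::size_t& i, std::string name) {
    Node scope{std::move(name), {}, {}};
    while (i < t.size() && t[i].kind != Token::RBrace) {
        if (t[i].kind != Token::Word)
            throw std::runtime_error("expected a word");
        std::string word = t[i++].text;
        if (i < t.size() && t[i].kind == Token::LBrace) {
            ++i;                                            // consume '{' and recurse
            scope.children.push_back(parse_scope(t, i, word));
            if (i >= t.size() || t[i].kind != Token::RBrace)
                throw std::runtime_error("missing '}'");
            ++i;                                            // consume '}'
        } else {
            scope.words.push_back(word);
        }
    }
    return scope;
}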
If you find yourself really wanting to delve deeper into the theory and practice of lexers and parsers, I do recommend you get a good solid compiler textbook to help you out. The Stack Overflow topic mentioned in the comments above will get you started: Learning to write a compiler
boost::spirit is a good recursive descent parser generator that uses C++ templates as a language extension to describe parser and lexer. It works well for native C++ compilers, but won't compile under Managed C++.
Codeproject has a tutorial article that may help.
ANTLR (use ANTLRWorks), after that you can look for FLEX/BISON and others like lemon. There are many tools out there but ANTLR and flex/bison should be enough. I personally like ANTLRWorks too much to recommend something else.
LATER: With ANTLR you can generate parser/lexer code for a variety of languages.
Unless the point of the project is specifically learning how to write a lexer and shift-reduce parser, I'd recommend using Flex and Bison, which will handle much of the parsing grunt-work for you. Writing the grammar and semantic analysis will still be a whole lot of work, don't worry ;)
I'm looking for a way to automatically insert a statement at the beginning of every function body. It should turn this
int Yada (int yada)
{
return yada;
}
into this
int Yada (int yada)
{
SOME_HEIDEGGER_QUOTE;
return yada;
}
but for all (or at least a big bunch of) syntactically legal C/C++ - function and method constructs.
Maybe you've heard of some Perl library that will allow me to perform these kinds of operations in a few lines of code.
My goal is to add a tracer to an old, but big C++ project in order to be able to debug it without a debugger.
Try AspectC++ (www.aspectc.org). You can define an aspect that will pick up every method execution.
In fact, the quickstart has pretty much exactly what you are after defined as an example:
http://www.aspectc.org/fileadmin/documentation/ac-quickref.pdf
If you build using GCC and the -pg flag, GCC will automatically issue a call to the mcount() function at the start of every function. In this function you can then inspect the return address to figure out where you were called from. This approach is used by the linux kernel function tracer (CONFIG_FUNCTION_TRACER). Note that this function should be written in assembler, and be careful to preserve all registers!
Also, note that this flag should be passed only in the compile phase, not the link phase, or GCC will add in the profiling libraries that normally implement mcount.
I would suggest using the gcc flag -finstrument-functions. Basically, it automatically calls a specific function ("__cyg_profile_func_enter") upon entry to each function, and another function ("__cyg_profile_func_exit") upon exit. Each hook is passed a pointer to the function being entered/exited and a pointer to the call site.
You can turn instrumenting off on a per-function or per-file basis... see the docs for details.
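A hedged example of the two hooks; the bodies just print raw addresses (a tool like addr2line can map them back to names), and the no_instrument_function attribute keeps the hooks from instrumenting themselves:

#include <cstdio>

extern "C" {
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* fn, void* call_site) {
    std::fprintf(stderr, "enter %p from %p\n", fn, call_site);
}
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* fn, void* call_site) {
    std::fprintf(stderr, "exit  %p from %p\n", fn, call_site);
}
}

int traced() { return 42; }

int main() { return traced() == 42 ? 0 : 1; }
// build: g++ -finstrument-functions example.cpp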
The feature goes back at least as far as version 3.0.4 (from February 2002).
This is intended to support profiling, but it does not appear to have side effects like -pg does (which compiles code suitable for profiling).
This could work quite well for your problem (tracing execution of a large program), but, unfortunately, it isn't as general purpose as it would have been if you could specify a macro. On the plus side, you don't need to worry about remembering to add your new code into the beginning of all new functions that are written.
There is no such tool that I am aware of. In order to recognise the correct insertion point, the tool would have to include a complete C++ parser - regular expressions are not enough to accomplish this.
But as there are a number of FOSS C++ parsers out there, such a tool could certainly be written - a sort of intelligent sed for C++ code. The biggest problem would probably be designing the specification language for the insert/update/delete operation - regexes are obviously not the answer, though they should certainly be included in the language somehow.
People are always asking here for ideas for projects - how about this for one?
I use this regex,
"(?<=[\\s:~])(\\w+)\\s*\\(([\\w\\s,<>\\[\\].=&':/*]*?)\\)\\s*(const)?\\s*{"
to locate the functions and add extra lines of code.
With that regex I also get the function name (group 1) and the arguments (group 2).
Note: you must filter out names like "while", "do", "for", and "switch".
This can be easily done with a program transformation system.
The DMS Software Reengineering Toolkit is a general purpose program transformation system, and can be used with many languages (C#, COBOL, Java, EcmaScript, Fortran, ..) as well as specifically with C++.
DMS parses source code (using a full language front end, in this case for C++), builds Abstract Syntax Trees, and allows you to apply source-to-source patterns to transform your code from one C++ program into another with whatever properties you wish. The transformation rule to accomplish exactly the task you specified would be:
domain Cpp.
insert_trace():function->function
"\visibility \returntype \fnname(int \parametername)
{ \body } "
->
"\visibility \returntype \fnname(int \parametername)
{ Heidigger(\CppString\(\fnname\),
\CppString\(\parametername\),
\parametername);
\body } "
The quote marks (") are not C++ quote marks; rather, they are "domain quotes", and indicate that the content inside them is C++ syntax (because we said "domain Cpp"). The \foo notations are meta-syntax.
This rule matches the AST representing the function, and rewrites that AST into the traced form. The resulting AST is then prettyprinted back into source form, which you can compile. You probably need other rules to handle other combinations of arguments; in fact, you'd probably generalize the argument processing to produce (where practical) a string value for each scalar argument.
It should be clear you can do a lot more than just logging with this, and a lot more than just aspect-oriented programming, since you can express arbitrary transformations and not just before-after actions.