Build parser from grammar at runtime - C++

Many (most) regular expression libraries for C++ allow creating the expression from a string at runtime. Is anyone aware of any C++ parser generators that allow a grammar (preferably BNF), represented as a string, to be fed into a generator at runtime? All the implementations I've found either require an explicit code generator to be run or require the grammar to be expressed via clever template meta-programming.

It should be pretty easy to build a recursive descent, backtracking parser that accepts a grammar as input. You can reduce all your rules to the following form (or act as if you have):
A = B C D ;
Parsing such a rule by recursive descent is easy: call a routine that corresponds to finding a B, then one that finds a C, then one that finds a D. Since you are building a general parser, you can always call a "parse_next_sentential_form(x)" function and pass the name of the desired form (terminal or nonterminal token) as x (e.g., "B", "C", "D").
In processing such a rule, the parser wants to produce an A, by finding a B, then C, then D. To find B (or C or D), you'd like to have an indexed set of rules in which all the left-hand sides are the same, so one can easily enumerate the B-producing rules, and recurse to process their content. If your parser gets a failure, it simply backtracks.
This won't be a lightning fast parser, but shouldn't be terrible if well implemented.
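To make the idea concrete, here is a minimal sketch in C++ (all names are ours, not from any library); the grammar lives in a plain data structure, so it could be filled in from a BNF string read at runtime. Naive backtracking like this does not terminate on left-recursive rules, and it re-parses shared prefixes, which is where the slowness comes from:
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>
struct Parser {
    // Each nonterminal maps to its alternatives; each alternative is a
    // sequence of symbol names (rules already reduced to "A = B C D ;" form).
    std::map<std::string, std::vector<std::vector<std::string>>> rules;
    std::vector<std::string> tokens;  // pre-lexed input
    // The "parse_next_sentential_form(x)" from above: x either names a
    // nonterminal (try each of its rules) or a terminal (match one token).
    bool parse(const std::string& symbol, std::size_t& pos) {
        auto it = rules.find(symbol);
        if (it == rules.end()) {  // terminal: match the next token literally
            if (pos < tokens.size() && tokens[pos] == symbol) { ++pos; return true; }
            return false;
        }
        for (const auto& alternative : it->second) {
            std::size_t saved = pos;
            bool ok = true;
            for (const auto& sym : alternative)
                if (!parse(sym, pos)) { ok = false; break; }
            if (ok) return true;
            pos = saved;  // this alternative failed: backtrack and try the next
        }
        return false;
    }
};
int main() {
    Parser p;
    p.rules["A"] = {{"B", "C", "D"}};  // A = B C D ;
    p.rules["B"] = {{"b"}};            // B = "b" ; (and so on)
    p.rules["C"] = {{"c"}};
    p.rules["D"] = {{"d"}};
    p.tokens = {"b", "c", "d"};
    std::size_t pos = 0;
    std::cout << (p.parse("A", pos) && pos == p.tokens.size()) << "\n";  // prints 1
}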
One could also use an Earley parser, which parses by creating states of partially-processed rules.
If you wanted it to be fast, I suppose you could simply take the guts of Bison and make it into a library. Then if you have grammar text or grammar rules (different entry points into Bison), you could start it and have it produce its tables in memory (which it must do in some form). Don't spit them out; simply build an LR parsing engine that uses them. Voila, on-the-fly efficient parser generation.
You have to worry about ambiguities and whether your grammar is LALR(1) if you do this; the previous two solutions work with any context-free grammar.

I am not aware of an existing library for this. However, if performance and robustness are not critical, you can spawn bison or any other tool that generates C code (via popen(3) or similar), spawn gcc on the generated code, link the result into a shared library, and load the library via dlopen(3)/dlsym(3). On Windows: a DLL and LoadLibrary() instead.
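For the record, a rough POSIX-only sketch of that pipeline (file names are illustrative; yyparse is the entry point bison actually generates, but a real version would also have to supply yylex, error handling, and cleanup):
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <dlfcn.h>
typedef int (*parse_fn)(void);  // bison's generated entry point is int yyparse(void)
parse_fn build_parser(const std::string& grammar_file) {
    // 1. Generate a C parser from the grammar.
    if (std::system(("bison -o gen_parser.c " + grammar_file).c_str()) != 0)
        throw std::runtime_error("bison failed");
    // 2. Compile it into a shared library (yylex must come from somewhere,
    //    e.g. be included in the grammar file's epilogue).
    if (std::system("cc -shared -fPIC -o libgenparser.so gen_parser.c") != 0)
        throw std::runtime_error("compile failed");
    // 3. Load the library and look up the generated entry point.
    void* lib = dlopen("./libgenparser.so", RTLD_NOW);
    if (!lib) throw std::runtime_error(dlerror());
    if (parse_fn fn = reinterpret_cast<parse_fn>(dlsym(lib, "yyparse")))
        return fn;
    throw std::runtime_error("yyparse not found");
}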

The easiest option is to embed some scripting language or even a full-blown VM (e.g., Mono), and run your generated parsers on top of it. Lua has quite a powerful JIT compiler, decent metaprogramming capabilities, and several Packrat implementations ready to use, so it would probably be the path of least effort.

I just came across this http://cocom.sourceforge.net/ammunition++-13.html
The last one is an Earley Parser and it appears to take the grammar as a string.
One of the functions is:
Public function `parse_grammar':
`int parse_grammar (int strict_p, const char *description)'
This function tunes the parser to the given grammar, which is supplied as the string `description'. The description is similar to a YACC one.
The actual code is at http://sourceforge.net/projects/cocom/
EDIT
A newer version is at https://github.com/vnmakarov/yaep

boost::spirit is a C++ parsing framework that can be used to construct parsers dynamically at runtime.

Related

What does a parser for C++ do until it can differentiate between comparisons and template instantiations?

After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:
struct foo
{
    void method() { a<b>c; }
    // a, b, c may be declared here
};
Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?
For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?
Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach commonly taken by widely-used compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens, and that is what must be saved in order to reparse the function body. I don't believe it is feasible to reparse the source code.)
It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.
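A toy illustration of that point, operating on raw text for brevity (a real compiler buffers the token stream, as noted above, and would also have to skip braces inside strings, character literals, and comments):
#include <cstddef>
#include <string>
// Given source text positioned at a function body's opening '{', find where
// the body ends. Braces must balance, so a simple depth counter is enough,
// even though we cannot yet tell how the angle brackets inside the body nest.
std::size_t body_end(const std::string& src, std::size_t open_brace) {
    int depth = 0;
    for (std::size_t i = open_brace; i < src.size(); ++i) {
        if (src[i] == '{') ++depth;
        else if (src[i] == '}' && --depth == 0) return i + 1;  // one past '}'
    }
    return std::string::npos;  // unbalanced input
}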
C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b, or c is true). Whether designing languages in this manner is a good idea is another question.
The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:
Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser
Clang knows when an identifier is a type by using a symbol table
C++ requires declarations to be visible throughout the class, even in code that appears before it
Clang gets around this by doing a full parsing/semantic analysis of the declaration, but leaving the definition for later; in other words, it's lexed but parsed after all the declared types are available
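For concreteness, both readings of the question's a<b>c; can be written as legal C++; which one the parser must commit to depends on what the symbol table says a is once the whole class has been seen:
// Reading 1: 'a' is a variable, so "a<b>c" is the expression (a < b) > c
// (written spaced out here for clarity; the token sequence is the same).
struct Expr {
    void method() { (void)((a < b) > c); }
    int a, b, c;  // declared after the use, but visible throughout the class
};
// Reading 2: 'a' is a template, so "a<b>c" declares c with type a<b>.
struct Decl {
    void method() { a<b> c; (void)c; }
    template <int N> struct a {};
    static const int b = 0;
};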

Boost.Spirit adding #include feature into calculator example

Following the Boost.Spirit compiler examples, I am migrating my Flex/Bison based calculator-like grammar to a Spirit based one. I want to add a feature: #include<another_input.inp>. I have defined the include_statement grammar successfully. Should I follow the way error handling is done, i.e. on_success(include_statement, annotation_function(...)): for each successful match of include_statement, get the new input file name and call phrase_parse() again? Or should I push/pop an input stack as in Flex/Bison?
Thanks.
Guessing, from the little information that is here, that you meant to ask whether you can reuse the same grammar instance or whether it would be better to instantiate a new instance to parse the includes: it depends.
You can do both.
When the grammar is stateless (hint: it usually is if you can use it const) there's no difference. Otherwise, prefer to instantiate a separate instance.
However,
the point is somewhat moot since it appears you already decided to parse the includes after parsing the main document (if I get your comment right)
there's always the danger of global state; even if the grammar object is const, you could potentially modify external state (e.g. using phx::ref from a semantic action), so this would be an issue regardless of whether you used separate grammar instances.
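If you go the re-entrant phrase_parse() route, a hedged sketch of the include handler might look like this (the Grammar type and the way it gets invoked are placeholders, not the question's actual code); per the above, this assumes the grammar is stateless and usable as const:
#include <fstream>
#include <iterator>
#include <string>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
// Called when include_statement matches, e.g. from a semantic action.
// Reuses the same const (stateless) grammar instance recursively.
template <typename Grammar>
bool parse_file(const std::string& path, const Grammar& g) {
    std::ifstream in(path);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    auto first = text.begin();
    bool ok = qi::phrase_parse(first, text.end(), g, qi::space);
    return ok && first == text.end();  // demand a full parse of the include
}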

what are computational macros and syntax macros

I was reading the paper "An Object oriented preprocessor fit for C++".
"http://www.informatik.uni-bremen.de/st/lehre/Arte-fakt/Seminar/papers/17/An%20Object-Oriented%20preprocessor%20fit%20for%20C++.pdf"
It discusses three different types of macros.
1. text macros // pretty much the same as the C preprocessor
2. computational macros // text replaced as a result of computation
3. syntax macros // text replaced by the syntax tree representing a linguistically consistent construct
Can somebody please explain the last two types of macros in more detail?
It says that inline functions and templates are examples of computational macros. How so?
Looking at Cheatham's original paper from 1966, which Willink and Muchnick's paper refers to, I'd summarize the different macro types like this:
Text macros do text replacements before scanning and parsing.
Syntactic macros are processed during scanning and parsing. Calling a syntax macro replaces the macro call with another piece of AST.
Computational macros can happen at any point after the AST has been built by the scanner and the parser. The point is that at this point we are no longer processing any text but instead manipulating the nodes of the AST i.e., we are dealing with objects that might already even have semantic information attached to them.
I'm no C++ internals expert, but I'd assume that inlining function calls and instantiating templates amount to manipulating the syntax tree before, while, and after it's been annotated with the semantic information needed to compile it properly, since both seem to require knowing a lot of things (like type info, and whether something is worth inlining) that are not yet known during scanning and parsing.
By 2. it sounds like they mean that some computation is done at compile time and the resulting instructions executed at runtime only involve the result. I wouldn't think inline functions particularly represent this, but template meta-programming does exactly this. Also constexpr in C++11.
I think 3. could also be represented by the use of templates. A template does represent a syntax tree, and instantiating it involves taking the generic syntax tree, filling in the parameterized, unknown bits, and using the resulting syntax tree.
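A small self-contained example of both points in standard C++11: the computation happens entirely at compile time, and only the result survives into the generated code.
// Point 2 (computational): constexpr folds the whole call to a constant.
constexpr int factorial(int n) { return n <= 1 ? 1 : n * factorial(n - 1); }
// Point 3 (syntactic, via templates): instantiation fills in the parameterized
// bits of a generic definition, computing the same value along the way.
template <int N> struct Factorial { static const int value = N * Factorial<N - 1>::value; };
template <>      struct Factorial<0> { static const int value = 1; };
static_assert(factorial(5) == 120, "evaluated at compile time");
static_assert(Factorial<5>::value == 120, "evaluated at compile time");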

Is there a tool to add the "override" identifier to existing C++ code

The task
I am trying to work out how best to add C++0x's override identifier to all existing methods that are already overrides in a large body of C++ code, without doing it manually.
(We have many, many hundreds of thousands of lines of code, and doing it manually would be a complete non-starter.)
Current idea
Our coding standards say that we should add the virtual keyword against all implicitly virtual methods in derived classes, even though strictly unnecessary (to aid comprehension).
So if I were to script the addition myself, I'd write a script that read all our headers, found all functions beginning with virtual, and insert override before the following semi-colon. Then compile it on a compiler that supports override, and fix all the errors in base classes.
But I'd really much rather not use this home-grown way, as:
it's obviously going to be tedious and error-prone.
not everyone has remembered, every time, to add the virtual keyword, so this method would miss some existing overrides
Is there an existing tool?
So, is there already a tool that parses C++ code, detects existing methods that are overrides, and appends override to their declarations?
(I am aware of static analysis tools such as PC-lint that warn about functions that look like they should be overrides. What I'm after is something that would actually munge our code, so that future errors in overrides will be detected at compile-time, rather than later on in static analysis.)
(In case anyone is tempted to point out that C++03 doesn't support 'override'... In practice, I'd be adding a macro, rather than the actual "override" identifier, to use our code on older compilers that don't support this feature. So after the identifier was added, I'd run a separate script to replace it with whatever macro we're going to use...)
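One plausible shape for that compatibility macro (hypothetical: the name OVERRIDE and the bare __cplusplus test are our choices, and real-world versions also special-case compilers such as MSVC that supported override before advertising C++11):
// Expands to 'override' only where the compiler supports it.
#if __cplusplus >= 201103L
#  define OVERRIDE override
#else
#  define OVERRIDE
#endif
class Base {
public:
    virtual void frob();
};
class Derived : public Base {
public:
    virtual void frob() OVERRIDE;  // under C++11, an error if not an override
};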
Thanks in advance...
There is a tool under development by the LLVM project called "cpp11-migrate" which currently has the following features:
convert loops to range-based for loops
convert null pointer constants (like NULL or 0) to C++11 nullptr
replace the type specifier in variable declarations with the auto type specifier
add the override specifier to applicable member functions
This tool is documented here and should be released as part of clang 3.3.
However, you can download the source and build it yourself today.
Edit
Some more info:
Status of the C++11 Migrator - a blog post, dated 2013-04-15
cpp11-migrate User’s Manual
Edit 2: 2013-09-07
"cpp11-migrate" has been renamed to "clang-modernize". For windows users, it is now included in the new LLVM Snapshot Builds.
Edit 3: 2020-10-07
"clang-modernize" has bee renamed to "Clang-Tidy".
Our DMS Software Reengineering Toolkit with its C++11-capable C++ Front End can do this.
DMS is a general purpose program transformation system for arbitrary programming languages; the C++ front end allows it to process C++. DMS parses, builds ASTs and symbol tables that are accurate (this is hard to do for C++), provides support for querying properties of the AST nodes and trees, allows procedural and source-to-source transformations on the tree. After all changes are made, the modified tree can be regenerated with comments retained.
Your problem requires that you find derived virtual methods and change them. A DMS source-to-source transformation rule to do that would look something like:
source domain Cpp. -- tells DMS the following rules are for C++
rule insert_virtual_keyword (n:identifier, a: arguments, s: statements):
method_declaration -> method_declaration =
" void \n(\a) { \s } " -> " virtual void \n(\a) { \s }"
if is_implicitly_virtual(n).
Such rules match against the syntax trees, so they can't mismatch to a comment, string, or whatever. The funny quotes are not C++ string quotes; they are meta-quotes to allow the rule language to know that what is inside them has to be treated as target language ("Cpp") syntax. The backslashes are escapes from the target language text, allowing matches to arbitrary structures e.g., \a indicates a need for an "a", which is defined to be the syntactic category "arguments".
You'd need more rules to handle cases where the function returns a non-void result, etc. but you shouldn't need a lot of them.
The fun part is implementing the predicate (returning TRUE or FALSE) controlling application of the transformation: is_implicitly_virtual. This predicate takes (an abstract syntax tree for) the method name n.
This predicate would consult the full C++ symbol table to determine what n really is. We already know it is a method from just its syntactic setting, but we want to know in what class context.
The symbol table provides the linkage between the method and class, and the symbol table information for the class tells us what the class inherits from, and for those classes, which methods they contain and how they are declared, eventually leading to the discovery (or not) that the parent class method is virtual. The code to do this has to be implemented as procedural code going against the C++ symbol table API. However, all the hard work is done; the symbol table is correct and contains references to all the other data needed. (If you don't have this information, you can't possibly decide algorithmically, and any code changes will likely be erroneous).
DMS has been used to carry out massive changes on C++ code in the past using program transformations.(Check the Papers page at the web site for C++ rearchitecting topics).
(I'm not a C++ expert, merely the DMS architect, so if I have minor detail wrong, please forgive.)
I did something like this a few months ago with about 3 MB worth of code and while you say that "doing it manually would be a complete non-starter," I think it is the only way. The reason is that you should be applying the override keyword to the prototypes that are intended to override base class methods. Any tool that adds it will put it on the prototypes that actually override base class methods. The compiler already knows which methods those are so adding the keyword doesn't change anything. (Please note that I am not terribly familiar with the new standard and I am assuming the override keyword is optional. Visual Studio has supported override since at least VS2005.)
I used a search for "virtual" in the header files to find most of them and I still occasionally find another prototype that is missing the override keyword.
I found two bugs by going through that.
Eclipse CDT has a working C++ parser and semantic utilities. The latest version IIRC also has markers for overriding methods.
It wouldn't require much code to write a plug-in which would base on that and rewrite the code to contain the override tags where appropriate.
One option is to enable the -Wsuggest-override compiler warning and then write a script that inserts the override keyword at the locations pointed to by the emitted warnings; a sketch of such a script follows.
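Sticking with C++ for consistency, a toy version of that script could scrape the file and line out of gcc's diagnostics; the exact warning text varies across compiler versions, so treat this purely as a sketch:
#include <iostream>
#include <sstream>
#include <string>
// Reads compiler output on stdin and lists where 'override' is suggested.
// Actually rewriting the files is left out; gcc's "file:line:col: warning:"
// prefix is assumed here, and it differs across compilers and versions.
int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        if (line.find("-Wsuggest-override") == std::string::npos) continue;
        std::istringstream fields(line);
        std::string file, lineno, column;
        std::getline(fields, file, ':');
        std::getline(fields, lineno, ':');
        std::getline(fields, column, ':');
        std::cout << "add 'override' in " << file << " at line " << lineno << "\n";
    }
}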

Static source code analysis with LLVM

I recently discovered the LLVM (Low Level Virtual Machine) project, and from what I have heard it can be used to perform static analysis on source code. I would like to know if it is possible to extract the different function calls made through function pointers (finding the caller function and the callee function) in a program.
I couldn't find this kind of information on the website, so it would be really helpful if you could tell me whether such a library already exists in LLVM, or point me in the right direction on how to build it myself (existing source code, references, tutorials, examples...).
EDIT:
With my analysis I actually want to extract caller/callee function calls. In the case of a function pointer, I would like to return a set of possible callees. Both caller and callee must be defined in the source code (this does not include third-party functions in a library).
I think that Clang's static analyzer (part of the LLVM project) is geared towards the detection of bugs, which means that the analyzer tries to compute possible values of some expressions (to reduce false positives) but sometimes gives up (in that case emitting no alarm, to avoid a deluge of false positives).
If your program is C only, I recommend you take a look at the Value Analysis in Frama-C. It computes supersets of possible values for any l-value at each point of the program, under some hypotheses that are explained at length here. Complexity in the analyzed program only means that the returned supersets are more approximated, but they still contain all the possible run-time values (as long as you remain within the aforementioned hypotheses).
EDIT: if you are interested in possible values of function pointers for the purpose of slicing the analyzed program, you should definitely take a look at the existing dependencies and slicing computations in Frama-C. The website doesn't have any nice slicing example; here is one from a discussion on the mailing-list.
You should take a look at Elsa. It is relatively easy to extend and lets you parse an AST fairly easily. It handles all of the parsing, lexing and AST generation and then lets you traverse the tree using the Visitor pattern.
class CallGraphGenerator : public ASTVisitor
{
    //...
    virtual bool visitFunction(Function *func);
    virtual bool visitExpression(Expression *expr);
};
You can then detect function declarations, and probably detect function pointer usage. Finally you could check the function pointers' declarations and generate a list of the declared functions that could have been called using that pointer.
In our project, we perform static source code analysis by converting LLVM bitcode into C code with the help of the llc program that ships with LLVM. Then we analyze the C code with CIL (C Intermediate Language); for the C language a lot of tools are available. The pitfall is that the code generated by llc is AWFUL and suffers from a great loss of precision. But still, it's one way to go.
Edit: in fact, I wouldn't recommend that anyone do it like this. But still, just for the record...
I think your question is flawed. The title says "static source code analysis", yet your underlying goal appears to be the construction of (part of) a call graph, including calls through a function pointer. The essence of function pointers is that you cannot know their values at compile time, i.e. at the point where you do static source code analysis. Consider this bit of code:
void (*pFoo)() = GetFoo();
pFoo();
Static code analysis cannot tell you what GetFoo() returns at runtime, although it might tell you that the result is subsequently used for a function call.
Now, what values could GetFoo() possibly return? You simply can't say this in general (it's equivalent to solving the halting problem). You will be able to guess some trivial cases. The guessable percentage will of course go up depending on how much effort you are willing to invest.
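That said, the mechanical part (finding the call sites) is easy at the LLVM IR level; it is only the "which functions might pFoo hold" part that needs the heavyweight analysis. A sketch against LLVM's C++ API (modern LLVM; minor API details vary between releases):
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
// Walk a module and separate direct calls from calls through a pointer.
// For indirect calls getCalledFunction() returns null; computing the set of
// possible callees is exactly the hard part discussed above.
void listCalls(llvm::Module &M) {
    for (llvm::Function &F : M)
        for (llvm::BasicBlock &BB : F)
            for (llvm::Instruction &I : BB)
                if (auto *CI = llvm::dyn_cast<llvm::CallInst>(&I)) {
                    if (llvm::Function *Callee = CI->getCalledFunction())
                        llvm::errs() << F.getName() << " -> "
                                     << Callee->getName() << "\n";
                    else
                        llvm::errs() << F.getName() << " -> (indirect)\n";
                }
}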
The DMS Software Reengineering Toolkit provides various types of control, data flow, and global points-to analyzers for large systems of C code, and constructs call graphs using that global points-to analysis (with the appropriate conservative assumptions). More discussion and examples of the analyses can be found at the web site.
DMS has been tested on monolithic systems of C code with 25 million lines. (The call graph for this monster had 250,000 functions in it).
Engineering all this machinery from basic C ASTs and symbol tables is a huge amount of work; been there, done that. You don't want to do this yourself if you have something else to do with your life, like implement other applications.