what are computational macros and syntax macros - c++

I was reading the paper "An Object oriented preprocessor fit for C++".
"http://www.informatik.uni-bremen.de/st/lehre/Arte-fakt/Seminar/papers/17/An%20Object-Oriented%20preprocessor%20fit%20for%20C++.pdf"
It discusses three different types of macros.
text macros. // pretty much the same as C preprocessor
computational macros // text replaced as a result of computation
syntax macros. // text replaced by the syntax tree representing a linguistically consistent construct.
Can somebody please explain the last two types of macros in an elaborate way?
It says that inline functions and templates are examples of computational macros; how?

Looking at the original Cheatham paper from 1966 that Willink and Muchnick's paper refers to, I'd summarize the different macro types like this:
Text macros do text replacements before scanning and parsing.
Syntactic macros are processed during scanning and parsing. Calling a syntax macro replaces the macro call with another piece of AST.
Computational macros can happen at any point after the AST has been built by the scanner and the parser. The point is that we are no longer processing any text, but instead manipulating the nodes of the AST, i.e. we are dealing with objects that might already have semantic information attached to them.
I'm no C++ internals expert, but I'd assume that inlining function calls and instantiating templates amounts to manipulating the syntax tree before, while, and after it has been annotated with the semantic information necessary to compile it properly, since both seem to require knowing a lot of things (such as type information, or whether something is a good candidate for inlining) that are not yet known during scanning and parsing.

By 2. it sounds like they mean that some computation is done at compile time, so the instructions executed at runtime only involve the result. I wouldn't say inline functions particularly represent this, but template metaprogramming does exactly this, as does constexpr in C++11.
I think 3. could also be represented by the use of templates. A template does represent a syntax tree, and instantiating it involves taking the generic syntax tree, filling in the parameterized, unknown bits, and using the resulting syntax tree.
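To make the distinction concrete, here is a small C++ illustration of my own (not from the paper): the #define is a pure text macro substituted before parsing, while the constexpr function and the template do their work on the compiler's internal representation after parsing.

#include <iostream>

// Text macro: the preprocessor substitutes characters before parsing;
// the compiler never sees SQUARE, only the replaced text.
#define SQUARE(x) ((x) * (x))

// Computational flavour: the value is computed at compile time on the
// compiler's internal representation, not by text substitution.
constexpr int square(int x) { return x * x; }

// Template: instantiation fills in the parameterized parts of an already
// parsed construct, which is close to what the paper calls a syntax macro.
template <typename T>
T twice(T x) { return x + x; }

int main()
{
    constexpr int a = square(4);   // forced compile-time evaluation
    std::cout << SQUARE(3) << ' ' << a << ' ' << twice(5) << '\n';  // prints: 9 16 10
}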

Related

What does a parser for C++ do until it can differentiate between comparisons and template instantiations?

After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:
struct foo
{
    void method() { a<b>c; }
    // a, b and c may be declared here
};
Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?
For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?
Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach commonly taken by widely used compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens, and that is what must be saved in order to reparse the function body. I don't believe it is feasible to reparse the source code.)
It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.
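As a rough sketch of that brace-balancing idea (the Token type and function name are mine, not taken from any real compiler), the body's tokens can simply be buffered until the braces balance and re-parsed later:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type; a real compiler would also carry source locations.
struct Token { std::string text; };

// Collect the tokens of a function body, starting at the '{' at index i,
// by counting braces only; how angle brackets nest is irrelevant here,
// which is exactly the point made above. Advances i past the closing '}'.
std::vector<Token> buffer_body(const std::vector<Token>& toks, std::size_t& i)
{
    std::vector<Token> body;
    int depth = 0;
    do {
        if (toks[i].text == "{") ++depth;
        else if (toks[i].text == "}") --depth;
        body.push_back(toks[i]);
        ++i;
    } while (depth > 0 && i < toks.size());
    return body;    // re-parse these tokens once the class definition is complete
}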
C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b or c is true). Whether designing languages in this manner is a good idea is another question.
The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:
Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser
Clang knows when an identifier is a type by using a symbol table
C++ requires declarations to be visible throughout the class, even in code that appears before it
Clang gets around this by doing a full parsing/semantic analysis of the declaration, but leaving the definition for later; in other words, it's lexed but parsed after all the declared types are available
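For instance (a minimal example of my own, not taken from the article), the following is valid precisely because the body of f is only analysed once the whole class is known:

struct S {
    int f() { return x + g(); }  // uses x and g(), which are declared below
    int x = 1;
    int g() { return 2; }
};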

Boost.Spirit adding #include feature into calculator example

Following the Boost.Spirit compiler examples, I am migrating my Flex/Bison based calculator-like grammar to a Spirit based one. I want to add a feature: #include<another_input.inp>. I have defined the include_statement grammar successfully. Should I follow the approach used for error handling, i.e. on_success(include_statement, annotation_function(...)): for each successful match of include_statement, get the new input file name and call phrase_parse() again? Or should I push/pop an input stack, as with Flex/Bison?
Thanks.
Guessing, from the little information here, that you meant to ask whether you can reuse the same grammar instance or whether it would be better to instantiate a new one to parse the includes: it depends.
You can do both.
When the grammar is stateless (hint: it usually is if you can use it const) there's no difference. Otherwise, prefer to instantiate a separate instance.
However,
the point is somewhat moot since it appears you already decided to parse the includes after parsing the main document (if I get your comment right)
there's always the danger of global state; even if the grammar object is const, you could potentially modify external state (e.g. using phx::ref from a semantic action), so this could be an issue regardless of whether you use separate grammar instances.
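For what it's worth, here is a minimal sketch of the recursive approach, assuming a stateless (const-usable) grammar; parse_file and the way a semantic action would trigger the recursion are hypothetical, not taken from the Spirit examples:

#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <sstream>
#include <string>

namespace qi = boost::spirit::qi;

template <typename Grammar>
bool parse_file(const std::string& filename, const Grammar& g)
{
    // Read the whole file into memory so we can parse over iterators.
    std::ifstream in(filename);
    std::stringstream ss;
    ss << in.rdbuf();
    std::string source = ss.str();

    auto first = source.cbegin();
    auto last  = source.cend();
    // A semantic action attached to include_statement could extract the
    // file name and call parse_file() again, mimicking Flex/Bison's
    // push/pop of the input stack via plain recursion.
    bool ok = qi::phrase_parse(first, last, g, qi::space);
    return ok && first == last;
}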

How to identify which forms are macros and which are functions while looking at a Clojure code?

Lisp/Clojure code have consistency in their syntax and it is a plus point as one doesn't need to understand various different constructs.
But at times it is easier to understand a piece of code just from the distinct syntax being used, e.g. this is a switch case or this is the pattern-matching construct, without actually reading the text.
I started out with Clojure a couple of months ago and I have realized I can't understand the code without reading the name of the form and then googling whether it is a macro or a function and how it works.
So it turns out that a piece of Clojure code, irrespective of the uniformity of the syntax, isn't uniform.
It may look like a function, but if it is actually a macro then it might not be evaluating all its arguments.
Is there a naming convention or indentation style that all macros use, so it is easier for someone to grasp by the name what is going on?
The most useful intuition in my opinion comes from understanding the purpose of a given operator / Var. Well-designed macros simply could not be written as functions and still offer the same functionality with the same syntax, for if they could, they would in fact be written as functions (see the "well-designed" part above!).1 So, if you're dealing with a construct which couldn't possibly be a regular function, then you know it isn't; otherwise it likely is.
Additionally, the usual ways of learning about the Vars exported by a library tell you whether you're dealing with a macro or a function up front. That is true of doc ((doc foo) says that foo is a macro near the top of its output if that is indeed the case), source (since it gives you the entire code) and M-. (jump to definition in Emacs with nrepl.el or swank-clojure; M-, jumps back). Documentation may be expected to mention what is a macro and what isn't (except that's not necessarily true of docstrings, since all usual ways of accessing a docstring already tell you whether you're dealing with a macro, as explained above).
If you're skimming a body of code with the intention of forming a rough understanding of what it probably does on the assumption that the various operators perform the functions suggested by their names, then either (1) the names are suggestive enough and you get an idea of what's intended by the code, so you don't even need to care which operators happen to be macros, or (2) the names are not suggestive enough, so you'll need to dive into the docs or the source for some of the operators anyway, and then the first thing you'll learn is which of them are registered as macros.
Finally, there is no single naming style for macros, although there are certain conventions specific to particular use cases. For example with-foo-style constructs tend to be convenience macros whose purpose is to simplify dealing with resources of type foo; dofoo-style constructs tend to be macros which take a body of expressions to be executed (how many times and with which additional context set up depends on the macro; the most basic member of this family, do, is actually a special form rather than a macro); deffoo-style constructs introduce new Vars or type-like entities.
It's worth pointing out that similar patterns are sometimes broken. For instance, most threading constructs (-> & Co.) are macros, but xml-> from clojure.data.zip.xml is a function. That makes perfect sense when one considers the functionality provided, which brings us back to the point about the purpose of an operator being the most useful source of intuition.
1 There might be some exceptions to this rule. One would expect these to be documented. Some projects are of course not documented at all (or very nearly so); here the issue goes away completely, since one must go to the source to make sense of things anyway.
There are two attributes that typically distinguish a macro (or sometimes special form) from a function:
When the form does some sort of binding (i.e. declaring new identifiers for later use)
When some of the arguments are evaluated lazily
Examples of the first case are let, letfn, binding and with-local-vars. Strangely though, defn is defined as a function, but I'm pretty sure it has something to do with Clojure's bootstrapping process (defn is defined before defmacro is defined).
Examples of the second would be and, or and lazy-seq. In all these constructs, the arguments are evaluated lazily by either putting them in conditional branches (like if) or moving them inside a function body.
Both of those attributes are really just manifestations of the macro manipulating the Clojure syntax. I don't think the threading macros (-> and ->>) fit very well into either of those categories, but the nil-safe versions (-?> and -?>>) kind of fall under having lazy arguments.
As far as I know there is no enforced naming convention.
As a rule of thumb, functions are preferred wherever possible, but macros can sometimes be spotted when they follow the pattern def<something> for setting up a something or with-<resource> for doing something with an open resource.
Because of this, you may find Clojure's doc macro helpful. It will tell you whether a form is a macro/function/special form, as well as give its arg list and doc string (if present). For example
(use 'clojure.repl)
(doc and)
Will print the following to the repl.
clojure.core/and
([] [x] [x & next])
Macro
Evaluates exprs one at a time, from left to right. If a form
returns logical false (nil or false), and returns that value and
doesn't evaluate any of the other expressions, otherwise it returns
the value of the last expr. (and) returns true.
Some editors (e.g. emacs) will provide this documentation as a pop-up or on a key combination, which makes accessing it (and reading) much faster.

Build parser from grammar at runtime

Many (most) regular expression libraries for C++ allow for creating the expression from a string during runtime. Is anyone aware of any C++ parser generators that allow for feeding a grammar (preferably BNF) represented as a string into a generator at runtime? All the implementations I've found either require an explicit code generator to be run or require the grammar to be expressed via clever template meta-programming.
It should be pretty easy to build a recursive descent, backtracking parser that accepts a grammar as input. You can reduce all your rules to the following form (or act as if you have):
A = B C D ;
Parsing such a rule by recursive descent is easy: call a routine that corresponds to finding a B, then one that finds a C, then one that finds a D. Given you are doing a general parser, you can always call a "parse_next_sentential_form(x)" function, and pass the name of the desired form (terminal or nonterminal token) as x (e.g., "B", "C", "D").
In processing such a rule, the parser wants to produce an A, by finding a B, then C, then D. To find B (or C or D), you'd like to have an indexed set of rules in which all the left-hand sides are the same, so one can easily enumerate the B-producing rules, and recurse to process their content. If your parser gets a failure, it simply backtracks.
This won't be a lightning fast parser, but shouldn't be terrible if well implemented.
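A minimal sketch of such a parser (the names and the toy grammar are mine, purely illustrative, and left recursion is not handled):

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Symbols   = std::vector<std::string>;                    // one right-hand side, e.g. {"B", "C", "D"}
using Grammar   = std::map<std::string, std::vector<Symbols>>; // nonterminal -> alternatives
using TokenList = std::vector<std::string>;

// Try to derive `symbol` starting at token index `pos`. On success, advance
// `pos` past the consumed tokens; on failure, leave it untouched (backtracking).
bool parse_symbol(const Grammar& g, const TokenList& tokens,
                  const std::string& symbol, std::size_t& pos)
{
    auto it = g.find(symbol);
    if (it == g.end()) {                        // terminal: match the token text
        if (pos < tokens.size() && tokens[pos] == symbol) { ++pos; return true; }
        return false;
    }
    for (const Symbols& rhs : it->second) {     // nonterminal: try each alternative
        std::size_t saved = pos;
        bool ok = true;
        for (const std::string& s : rhs)
            if (!parse_symbol(g, tokens, s, pos)) { ok = false; break; }
        if (ok) return true;
        pos = saved;                            // backtrack, try next alternative
    }
    return false;
}

int main()
{
    // expr = term "+" expr | term ;   term = "x" | "y" ;
    Grammar g = { {"expr", {{"term", "+", "expr"}, {"term"}}},
                  {"term", {{"x"}, {"y"}}} };
    TokenList tokens = {"x", "+", "y"};
    std::size_t pos = 0;
    bool ok = parse_symbol(g, tokens, "expr", pos) && pos == tokens.size();
    std::cout << (ok ? "accepted\n" : "rejected\n");
}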
One could also use an Earley parser, which parses by creating states of partially-processed rules.
If you wanted it to be fast, I suppose you could simply take the guts of Bison and make it into a library. Then if you have grammar text or grammar rules (different entry points into Bison), you could start it and have it produce its tables in memory (which it must do in some form). Don't spit them out; simply build an LR parsing engine that uses them. Voila, on-the-fly efficient parser generation.
You have to worry about ambiguities and the LALR(1)ness of your grammar if you do this; the previous two solutions work with any context free grammar.
I am not aware of an existing library for this. However, if performance and robustness are not critical, you can spin off bison or any other tool that generates C code (via popen(3) or similar), spin off gcc on the generated code, link it into a shared library, and load the library via dlopen(3)/dlsym(3). On Windows, use a DLL and LoadLibrary() instead.
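A rough sketch of that generate-compile-load pipeline (the file names, the exported symbol parse_input, and its signature are all placeholders; a real setup depends on how the generated parser is wrapped):

#include <cstdlib>
#include <dlfcn.h>
#include <iostream>

int main()
{
    // 1. Generate and compile the parser (error checking omitted).
    std::system("bison -o gen_parser.c grammar.y");
    std::system("gcc -shared -fPIC -o libgenparser.so gen_parser.c");

    // 2. Load the freshly built parser and look up its entry point.
    void* lib = dlopen("./libgenparser.so", RTLD_NOW);
    if (!lib) { std::cerr << dlerror() << '\n'; return 1; }

    using parse_fn = int (*)(const char*);
    auto parse = reinterpret_cast<parse_fn>(dlsym(lib, "parse_input"));
    if (parse) std::cout << "parse result: " << parse("1 + 2 * 3") << '\n';

    dlclose(lib);
}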
The easiest option is to embed some scripting language or even a full-blown VM (e.g., Mono) and run your generated parsers on top of it. Lua has quite a powerful JIT compiler, decent metaprogramming capabilities, and several Packrat implementations ready to use, so it would probably be the least-effort route.
I just came across this http://cocom.sourceforge.net/ammunition++-13.html
The last one is an Earley Parser and it appears to take the grammar as a string.
One of the functions is:
Public function `parse_grammar'
`int parse_grammar (int strict_p, const char *description)'
is another function, which tunes the parser to the given grammar.
The grammar is given by the string `description'.
The description is similar to a YACC one.
The actual code is at http://sourceforge.net/projects/cocom/
EDIT
A newer version is at https://github.com/vnmakarov/yaep
boost::spirit is a C++ parsing framework that can be used to construct parsers dynamically at runtime.

C++ vs. D , Ada and Eiffel (horrible error messages with templates)

One of the problems with C++ is the horrible error messages we get from code that makes intensive use of templates and template metaprogramming. Concepts were designed to solve this problem, but unfortunately they will not be in the next standard.
I'm wondering, is this problem common to all languages that support generic programming? Or is something wrong with C++ templates?
Unfortunately I don't know any other language that supports generic programming (Java and C# generics are too simplified and not as powerful as C++ templates).
So I'm asking: do D, Ada, and Eiffel templates (generics) produce such ugly error messages too? Is it possible to have a language with a powerful generic programming paradigm but without ugly error messages? And if so, how do these languages solve this problem?
Edit: for downvoters. I really love C++ and templates. I'm not saying that templates are bad. Actually I'm a big fan of generic programming and template metaprogramming. I'm just asking why I'm getting such ugly error messages from compilers.
In general, I've found Ada compiler error messages for generics not significantly more difficult to read than any other Ada compiler error messages.
C++ template error messages, on the other hand, are notorious for being error novels. The main difference, I think, is the way C++ does template instantiation. C++ templates are much more flexible than Ada generics; they are so flexible they are almost like a macro preprocessor. The clever folks in Boost have used this to implement things like lambdas and even whole other languages.
Because of that flexibility, the entire template hierarchy basically has to be compiled anew every time its particular permutation of template parameters is first encountered. Thus issues that resolve down to incompatibilities several layers down an API end up being presented to the poor API client to decipher.
In Ada, generics are actually strongly typed and provide full information hiding to the client, just like normal packages and subroutines do. So if you do get an error message, it typically references just the one generic you are trying to instantiate, not the entire hierarchy used to implement it.
So yes, C++ template error messages are way worse than Ada's.
Now debugging is a different story entirely...
The problem, at heart, is that error recovery is difficult, whatever the context.
And when you factor in C and C++'s horrid grammars, you can only wonder that error messages are not worse than they are! I am afraid that the C grammar was designed by people who didn't have a clue about the essential properties of a grammar, one of them being that the less reliance on context the better, and the other being that you should strive to make it as unambiguous as possible.
Let us illustrate a common error: forgetting a semi-colon.
struct CType {
    int a;
    char b;
}
foo
bar() { /**/ }
Okay, so this is wrong; where should the missing semi-colon go? Well, unfortunately it's ambiguous: it can go either before or after foo because:
C considers it normal to declare a variable in stride after defining a struct
C considers it normal not to specify a return type for a function (in which case it defaults to int)
If we reason about it, we can see that:
if foo names a type, then it belongs to the function declaration
if not, it probably denotes a variable... unless of course we made a typo and it was meant to be written fool, which happens to be a type :/
As you can see, error recovery is downright difficult, because we need to infer what the writer meant, and the grammar is far from being receptive. It is not impossible though, and most errors can indeed be diagnosed more or less correctly, and even recovered from... it just takes considerable effort.
It seems that the people working on gcc are more interested in producing fast code (and I mean fast; search for the latest benchmarks on gcc 4.6) and adding interesting features (gcc already implements most, if not all, of C++0x) than in producing easy-to-read error messages. Can you blame them? I can't.
Fortunately there are people who think that accurate error reporting and good error recovery are a very worthy goal, and some of those have been working on CLang for quite a bit, and they are continuing to do so.
Some nice features, off the top of my head:
Terse but complete error messages, which include the source ranges to expose exactly where the error emanated from
Fix-It notes when it's obvious what was meant
In which case the compiler parses the rest of the file as if the fix had been there already, instead of spewing lines upon lines of gibberish
(recent) avoiding including the include stack for notes, to cut down on the cruft
(recent) trying to expose only the template parameter types that the developer actually wrote, and preserving typedefs (thus talking about std::vector<Name> instead of std::vector<std::basic_string<char, std::allocator<char> >, std::allocator<std::basic_string<char, std::allocator<char> > > >, which makes all the difference)
(recent) recovering correctly from a missing template keyword in a call to a template method from within another template method
But each of those has required several hours to days of work.
They certainly didn't come for free.
Now, Concepts should (normally) have made our lives easier. But they were mostly untested, and so it was deemed preferable to remove them from the draft. I must say I am glad for this. Given C++'s relative inertia, it's better not to include features that haven't been thoroughly reviewed, and the concept maps didn't really thrill me. Neither did they thrill Bjarne or Herb, it seems, as they said they would be rethinking Concepts from scratch for the next standard.
The article Generic Programming outlines many of the pros and cons of generics in several languages, including Ada in particular. Although lacking template specialization, all Ada generic instances are "equivalent to the instance declaration…immediately followed by the instance body". As a practical matter, error messages tend to occur at compile-time, and they typically represent familiar violations of type-safety.
D has two features to improve the quality of template error messages: Constraints and static assert.
import std.range;   // isRandomAccessRange
import std.meta;    // allSatisfy

// Use constraints to only allow a function to operate on random access
// ranges as defined in std.range. If something that doesn't satisfy this
// is passed, the compiler will error before even trying to instantiate
// fun().
void fun(R)(R range) if (isRandomAccessRange!(R)) {
    // Do stuff.
}

// Use static assert to check a high-level invariant. If
// the predicate is false, the error message will be
// printed and compilation will stop before a screenful
// of more confusing errors is encountered.
// This function takes any number of ranges to merge sort
// and the same number of temporary buffers to merge into.
void mergeSort(R...)(R ranges) {
    static assert(R.length % 2 == 0,
        "Must have equal number of ranges to be sorted and temporary buffers.");
    static assert(allSatisfy!(isRandomAccessRange, R),
        "All arguments to mergeSort must be random access ranges.");

    // Implementation
}
Eiffel has the best error messages of all because it has the best template system of all. It is fully integrated into the language and works well because it is the only language that uses covariance in arguments.
Therefore it is much more than a simple compiler copy-and-paste. Unfortunately, explaining the difference in a few lines is impossible; just go and have a look at EiffelStudio.
There are some efforts to improve the error messages. Clang, for example, has put quite a lot of emphasis on generating more easily readable compiler error messages. I've only been using it for a short while, but my experience of it so far has been quite positive compared to GCC's equivalent errors.