After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:
struct foo
{
void method() { a<b>c; }
// a b c may be declared here
};
Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?
For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?
Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach commonly taken in widely-used compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens, and that is what must be saved in order to reparse the function body. I don't believe that it is feasible to reparse the source code.)
It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.
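That brace-matching skip is easy to sketch. Below is a minimal illustration in C++ (the token representation is invented for the example and is not how any real compiler stores tokens): the body's tokens are located, and could be saved for later reparsing, purely by counting braces.

#include <cstddef>
#include <string>
#include <vector>

// Given a token stream positioned at the '{' that opens a function body,
// find the matching '}' by counting braces alone. Everything in between can
// be stashed and reparsed once the class definition is complete.
std::size_t skip_balanced_braces(const std::vector<std::string>& tokens,
                                 std::size_t open_brace_pos)
{
    int depth = 0;
    for (std::size_t i = open_brace_pos; i < tokens.size(); ++i) {
        if (tokens[i] == "{") ++depth;
        else if (tokens[i] == "}" && --depth == 0)
            return i;            // index of the matching '}'
    }
    return tokens.size();        // unbalanced input
}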
C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b or c is true). Whether designing languages in this manner is a good idea is another question.
The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:
Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser
Clang knows when an identifier is a type by using a symbol table
C++ requires declarations to be visible throughout the class, even in code that appears before it
Clang gets around this by doing a full parse/semantic analysis of the declaration, but leaving the definition for later; in other words, the body is lexed immediately but only parsed once all of the class's declarations are available.
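A small example of why that deferral is needed (the class and members are invented for illustration): inside a member function body, names declared later in the same class are visible, so the body can only be parsed correctly once the whole class definition has been seen.

struct widget {
    int size() const { return count; }      // 'count' is only declared below
    int first() const { return items[0]; }  // likewise 'items'
    int items[4] = {1, 2, 3, 4};
    int count = 4;
};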
Problem:
I am using VC++2010, so apart from a few supported features such as decltype, pre-C++11 code is required.
Given a C++ identifier, is it possible to use some meta-programming techniques to check whether that identifier is a type name or a variable name? In other words, given the code below:
void f() {
s=(uint32)-1;
}
is it possible to somehow identify if uint32 is:
the name of a variable, which means the RHS of the assignment is a subtraction; or
a type name, in which case the RHS operand is the literal -1 typecast to uint32
by something like mytemplate<uint32> or similar?
Rationale: I am using my own in-house developed mini parser to analyze/instrument C++ source code. But my mini parser lacks many features, like building a table of identifier types, so it always interprets the above code as either a subtraction or a typecast. My parser is able to modify the source code, so I can insert/modify anything surrounding the uint32, e.g.
void f() {
s=(mytemplate<uint32>(...))-1;
s=myfunc(uint32)-1;
}
But my inserted code will cause a syntax error depending on the meaning of the identifier uint32 (type name vs. variable name). I am looking for some generic code that I can insert to cater for both cases.
In classic compiler theory, the first two phases are Lexical Analysis and Parsing. They run in a pipeline: Lexical Analysis recognizes tokens, which are the input of Parsing.
But I came across some cases which are hard to recognize correctly during Lexical Analysis. For example, the following code involving C++ templates:
map<int, vector<int>>
the >> would be recognized as a bitwise right shift by a "regular" Lexical Analysis, but that is not correct here. My feeling is that it's hard to divide the handling of this kind of grammar into two phases; the lexing work has to be done in the parsing phase, because correctly parsing the >> relies on the grammar, not only on simple lexical rules.
I'd like to know the theory and practice around this problem. Also, I'd like to know how C++ compilers handle this case.
The C++ standard requires that an implementation perform lexical analysis to produce a stream of tokens, before the parsing stage. According to the lexical analysis rules, two consecutive > characters (not followed by =) will always be interpreted as one >> token. The grammar provided with the C++ standard is defined in terms of these tokens.
The requirement that in certain contexts (such as when expecting a > within a template-id) the implementation should interpret >> as two > is not specified within the grammar. Instead the rule is specified as a special case:
14.2 Names of template specializations [temp.names]
After name lookup (3.4) finds that a name is a template-name or that an operator-function-id or a literal-operator-id refers to a set of overloaded functions any member of which is a function template, if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id. [ Note: The second > token produced by this replacement rule may terminate an enclosing template-id construct or it may be part of a different construct (e.g. a cast). — end note ]
Note the earlier rule, that in certain contexts < should be interpreted as the < in a template-argument-list. This is another example of a construct that requires context in order to disambiguate the parse.
The C++ grammar contains many such ambiguities which cannot be resolved during parsing without information about the context. The most well known of these is known as the Most Vexing Parse, in which an identifier may be interpreted as a type-name depending on context.
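A quick illustration of that ambiguity (names invented for the example): the following local declaration looks like an object definition but is parsed as a function declaration.

#include <string>

void demo()
{
    // Intended: a std::string named s, initialised from a default-constructed temporary.
    // Actually parsed: a declaration of a function 's' taking a (pointer to) function
    // returning std::string, and itself returning std::string.
    std::string s(std::string());

    // C++11 brace initialisation forces the object interpretation:
    // std::string t{std::string()};
}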
Keeping track of the aforementioned context in C++ requires an implementation to perform some semantic analysis in parallel with the parsing stage. This is commonly implemented in the form of semantic actions that are invoked when a particular grammatical construct is recognised in a given context. These semantic actions then build a data structure that represents the context and permits efficient queries. This is often referred to as a symbol table, but the structure required for C++ is pretty much the entire AST.
This kind of context-sensitive semantic action can also be used to resolve ambiguities. For example, on recognising an identifier in the context of a namespace-body, a semantic action will check whether the name was previously defined as a template. The result of this is then fed back to the parser. This can be done by marking the identifier token with the result, or by replacing it with a special token that will match a different grammar rule.
The same technique can be used to mark a < as the beginning of a template-argument-list, or a > as the end. The rule for context-sensitive replacement of >> with two > poses essentially the same problem and can be resolved using the same method.
You are right, the theoretically clean distinction between lexer and parser is not always possible. I remember a project I worked on as a student. We were to implement a C compiler, and the grammar we used as a basis would treat typedef'd names as types in some cases, and as identifiers in others. So the lexer had to switch between these two modes. The way I implemented this back then was using special empty rules, which reconfigured the lexer depending on context. To accomplish this, it was vital to know that the parser would always use exactly one token of look-ahead. So any change to lexer behaviour would have to occur at least one lexical token before the affected location. In the end, this worked quite well.
In the C++ case of >> you mention, I don't know what compilers actually do. willj quoted how the specification phrases this, but implementations are allowed to do things differently internally, as long as the visible result is the same. So here is how I'd try to tackle this: upon reading a >, the lexer would emit token GREATER, but also switch to a state where each subsequent > without a space in between would be lexed to GREATER_REPEATED. Any other symbol would switch the state back to normal. Instead of state switches, you could also do this by lexing the regular expression >+, and emitting multiple tokens from this rule. In the parser, you could then use rules like the following:
rightAngleBracket: GREATER | GREATER_REPEATED;
rightShift: GREATER GREATER_REPEATED;
With a bit of luck, you could make template argument rules use rightAngleBracket, while expressions would use rightShift. Depending on how much look-ahead your parser has, it might be necessary to introduce additional non-terminals to hold longer sequences of ambiguous content, until you encounter some context which allows you to eventually decide between these cases.
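Here is a rough sketch of that lexing idea in C++ (the token names GREATER and GREATER_REPEATED come from this answer, not from any real compiler): the first > in a run becomes GREATER and each immediately following > becomes GREATER_REPEATED, so the parser can combine them into a right shift or consume them one at a time inside a template-argument-list.

#include <cstddef>
#include <string>
#include <vector>

enum class Tok { GREATER, GREATER_REPEATED, OTHER };

// Lex a run of consecutive '>' characters into GREATER followed by
// GREATER_REPEATED tokens; everything else becomes OTHER.
std::vector<Tok> lex_angle_runs(const std::string& src)
{
    std::vector<Tok> out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '>') {
            out.push_back(Tok::GREATER);
            while (i + 1 < src.size() && src[i + 1] == '>') {
                out.push_back(Tok::GREATER_REPEATED);
                ++i;
            }
        } else {
            out.push_back(Tok::OTHER);
        }
    }
    return out;
}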
Is it better to use static const vars than #define preprocessor? Or maybe it depends on the context?
What are advantages/disadvantages for each method?
Pros and cons of #defines, consts and (what you forgot) enums, depending on usage:
enums:
only possible for integer values
properly scoped / identifier clash issues handled nicely, particularly in C++11 enum classes where the enumerations for enum class X are disambiguated by the scope X:: (see the sketch after this list)
strongly typed, but to a big-enough signed-or-unsigned int size over which you have no control in C++03 (though you can specify a bit field into which they should be packed if the enum is a member of struct/class/union), while C++11 defaults to int but can be explicitly set by the programmer
can't take the address - there isn't one as the enumeration values are effectively substituted inline at the points of usage
stronger usage restraints (e.g. incrementing - template <typename T> void f(T t) { cout << ++t; } won't compile, though you can wrap an enum into a class with implicit constructor, casting operator and user-defined operators)
each constant's type is taken from the enclosing enum, so template <typename T> void f(T) gets a distinct instantiation when passed the same numeric value from different enums, all of which are distinct from any actual f(int) instantiation. Each function's object code could be identical (ignoring address offsets), but I wouldn't expect a compiler/linker to eliminate the unnecessary copies, though you could check your compiler/linker if you care.
even with typeof/decltype, can't expect numeric_limits to provide useful insight into the set of meaningful values and combinations (indeed, "legal" combinations aren't even notated in the source code, consider enum { A = 1, B = 2 } - is A|B "legal" from a program logic perspective?)
the enum's typename may appear in various places in RTTI, compiler messages etc. - possibly useful, possibly obfuscation
you can't use an enumeration without the translation unit actually seeing the value, which means enums in library APIs need the values exposed in the header, and make and other timestamp-based recompilation tools will trigger client recompilation when they're changed (bad!)
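A small sketch of the scoping and underlying-type points above (the types and enumerators are invented for the example, C++11):

#include <cstdint>

enum class colour : std::uint8_t { red, green, blue };  // explicit underlying type
enum class status : std::uint8_t { red, ok };           // no clash with colour::red

// colour c = red;        // error: 'red' is only visible as colour::red
colour c = colour::red;   // the scope disambiguates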
consts:
properly scoped / identifier clash issues handled nicely
strong, single, user-specified type
you might try to "type" a #define ala #define S std::string("abc"), but the constant avoids repeated construction of distinct temporaries at each point of use
One Definition Rule complications
can take address, create const references to them etc.
most similar to a non-const value, which minimises work and impact if switching between the two
value can be placed inside the implementation file, allowing a localised recompile and just client links to pick up the change
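A minimal sketch of that last point (the file and constant names are just for illustration): only a declaration goes in the header, the value lives in one implementation file, so changing it means clients relink but don't recompile. Note that extern is needed on the definition as well, since a namespace-scope const otherwise has internal linkage in C++.

// limits.h
extern const int max_connections;        // declaration only -- no value exposed here

// limits.cpp
extern const int max_connections = 64;   // the value lives in a single translation unit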
#defines:
"global" scope / more prone to conflicting usages, which can produce hard-to-resolve compilation issues and unexpected run-time results rather than sane error messages; mitigating this requires:
long, obscure and/or centrally coordinated identifiers, and access to them can't benefit from implicitly matching used/current/Koenig-looked-up namespace, namespace aliases etc.
while the trumping best-practice allows template parameter identifiers to be single-character uppercase letters (possibly followed by a number), other use of identifiers without lowercase letters is conventionally reserved for and expected of preprocessor defines (outside the OS and C/C++ library headers). This is important for enterprise scale preprocessor usage to remain manageable. 3rd party libraries can be expected to comply. Observing this implies migration of existing consts or enums to/from defines involves a change in capitalisation, and hence requires edits to client source code rather than a "simple" recompile. (Personally, I capitalise the first letter of enumerations but not consts, so I'd be hit migrating between those two too - maybe time to rethink that.)
more compile-time operations possible: string literal concatenation, stringification (taking size thereof), concatenation into identifiers (see the sketch after this list)
downside is that given #define X "x" and some client usage ala "pre" X "post", if you want or need to make X a runtime-changeable variable rather than a constant you force edits to client code (rather than just recompilation), whereas that transition is easier from a const char* or const std::string given they already force the user to incorporate concatenation operations (e.g. "pre" + X + "post" for string)
can't use sizeof directly on a defined numeric literal
untyped (GCC doesn't warn if compared to unsigned)
some compiler/linker/debugger chains may not present the identifier, so you'll be reduced to looking at "magic numbers" (strings, whatever...)
can't take the address
the substituted value need not be legal (or discrete) in the context where the #define is created, as it's evaluated at each point of use, so you can reference not-yet-declared objects, depend on "implementation" that needn't be pre-included, create "constants" such as { 1, 2 } that can be used to initialise arrays, or #define MICROSECONDS *1E-6 etc. (definitely not recommending this!)
some special things like __FILE__ and __LINE__ can be incorporated into the macro substitution
you can test for existence and value in #if statements for conditionally including code (more powerful than a post-preprocessing "if" as the code need not be compilable if not selected by the preprocessor), use #undef-ine, redefine etc.
substituted text has to be exposed:
in the translation unit it's used by, which means macros in libraries for client use must be in the header, so make and other timestamp-based recompilation tools will trigger client recompilation when they're changed (bad!)
or on the command line, where even more care is needed to make sure client code is recompiled (e.g. the Makefile or script supplying the definition should be listed as a dependency)
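A small illustration of the compile-time text operations and the __FILE__/__LINE__ point above (the macro names are invented for the example):

#include <iostream>

#define NAME "report"
#define EXT  ".txt"
#define STRINGIFY(x)  #x
#define PASTE(a, b)   a##b
#define WHERE()       (std::cout << __FILE__ << ":" << __LINE__ << '\n')

int main()
{
    const char* file = NAME EXT;            // adjacent string literals concatenate: "report.txt"
    std::cout << file << '\n';
    std::cout << STRINGIFY(1 + 2) << '\n';  // prints the text "1 + 2", not 3
    int PASTE(count, 1) = 7;                // token pasting declares a variable named count1
    std::cout << count1 << '\n';
    WHERE();                                // prints the source file and the invocation line
}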
My personal opinion:
As a general rule, I use consts and consider them the most professional option for general usage (though the others have a simplicity appealing to this old lazy programmer).
Personally, I loathe the preprocessor, so I'd always go with const.
The main advantage of a #define is that it requires no storage in your program, as it really just replaces some text with a literal value. It also has the advantage that it has no type, so it can be used for any integer value without generating warnings.
Advantages of "const"s are that they can be scoped, and they can be used in situations where a pointer to an object needs to be passed.
I don't know exactly what you are getting at with the "static" part though. If you are declaring globally, I'd put it in an anonymous namespace instead of using static. For example
namespace {
unsigned const seconds_per_minute = 60;
}

int main (int argc, char *argv[]) {
...
}
If this is a C++ question and it mentions #define as an alternative, then it is about "global" (i.e. file-scope) constants, not about class members. When it comes to such constants in C++, static const is redundant: in C++, namespace-scope consts have internal linkage by default, so there's no point in declaring them static. It is really about const vs. #define.
And, finally, in C++ const is preferable. At least because such constants are typed and scoped. There are simply no reasons to prefer #define over const, aside from a few exceptions.
String constants, BTW, are one example of such an exception. With #defined string constants one can use compile-time concatenation feature of C/C++ compilers, as in
#define OUT_NAME "output"
#define LOG_EXT ".log"
#define TEXT_EXT ".txt"
const char *const log_file_name = OUT_NAME LOG_EXT;
const char *const text_file_name = OUT_NAME TEXT_EXT;
P.S. Again, just in case, when someone mentions static const as an alternative to #define, it usually means that they are talking about C, not about C++. I wonder whether this question is tagged properly...
#define can lead to unexpected results:
#include <iostream>
#define x 500
#define y x + 5
int z = y * 2;
int main()
{
std::cout << "y is " << y;
std::cout << "\nz is " << z;
}
Outputs an incorrect result:
y is 505
z is 510
However, if you replace this with constants:
#include <iostream>
const int x = 500;
const int y = x + 5;
int z = y * 2;
int main()
{
std::cout << "y is " << y;
std::cout << "\nz is " << z;
}
It outputs the correct result:
y is 505
z is 1010
This is because #define simply replaces the text. Because doing this can seriously mess up order of operations, I would recommend using a constant variable instead.
Using a static const is like using any other const variable in your code. This means you can trace wherever the information comes from, as opposed to a #define, which is simply replaced in the code during preprocessing.
You might want to take a look at the C++ FAQ Lite for this question:
http://www.parashift.com/c++-faq-lite/newbie.html#faq-29.7
A static const is typed (it has a type) and can be checked by the compiler for validity, redefinition, etc.
A #define can be redefined, undefined, whatever.
Usually you should prefer static consts. They have no disadvantage. The preprocessor should mainly be used for conditional compilation (and sometimes for really dirty tricks, maybe).
Defining constants with the preprocessor directive #define is discouraged not only in C++, but also in C. Such constants have no type. Even in C it was proposed to use const for constants.
Always prefer to use the language features over some additional tools like preprocessor.
ES.31: Don't use macros for constants or "functions"
Macros are a major source of bugs. Macros don't obey the usual scope and type rules. Macros don't obey the usual rules for argument passing. Macros ensure that the human reader sees something different from what the compiler sees. Macros complicate tool building.
From C++ Core Guidelines
As a rather old and rusty C programmer who never quite made it fully to C++ because other things came along, and who is now hacking along getting to grips with Arduino, my view is simple.
#define is a compiler preprocessor directive and should be used as such, for conditional compilation etc., e.g. where low-level code needs to define some possible alternative data structures for portability to specific hardware. It can produce inconsistent results depending on the order in which your modules are compiled and linked. If you need something to be global in scope, then define it properly as such.
const and (static const) should always be used to name static values or strings. They are typed and safe and the debugger can work fully with them.
enums have always confused me, so I have managed to avoid them.
Please see here: static const vs define
Usually a const declaration (notice it doesn't need to be static) is the way to go.
If you are defining a constant to be shared among all the instances of the class, use static const. If the constant is specific to each instance, just use const (but note that all constructors of the class must initialize this const member variable in the initialization list).
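A short illustration of that distinction (the class and member names are invented for the example):

struct account {
    static const int max_accounts = 1000;  // one value shared by all instances
    const int id;                          // fixed separately for each instance
    explicit account(int n) : id(n) {}     // const members must be initialised
                                           // in the constructor's initialisation list
};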
I was reading the paper "An Object oriented preprocessor fit for C++".
"http://www.informatik.uni-bremen.de/st/lehre/Arte-fakt/Seminar/papers/17/An%20Object-Oriented%20preprocessor%20fit%20for%20C++.pdf"
It discusses three different types of macros.
1. text macros // pretty much the same as the C preprocessor
2. computational macros // text replaced as the result of a computation
3. syntax macros // text replaced by the syntax tree representing a linguistically consistent construct
Can somebody please explain the last two types of macros in more detail?
It says that inline functions and templates are examples of computational macros, how ?
Looking at Cheatham's original paper from 1966, which Willink and Muchnick's paper refers to, I'd summarize the different macro types like this:
Text macros do text replacements before scanning and parsing.
Syntactic macros are processed during scanning and parsing. Calling a syntax macro replaces the macro call with another piece of AST.
Computational macros can happen at any point after the AST has been built by the scanner and the parser. The point is that at this point we are no longer processing any text but instead manipulating the nodes of the AST i.e., we are dealing with objects that might already even have semantic information attached to them.
I'm no C++ internals expert, but I'd assume that inlining function calls and instantiating templates are about manipulating the syntax tree before, while, and after it's been annotated with the semantic information necessary to compile it properly, as both of those seem to require knowing a lot (like type info, and whether something is worth inlining) that is not yet known during scanning and parsing.
By 2. it sounds like they mean that some computation is done at compile time and the resulting instructions executed at runtime only involve the result. I wouldn't think inline functions particularly represent this, but template meta-programming does exactly this. Also constexpr in C++11.
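For instance, both of the following compute a factorial entirely at compile time, so only the result is involved at runtime (a sketch assuming C++11 for constexpr and static_assert):

// Classic template meta-programming: the computation is driven by template instantiation.
template <unsigned N>
struct factorial { static const unsigned value = N * factorial<N - 1>::value; };

template <>
struct factorial<0> { static const unsigned value = 1; };

// C++11 constexpr expresses the same computation as an ordinary-looking function.
constexpr unsigned factorial_fn(unsigned n)
{
    return n == 0 ? 1 : n * factorial_fn(n - 1);
}

static_assert(factorial<5>::value == 120, "evaluated during compilation");
static_assert(factorial_fn(5) == 120, "evaluated during compilation");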
I think 3. could also be represented by the use of templates. A template does represent a syntax tree, and instantiating it involves taking the generic syntax tree, filling in the parameterized, unknown bits, and using the resulting syntax tree.