How does a parser for C++ differentiate between comparisons and template instantiations? - c++

In C++, the symbols '<' and '>' are used for comparisons as well as for signifying a template argument. Thus, the code snippet
[...] Foo < Bar > [...]
might be interpreted in either of the following two ways:
An object of type Foo with template argument Bar
Compare Foo to Bar, then compare the result to whatever comes next
How does the parser for a C++ compiler efficiently decide between those two possibilities?

If Foo is known to be a template name (e.g. a template <...> Foo ... declaration is in scope, or the compiler sees a template Foo sequence), then Foo < Bar cannot be a comparison. It must be a beginning of a template instantiation (or whatever Foo < Bar > is called this week).
If Foo is not a template name, then Foo < Bar is a comparison.
In most cases it is known what Foo is, because identifiers generally have to be declared before use, so there's no problem deciding one way or the other. There's one exception though: parsing template code. If Foo<Bar> is inside a template, and the meaning of Foo depends on a template parameter, then it is not known whether Foo is a template or not. The language standard directs the compiler to treat it as a non-template unless it is preceded by the keyword template.
The parser might implement this by feeding context back to the lexer. The lexer recognizes Foo as different types of tokens, depending on the context provided by the parser.
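To make the non-dependent case concrete, here is a minimal sketch in which the same token sequence is parsed two different ways, purely because of how Foo was declared. The namespace names and the values are made up for illustration.

```cpp
// Same tokens, two parses: the declaration of Foo decides which one applies.

namespace foo_is_variable {
constexpr int Foo = 1, Bar = 2, Baz = 3;
// Foo is not a template name, so this is (Foo < Bar) > Baz:
// (1 < 2) is true, and true > 3 is false.
constexpr bool result = Foo < Bar > Baz;
}

namespace foo_is_template {
template <int N>
struct Foo {
    static constexpr int value = N;
};
constexpr int Bar = 2;
// Foo names a template, so '<' opens a template argument list.
constexpr int result = Foo < Bar > ::value;
}

static_assert(!foo_is_variable::result, "parsed as two comparisons");
static_assert(foo_is_template::result == 2, "parsed as a template-id");
```

Compilers will likely warn about the chained comparison in the first namespace, which is fair: it is legal but rarely what anyone means.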

The important point to remember is that C++ grammar is not context-free. I.e., when the parser sees Foo < Bar, it (in most cases) knows whether Foo refers to a template (by looking it up in the symbol table), and if it does, < cannot be a comparison.
There are difficult cases, when you literally have to guide the parser. For example, suppose that you are writing a class template with a template member function, which you want to call with explicit template arguments. You might have to use syntax like:
a->template foo<int>();
(in some cases; see Calling template function within template class for details)
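A minimal sketch of such a case, assuming a made-up type Box with a member function template foo. Inside call_foo the type of *a depends on T, so the parser cannot look foo up and has to be told that it is a template.

```cpp
// Box and call_foo are hypothetical names used only for this illustration.
struct Box {
    template <typename U>
    constexpr U foo() const { return U(42); }
};

template <typename T>
constexpr int call_foo(const T* a) {
    // The type of *a depends on T, so foo cannot be looked up yet.
    // Without `template`, the '<' after foo would be read as less-than.
    return a->template foo<int>();
}

constexpr Box box{};
static_assert(call_foo(&box) == 42, "foo<int> was called through the pointer");
```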
Also, comparisons inside non-type template arguments must be surrounded by parentheses, i.e.:
foo<(A > B)>
not
foo<A > B>
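As a small sketch of the parenthesized form (foo, A and B are given made-up definitions here so that the snippet compiles on its own):

```cpp
template <bool Condition>
struct foo {
    static constexpr bool value = Condition;
};

constexpr int A = 3;
constexpr int B = 2;

// The parentheses keep the inner '>' from closing the template argument list.
static_assert(foo<(A > B)>::value, "3 > 2 is passed as the template argument");
static_assert(!foo<(A < B)>::value, "3 < 2 is passed as the template argument");

// foo<A > B>::value;  // would not compile: the first '>' closes the list after A
```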
Non-static data member initializers bring more fun: http://open-std.org/JTC1/SC22/WG21/docs/cwg_active.html#325

C and C++ parsers are "context sensitive", in other words, for a given token or lexeme, it is not guaranteed to be distinct and have only one meaning - it depends on the context within which the token is used.
So, the parser part of the compiler will know (by understanding "where in the source it is") that it is parsing some kind of type or some kind of comparison (This is NOT simple to know, which is why reading the source of a competent C or C++ compiler is not entirely straightforward - there are lots of conditions and function calls checking "is this one of these, if so do this, else do something else").
The keyword template helps the compiler understand what is going on, but in most cases, the compiler simply knows because < doesn't make sense in the other aspect - and if it doesn't make sense in EITHER form, then it's an error, so then it's just a matter of trying to figure out what the programmer might have wanted - and this is one of the reasons that sometimes, a simple mistake such as a missing } or template can lead the entire parsing astray and result in hundreds or thousands of errors [although sane compilers stop after a reasonable number to not fill the entire universe with error messages]

Most of the answers here confuse determining the meaning of the symbol (what I call "name resolution") with parsing (defined narrowly as "can read the syntax of the program").
You can do these tasks separately.
What this means is that you can build a completely context-free parser for C++ (as my company, Semantic Designs does), and leave the issue of deciding what the meaning of the symbol is to an explicitly separate following task.
Now, that task is driven by the possible syntax interpretations of the source code. In our parsers, these are captured as ambiguities in the parse.
What name resolution does is collect information about the declarations of names, and use that information to determine which of the ambiguous parses doesn't make sense, and simply drop those. What remains is a single valid parse, with a single valid interpretation.
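The idea can be sketched roughly as follows. This is only an illustration with hypothetical names (SymbolKind, Reading, resolve); it is not how Semantic Designs' tools are implemented, and a real parse forest is far richer than a two-element vector.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical names throughout; a toy model of ambiguity pruning.
enum class SymbolKind { Variable, Template };
enum class Reading { TemplateId, Comparison };

// A context-free parse of `Foo < Bar > ...` keeps both readings.
std::vector<Reading> ambiguous_parse() {
    return {Reading::TemplateId, Reading::Comparison};
}

// Name resolution drops every reading that contradicts the declaration of `name`.
std::vector<Reading> resolve(const std::vector<Reading>& candidates,
                             const std::map<std::string, SymbolKind>& declarations,
                             const std::string& name) {
    std::vector<Reading> survivors;
    const auto found = declarations.find(name);
    if (found == declarations.end()) return survivors;  // undeclared: no valid reading
    const bool is_template = (found->second == SymbolKind::Template);
    for (Reading reading : candidates) {
        const bool needs_template = (reading == Reading::TemplateId);
        if (needs_template == is_template) survivors.push_back(reading);
    }
    return survivors;
}

void check_resolution() {
    const std::map<std::string, SymbolKind> declarations = {
        {"Foo", SymbolKind::Template}};
    const std::vector<Reading> kept =
        resolve(ambiguous_parse(), declarations, "Foo");
    assert(kept.size() == 1 && kept[0] == Reading::TemplateId);
}
```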
The machinery to accomplish name resolution in practice is a big mess. But that's the C++ committee's fault, not the parser's or the name resolver's. The ambiguity removal with our tool is actually done automatically, which makes that part pretty nice. If you don't look inside our tools you would not appreciate that, but we do, because it means a small engineering team was able to build it.
See an example of resolution of template-vs-less-than on C++'s most vexing parse done by our parser.

Related

In OCaml Menhir, how to write a parser for C++/Rust/Java-style generics

In C++, a famous parsing ambiguity happens with code like
x<T> a;
If T is a type, it is what it looks like (a declaration of a variable a of type x<T>); otherwise it is (x < T) > a (< and > are comparison operators, not angle brackets).
In fact, we could make a change to make this become unambiguous: we can make < and > nonassociative. So x < T > a, without brackets, would not be a valid sentence anyway even if x, T and a were all variable names.
How could one resolve this conflict in Menhir? At first glance it seems we just can't. Even with the aforementioned modification, we need to look ahead an indeterminate number of tokens before we see another closing > and conclude that it was a template instantiation, or otherwise conclude that it was an expression. Is there any way in Menhir to implement such an arbitrary lookahead?
Different languages (including the ones listed in your title) actually have very different rules for templates/generics (like what type of arguments there can be, where templates/generics can appear, when they are allowed to have an explicit argument list and what the syntax for template/type arguments on generic methods is), which strongly affect the options you have for parsing. In no language that I know is it true that the meaning of x<T> a; depends on whether T is a type.
So let's go through the languages C++, Java, Rust and C#:
In all four of those languages both types and functions/methods can be templates/generic. So we'll not only have to worry about an ambiguity with variable declarations, but also function/method calls: is f<T>(x) a function/method call with an explicit template/type argument or is it two relational operators with the last operand parenthesized? In all four languages template/generic functions/methods can be called without template/type arguments when those can be inferred, but that inference isn't always possible, so just disallowing explicit template/type arguments for function/method calls is not an option.
Even if a language does not allow relational operators to be chained, we could get an ambiguity in expressions like this: f(a<b, c, d>(e)). Is this calling f with the three arguments a<b, c and d>e or with the single argument a<b, c, d>(e) calling a function/method named a with the type/template arguments b,c,d?
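C++ happens to accept both readings, so it can be used to show them side by side. In this sketch count_args and the two namespaces are invented for the demonstration; the call expression itself is character-for-character the same in both.

```cpp
// Counts how many arguments it was called with.
template <typename... Args>
constexpr int count_args(Args...) {
    return sizeof...(Args);
}

namespace names_are_values {
constexpr int a = 1, b = 2, c = 3, d = 4, e = 0;
// Three arguments: (a < b), c, and (d > (e)).
constexpr int n = count_args(a<b, c, d>(e));
}

namespace names_are_templates {
struct b {};
struct c {};
struct d {};
template <typename, typename, typename>
constexpr int a(int value) { return value; }
constexpr int e = 0;
// One argument: the result of calling a<b, c, d> with e.
constexpr int n = count_args(a<b, c, d>(e));
}

static_assert(names_are_values::n == 3, "two comparisons and a plain value");
static_assert(names_are_templates::n == 1, "a single generic call");
```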
Now beyond this common foundation, most everything else is different between these languages:
Rust
In Rust the syntax for a variable declaration is let variableName: type = expr;, so x<T> a; couldn't possibly be a variable declaration because that doesn't match the syntax at all. In addition it's also not a valid expression statement (anymore) because comparison operators can't be chained (anymore).
So there's no ambiguity here or even a parsing difficulty. But what about function calls? For function calls, Rust avoided the ambiguity by simply choosing a different syntax to provide type arguments: instead of f<T>(x) the syntax is f::<T>(x). Since type arguments for function calls are optional when they can be inferred, this ugliness is thankfully not necessary very often.
So in summary: let a: x<T> = ...; is a variable declaration, f(a<b, c, d>(e)); calls f with three arguments and f(a::<b, c, d>(e)); calls a with three type arguments. Parsing is easy because all of these are sufficiently different to be distinguished with just one token of lookahead.
Java
In Java x<T> a; is in fact a valid variable declaration, but it is not a valid expression statement. The reason for that is that Java's grammar has a dedicated non-terminal for expressions that can appear as an expression statement and applications of relational operators (or any other non-assignment operators) are not matched by that non-terminal. Assignments are, but the left side of assignment expressions is similarly restricted. In fact, an identifier can only be the start of an expression statement if the next token is either a =, ., [ or (. So an identifier followed by a < can only be the start of a variable declaration, meaning we only need one token of lookahead to parse this.
Note that when accessing static members of a generic class, you can and must refer to the class without type arguments (i.e. FooClass.bar(); instead of FooClass<T>.bar()), so even in that case the class name would be followed by a ., not a <.
But what about generic method calls? Something like y = f<T>(x); could still run into the ambiguity because relational operators are of course allowed on the right side of =. Here Java chooses a similar solution as Rust by simply changing the syntax for generic method calls. Instead of object.f<T>(x) the syntax is object.<T>f(x) where the object. part is non-optional even if the object is this. So to call a generic method with an explicit type argument on the current object, you'd have to write this.<T>f(x);, but like in Rust the type argument can often be inferred, allowing you to just write f(x);.
So in summary x<T> a; is a variable declaration and there can't be expression statements that start with relational operations; in general expressions this.<T>f(x) is a generic method call and f<T>(x); is a comparison (well, a type error, actually). Again, parsing is easy.
C#
C# has the same restrictions on expression statements as Java does, so variable declarations aren't a problem, but unlike the previous two languages, it does allow f<T>(x) as the syntax for function calls. In order to avoid ambiguities, relational operators need to be parenthesized when used in a way that could also be a valid call of a generic function. So the expression f<T>(x) is a method call and you'd need to add parentheses f<(T>(x)) or (f<T)>(x) to make it a comparison (though actually those would be type errors because you can't compare booleans with < or >, but the parser doesn't care about that) and similarly f(a<b, c, d>(e)) calls a generic method named a with the type arguments b,c,d whereas f((a<b), c, (d>e)) would involve two comparisons (and you can in fact leave out one of the two pairs of parentheses).
This leads to a nicer syntax for method calls with explicit type arguments than in the previous two languages, but parsing becomes kind of tricky. Considering that in the above example f(a<b, c, d>(e)) we can actually place an arbitrary number of arguments before d>(e) and a<b is a perfectly valid comparison if not followed by d>(e), we actually need an arbitrary amount of lookahead, backtracking or non-determinism to parse this.
So in summary x<T> a; is a variable declaration, there is no expression statement that starts with a comparison, f<T>(x) is a method call expression and (f<T)>(x) or f<(T>(x)) would be (ill-typed) comparisons. It is impossible to parse C# with menhir.
C++
In C++ a < b; is a valid (albeit useless) expression statement, the syntax for template function calls with explicit template arguments is f<T>(x) and a<b>c can be a perfectly valid (even well-typed) comparison. So statements like a<b>c; and expressions like a<b>(c) are actually ambiguous without additional information. Further, template arguments in C++ don't have to be types. That is, Foo<42> x; or even Foo<c> x; where c is defined as const int c = 42;, for example, could be perfectly valid instantiations of the Foo template if Foo is defined to take an integer as a template argument. So that's a bummer.
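A short sketch of the non-type argument case, with Foo given an assumed definition that takes an int:

```cpp
#include <type_traits>

template <int N>
struct Foo {
    static constexpr int value = N;
};

// A const int with a constant initializer can be used as a template argument.
const int c = 42;

Foo<42> x;
Foo<c> y;  // names the same specialization as Foo<42>

static_assert(std::is_same<decltype(x), decltype(y)>::value,
              "Foo<c> and Foo<42> are the same type");
```

So from the tokens alone, the thing between the angle brackets can be a literal, a variable name or a type name, and the parser still has to accept all of them.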
To resolve this ambiguity, the C++ grammar refers to the rule template-name instead of identifier in places where the name of a template is expected. So if we treated these as distinct entities, there'd be no ambiguity here. But of course template-name is defined simply as template-name: identifier in the grammar, so that seems pretty useless, ... except that the standard also says that template-name should only be matched when the given identifier names a template in the current scope. Similarly it says that identifiers should only be interpreted as variable names when they don't refer to a template (or type name).
Note that, unlike the previous three languages, C++ requires all types and templates to be declared before they can be used. So when we see the statement a<b>c;, we know that it can only be a template instantiation if we've previously parsed a declaration for a template named a and it is currently in scope.
So, if we keep track of scopes while parsing, we can simply use if-statements to check whether the name a refers to a previously parsed template or not in a hand-written parser. In parser generators that allow semantic predicates, we can do the same thing. Doing this does not even require any lookahead or backtracking.
But what about parser generators like yacc or menhir that don't support semantic predicates? For these we can use something known as the lexer hack, meaning we make the lexer generate different tokens for type names, template names and ordinary identifiers. Then we have a nicely unambiguous grammar that we can feed our parser generator. Of course the trick is getting the lexer to actually do that. In order to accomplish that, we need to keep track of which templates and types are currently in scope using a symbol table and then access that symbol table from the lexer. We'll also need to tell the lexer when we're reading the name of a definition, like the x in int x;, because then we want to generate a regular identifier even if a template named x is currently in scope (the definition int x; would shadow the template until the variable goes out of scope).
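A rough sketch of the symbol-table side of the lexer hack follows. All names here (Token, ScopedSymbols, classify, the in_declarator hint) are hypothetical, and a real lexer would hook this into its identifier rule; the point is only to show scoping and shadowing.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical token kinds and class names, chosen for illustration.
enum class Token { Identifier, TypeName, TemplateName };

class ScopedSymbols {
public:
    ScopedSymbols() { push_scope(); }  // start with the global scope
    void push_scope() { scopes_.emplace_back(); }
    void pop_scope() { scopes_.pop_back(); }

    // Called by the parser once it has finished a declaration.
    void declare(const std::string& name, Token kind) {
        scopes_.back()[name] = kind;
    }

    // Called by the lexer for every identifier-shaped lexeme.
    // in_declarator is the parser's hint that a new name is being introduced.
    Token classify(const std::string& name, bool in_declarator = false) const {
        if (in_declarator) return Token::Identifier;
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            const auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        return Token::Identifier;  // undeclared names lex as plain identifiers
    }

private:
    std::vector<std::unordered_map<std::string, Token>> scopes_;
};

void lexer_hack_demo() {
    ScopedSymbols symbols;
    symbols.declare("x", Token::TemplateName);  // template <...> struct x;
    assert(symbols.classify("x") == Token::TemplateName);

    symbols.push_scope();                                      // {
    assert(symbols.classify("x", true) == Token::Identifier);  //   the x in `int x;`
    symbols.declare("x", Token::Identifier);                   //   now shadows the template
    assert(symbols.classify("x") == Token::Identifier);
    symbols.pop_scope();                                       // }
    assert(symbols.classify("x") == Token::TemplateName);
}
```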
This same approach is used to resolve the casting ambiguity (is (T)(x) a cast of x to type T or a function call of a function named T?) in C and C++.
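The casting ambiguity can be reproduced in a few lines. In this sketch (namespace names invented), the expression (T)(x) is spelled identically twice and means two different things:

```cpp
namespace t_is_type {
using T = int;
constexpr double x = 2.5;
// T names a type: (T)(x) is a C-style cast that truncates 2.5 to 2.
constexpr double result = (T)(x);
}

namespace t_is_function {
constexpr double T(double v) { return v * 2; }
constexpr double x = 2.5;
// T names a function: (T)(x) is a call through a parenthesized function name.
constexpr double result = (T)(x);
}

static_assert(t_is_type::result == 2.0, "cast to int");
static_assert(t_is_function::result == 5.0, "function call");
```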
So in summary, foo<T> a; and foo<T>(x) are template instantiations if and only if foo is a template. Parsing's a bitch, but possible without arbitrary lookahead or backtracking and even using menhir when applying the lexer hack.
AFAIK C++'s template syntax is a well-known example of a real-world non-LR grammar. Strictly speaking, it is not LR(k) for any finite k... So C++ parsers are usually hand-written with hacks (like clang) or generated from the grammar by a GLR parser generator (LR with branching). So in theory it is impossible to implement a complete C++ parser in Menhir, which is LR.
However even the same syntax for generics can be different. If generic types and expressions involving comparison operators never appear under the same context, the grammar may still be LR compatible. For example, consider the rust syntax for variable declaration (for this part only):
let x : Vec<T> = ...
The : token indicates that a type, rather than an expression follows, so in this case the grammar can be LR, or even LL (not verified).
So the final answer is, it depends. But for the C++ case it should be impossible to implement the syntax in Menhir.

c++ auto for type and nontype templates

In C++17, template <auto> allows declaring templates with non-type parameters of arbitrary type. Partially inspired by this question, it would be useful to have an extension of template <auto> that captures both type and nontype template parameters, and also allows for a variadic version of it.
Are there plans for such an extension in the next c++20 release? Is there some fundamental problem in having a syntax like template<auto... X>, with X any type or nontype template parameter?
Are there plans for such an extension in the next c++20 release?
No.
Is there some fundamental problem in having a syntax like template<auto... X>, with X any type or nontype template parameter?
It would be a totally new concept in the language - having a name refer to either a type or a value in the same place. So it'd come with all sorts of additional questions - and probably additional language features to check if X is a type or not.
The syntax likely cannot be template <auto... X> struct Y { }; since that syntax already has meaning and means a bunch of values and Y<int>{} is ill-formed.
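For reference, a small sketch of what that syntax already means today (this needs C++17; the count member is added here just to have something to check):

```cpp
#include <cstddef>

template <auto... X>
struct Y {
    static constexpr std::size_t count = sizeof...(X);
};

// Every argument is a value; the types of the values may differ.
static_assert(Y<1, 'a', true>::count == 3, "three non-type arguments");
static_assert(Y<>::count == 0, "an empty pack is allowed");

// Y<int>{};  // ill-formed: int is a type, but X only accepts values
```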
There are definitely places where such a thing would be useful though. A proposal would just have to address these issues.
The big issue with trying to do something like that is grammar. Template parameters state up-front whether they are templates, types, or values, and the most important reason for this is grammatical.
C++ has a context-sensitive grammar. That means that you cannot know, just from a sequence of tokens, what a particular sequence of tokens means. For example, IDENTIFIER LEFT_PAREN RIGHT_PAREN SEMICOLON. What does that mean?
It could mean to call a function named by IDENTIFIER with no parameters. It could mean to value-initialize a prvalue of a class named by IDENTIFIER. These are rather different things; you might conceptually see them as similar, but C++'s grammar does not.
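Both readings of that token sequence can be shown with a pair of made-up declarations; the statement thing(); is identical in the two run functions:

```cpp
#include <type_traits>

namespace identifier_is_function {
void thing() {}
void run() { thing(); }  // calls the function
}

namespace identifier_is_class {
struct thing {
    thing() {}
};
void run() { thing(); }  // creates a temporary object and discards it
}

static_assert(std::is_void<decltype(identifier_is_function::thing())>::value,
              "a call expression of type void");
static_assert(std::is_class<decltype(identifier_is_class::thing())>::value,
              "a temporary of class type");
```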
Templates are not macros; they're not doing token pasting. There is some understanding that a piece of code in a template is supposed to mean a specific thing. And you can only do that if you at least know what kind of thing a template parameter is.
In order to retain this ability, these "omni template parameters" cannot be utilized until you actually know what they mean. So in order to create such a feature in C++, you would need to:
Create a new syntax to declare omni template parameters (auto isn't going to fly, as it already has a specific meaning).
Provide a syntax for determining what kind of thing an omni template parameter is.
Require the user to invoke that syntax before they can use such parameter names in most ways. This would typically be via some form of specialized if constexpr block, but pattern matching proposals represent an interesting alternative/additional way to handle them (since they can be expressions as well as statements). And expansion statements represent a possible way to access all of the omni parameters in a parameter pack.
I can't see how it would be useful for a template argument to be dynamically either a type or a value. The code statements that use types are very different from those which use constant values introduced through the template argument.
The only way would be a big "if constexpr" which would make it pointless in my view.
Ok, having looked more closely at the referenced question, I guess there is room there for generically pass-through wrapping the various explicit base template implementations that use different parameter orderings. I still fail to see a huge benefit. The compiler errors when it goes wrong are going to be unfathomable, if nothing else!
I remember being told that overloading and templates were going to rid the world of the unfathomable error messages generated from macros. I have yet to see it!

Has the C++ standard committee considered templated namespaces?

Namespaces are in many ways like classes with no constructors, no destructors, no inheritance, final, and only static methods and members. After all, this kind of class can essentially be used only the way namespaces are used: as a named scope for declarations and definitions.
... except that the above is not true, since classes can be templated - and namespaces cannot. There have been a couple of questions here on the site similar to "can I template a namespace", but what I'd like to know is - has the C++ standard committee ever considered a proposal to make namespaces templatable? If it has, was the proposal rejected? If it was, what were the reasons?
The inability to have a template namespace is actually just one way in which they differ from classes. Others would be things like new namespace, and sizeof (namespace) - how could a compiler implement that, given that a namespace may extend over many compilation units?
Looking just at template namespaces... While it can at times be hard to keep up with all the proposals for new C++ features, I don't recall ever seeing one that attempted to add a feature such as you describe.
Would it ever be considered, assuming someone were to write a proposal? As Stroustrup indicates in this interview (http://www.stroustrup.com/devXinterview.html):
For C++ to remain viable for decades to come, it is essential that
Standard C++ isn't extended to support every academic and commercial
fad. Most language facilities that people ask for can be adequately
addressed through libraries using only current C++ facilities.
As you indicate yourself, what you are asking for is basically already there: just use a templated class with static members. This seems to disqualify it as a potential new feature, at least in the eyes of Stroustrup.
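A minimal sketch of that workaround, with an invented name numeric_tools standing in for the "templated namespace":

```cpp
// A class template with only static members, used like a parameterized namespace.
template <typename T>
struct numeric_tools {
    numeric_tools() = delete;  // never meant to be instantiated as an object

    static constexpr T zero() { return T(); }
    static constexpr T twice(T value) { return value + value; }
};

static_assert(numeric_tools<int>::twice(21) == 42, "qualified like a namespace member");
static_assert(numeric_tools<long>::zero() == 0L, "a different 'instantiation' of the scope");
```

What this does not give you is the ability to reopen the scope later or to pull names in with a using-directive, which is where it stops resembling a namespace.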
How would ADL work if namespaces can be templated? Are we supposed to create special template deduction rules for ADL then?
More importantly, can you justify the added complexity to the language by demonstrating a use-case that can't be filled by just making a template struct with only static members? If a template namespace is just like a gimped template struct, that doesn't seem to be very compelling.
Also. I understand you weren't satisfied with the other questions about namespace / template hybrids, but one point in this answer seems to be relevant to your question:
Why can't namespaces be template parameters?
Possibly difficult: A namespace isn't a complete, self-contained entity. Different members of a namespace can be declared in different headers and even different compilation units.
If a namespace is a template, how will this even work? Can you still "reopen" the namespace like you can with a regular namespace? If that's allowed, then what is the point of instantiation of the namespace?
It sounds like it could potentially be extremely complicated.
Also: Will the language still be easily parsable after your proposed feature?
One of the most vexing things in C++ is the need to write template often when defining templates that refer to other templates, in order to resolve ambiguity in the grammar regarding whether < is a less than operator or a template parameter list.
3.4.5 [basic.lookup.classref]
In a class member access expression (5.2.5), if the . or -> token is immediately followed by an identifier followed by a <, the identifier must be looked up to determine whether the < is the beginning of a template argument list (14.2) or a less-than operator. The identifier is first looked up in the class of the object expression. If the identifier is not found, it is then looked up in the context of the entire postfix-expression and shall name a class template. If the lookup in the class of the object expression finds a template, the name is also looked up in the context of the entire postfix-expression and
— if the name is not found, the name found in the class of the object expression is used, otherwise
— if the name is found in the context of the entire postfix-expression and does not name a class template, the name found in the class of the object expression is used, otherwise
— if the name found is a class template, it shall refer to the same entity as the one found in the class of the object expression, otherwise the program is ill-formed.
If namespaces can be templates, don't we have to write template for them also, whenever we refer to a template after a :: operator? The reason is the same: foo::bar < 1 ... could be a namespace template bar inside of template foo with a non-type template parameter, or it could be a comparison of 1 with int foo::bar.
How do we disambiguate between that and the third possibility, that foo is a namespace and bar is a regular class template inside of it?

For C++ templates, is there a way find types that are "valid" inputs?

I have a library where template classes/functions often access explicit members of the input type, like this:
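template <typename InputType>
bool IsSomethingTrue(InputType arg1) {
    // Requires InputType to expose the nested member types SubType1::SubType2.
    typename InputType::SubType1::SubType2 a{};
    // Do something
    return true;  // placeholder result
}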
Here, SubType1 and SubType2 are themselves generic types that were used to instantiate InputType. Is there a way to quickly find all the types in the library that are valid to pass in for InputType (likewise for SubType1 and SubType2)? So far I have just been searching the entire code base for classes containing the appropriate members, but the template input names are reused in a lot of places so it is very cumbersome.
From a coding perspective, what is the point of using a template like this when there is only a limited set of valid input types that are probably already defined? Why not just overload this function with explicit types rather than making them generic?
First of all, because those overloads would have the exact same body, or very similar ones. If the body of the function is long enough, having more versions of it is a problem for maintenance. When you need to change the algorithm, you now have to do it N times and hope you won't make mistakes. Most of the time, redundancy is bad.
Moreover, even though now there could be just a few such types which satisfy the syntactic requirements of your function, there may be more in future. Having a function template allows you to let your algorithm work with new types without the need to write a new overload every time one new such type is introduced.
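As a sketch of that point, with invented types (Leaf, Middle, First, AddedLater) and a placeholder body for the function: the template is written once, and a type introduced afterwards works with it as long as it provides the same nested members.

```cpp
#include <cassert>

// All types below are hypothetical stand-ins for the library's real ones.
struct Leaf { int payload = 7; };
struct Middle { using SubType2 = Leaf; };

struct First { using SubType1 = Middle; };
struct AddedLater { using SubType1 = Middle; };  // written after the template

template <typename InputType>
bool IsSomethingTrue(InputType /*arg1*/) {
    // Only the nested names matter; the concrete type is never spelled out.
    typename InputType::SubType1::SubType2 a{};
    return a.payload == 7;
}

void check_both_types() {
    assert(IsSomethingTrue(First{}));
    assert(IsSomethingTrue(AddedLater{}));  // no new overload was needed
}
```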
The advantage of using generic types is not on the template end: if you're willing to explicitly name them and edit the template code every time, it's the same.
What happens, however, when you introduce a subclass or variant of a type accepted by the template? No modification needed on the other end.
In other words, when you say that all types are known beforehand, you are excluding code modifications and extensions, which is half the point of using templates.

Why is the type of a name determined during the first phase of template evaluation, even for dependent names?

As a corollary to a question I asked previously, I am curious why the kind of entity a name denotes (the 'category' of that name) in a template is fixed in the first phase of the two-phase lookup, when the category itself can also depend on the template parameter. What is the real gain of this behavior?
A little clarification - I think I have a fair understanding of how the 2 phase look-up works; what I'm trying to understand is why a category of a token is definitively determined in phase 1, which differs from when dependent types are determined (in phase 2). My argument is that there is a very real gain in simplifying a difficult syntax, to make code easier to write and to read, so I am curious what the compelling reason to restrict category evaluation to phase 1 is. Is it simply for better template validation/error messages before template instantiation, or a marginal increase in speed? Or is there some fundamental attribute of templates that makes phase 2 category evaluation unfeasible?
The question could be twofold: why do we want two-phase lookup in the first place, and given that we have two-phase lookup, why is the interpretation of the tokens fixed during the first phase? The first is the harder question to answer, as it is a design decision in the language; as such it has its advantages and disadvantages, and depending on where you stand the ones or the others will have more weight.
The second part, which is what you are interested in, is actually much simpler. Why, in a C++ language with two-phase lookup, are the token meanings fixed during the first phase, and why can't they be left to be interpreted in the second phase? The reason is that C++ has a contextual grammar, and the interpretation of the tokens is highly dependent on the context. Without fixing the meaning of the tokens during the first phase you won't even know what names need to be looked up in the first place.
Consider a slightly modified version of your original code, where the literal 5 is substituted by a constant expression, and assuming that you did not need to provide the template or typename keywords that bit you the last time:
const int b = 5;
template<typename T>
struct Derived : public Base<T> {
    void Foo() {
        Base<T>::Bar<false>(b); // [1]
        std::cout << b; // [2]
    }
};
What are the possible meanings of [1] (ignoring the fact that in C++ this is determined by adding typename and template)?
1. Bar is a static template function that takes a single bool as template argument and an integer as argument. b is a non-dependent name that refers to the constant 5. *
2. Bar is a nested template type that takes a single bool as template argument. b is an instance of that type defined inside the function Derived<T>::Foo and not used.
3. Bar is a static member variable of a type X for which there is a comparison operator< that takes a bool and yields as result an object of type U that can be compared with operator> with an integer.
Now the question is how do we proceed resolving the names before the template arguments are substituted in (i.e. during the first phase). If we are in case 1. or 3. then b needs to be looked up and the result can be substituted in the expression. In the first case yielding your original code: Base<T>::template Bar<false>(5), in the latter case yielding operator>( operator<( Base<T>::Bar,false ), 5 ). In the remaining case (2.) the code after the first phase would be exactly the same as the original code: Base<T>::Bar<false> b; (removing the extra ()).
The meaning of the second line [2] is then dependent on how we interpreted the first one [1]. In the 2. case it represents a call to operator<<( std::cout, Base<T>::Bar<false> & ), while in the other two cases it represents operator<<( std::cout, 5 ). Again the implications extend beyond what type is the second argument, as in the 2. case name b within Derived<T>::Foo is dependent, and thus it cannot be resolved during the first phase but rather postponed to the second phase (where it will also affect lookup by adding the namespaces of Base and the instantiating type T to the Argument Dependent Lookup).
As the example shows, the interpretation of the tokens impacts the meaning of the names, and that in turn affects what the rest of the code means, what names are dependent or not and thus what else needs to be looked up or not during the first phase. At the same time, the compiler does perform checks during the first pass, and if the tokens could be reinterpreted during the second pass, then the checks and the results of the lookup during the first pass would be rendered useless (imagine that during the first pass b had been substituted with 5 only to find out that we are in case 2. during the second phase!), and everything would have to be checked during the second phase.
The existence of two-phase lookup depends on the tokens being interpreted and their meaning being selected during the first phase. The alternative is a single-pass lookup, as VS does.
* I am simplifying the cases here, in the Visual Studio compiler, that does not implement two-phase lookup, b could also be a member of Base<T> for the currently instantiating type T (i.e. it can be a dependent name)
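For comparison, here is a sketch of how the first two interpretations are actually spelled in standard C++, where the typename and template keywords fix the reading during the first phase. The base class names and the values returned are invented so that the two cases can be told apart.

```cpp
#include <cassert>

const int b = 5;

// Interpretation 1: Bar is a static member function template.
template <typename T>
struct BaseWithFunction {
    template <bool Negate>
    static int Bar(int v) { return Negate ? -v : v; }
};

// Interpretation 2: Bar is a nested class template.
template <typename T>
struct BaseWithType {
    template <bool Flag>
    struct Bar { int value = Flag ? 10 : 20; };
};

template <typename T>
struct DerivedCall : BaseWithFunction<T> {
    int Foo() {
        // `template` alone: Bar is a template but not a type, so this is a
        // call, and b is the non-dependent global constant.
        return BaseWithFunction<T>::template Bar<false>(b);
    }
};

template <typename T>
struct DerivedDecl : BaseWithType<T> {
    int Foo() {
        // `typename` plus `template`: Bar<false> is a type, so this declares
        // a local variable b that hides the global one.
        typename BaseWithType<T>::template Bar<false> b;
        return b.value;
    }
};

void check_interpretations() {
    assert(DerivedCall<int>().Foo() == 5);
    assert(DerivedDecl<int>().Foo() == 20);
}
```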
Much of the advantage of C++ is that it's a strictly checked language. You express the intent of your program as clearly as possible, and the compiler tells you if that intent is violated.
I can't imagine that you would ever write Base<T>::Bar<false>(b); (from Dribeas's example) and not have a particular interpretation that you want. By telling the interpretation to the compiler (Base<T>::template Bar<false>(b);), it can generate a meaningful error if someone provides a type that has a static member Bar or a nested template type instead of a member function template.
Other languages are designed to stress terseness over static analysis; for example many dynamic languages have a great number of "do what I mean" rules. Which causes fun when the compiler turns non-sensible code into something unpredictable, with no errors. (Case in point: Perl. I love it for text manipulation, but goodness DWIM is annoying. Almost everything is a runtime error, there's barely any static checking to speak of)