Compiler: limitation of lexical analysis - c++

In classical compiler theory, the first two phases are Lexical Analysis and Parsing. They form a pipeline: lexical analysis recognizes tokens, which become the input to parsing.
But I have come across some cases that are hard to recognize correctly during lexical analysis. For example, the following C++ template code:
map<int, vector<int>>
Here the >> would be recognized as a bitwise right shift by a "regular" lexical analyzer, but that's not correct. My feeling is that it's hard to divide the handling of this kind of grammar into two phases; some of the lexing work has to be done during parsing, because correctly interpreting the >> depends on the grammar, not just on a simple lexical rule.
I'd like to know the theory and practice around this problem. Also, how does a C++ compiler handle this case?

The C++ standard requires that an implementation perform lexical analysis to produce a stream of tokens, before the parsing stage. According to the lexical analysis rules, two consecutive > characters (not followed by =) will always be interpreted as one >> token. The grammar provided with the C++ standard is defined in terms of these tokens.
The requirement that in certain contexts (such as when expecting a > within a template-id) the implementation should interpret >> as two > is not specified within the grammar. Instead the rule is specified as a special case:
14.2 Names of template specializations [temp.names]

After name lookup (3.4) finds that a name is a template-name or that an operator-function-id or a literal-operator-id refers to a set of overloaded functions any member of which is a function template, if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id. [ Note: The second > token produced by this replacement rule may terminate an enclosing template-id construct or it may be part of a different construct (e.g. a cast). —end note ]
Note the earlier rule, that in certain contexts < should be interpreted as the < in a template-argument-list. This is another example of a construct that requires context in order to disambiguate the parse.
The C++ grammar contains many such ambiguities which cannot be resolved during parsing without information about the context. The best known of these is the Most Vexing Parse, in which an identifier may be interpreted as a type-name depending on context.
Keeping track of the aforementioned context in C++ requires an implementation to perform some semantic analysis in parallel with the parsing stage. This is commonly implemented in the form of semantic actions that are invoked when a particular grammatical construct is recognised in a given context. These semantic actions then build a data structure that represents the context and permits efficient queries. This is often referred to as a symbol table, but the structure required for C++ is pretty much the entire AST.
Such context-sensitive semantic actions can also be used to resolve ambiguities. For example, on recognising an identifier in the context of a namespace-body, a semantic action will check whether the name was previously defined as a template. The result of this is then fed back to the parser. This can be done by marking the identifier token with the result, or by replacing it with a special token that will match a different grammar rule.
The same technique can be used to mark a < as the beginning of a template-argument-list, or a > as the end. The rule for context-sensitive replacement of >> with two > poses essentially the same problem and can be resolved using the same method.
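A minimal sketch of that feedback loop (all names here are hypothetical, not taken from any real compiler): the parser consults the symbol table it has built so far and re-marks raw identifier tokens before matching them against grammar rules.

#include <set>
#include <string>

enum class TokenKind { Identifier, TemplateName, Less, Greater };

struct Token {
    TokenKind kind;
    std::string text;
};

// Re-mark a raw identifier using context accumulated during parsing.
// A token re-marked as TemplateName matches a different grammar rule,
// so a following '<' can be taken as opening a template-argument-list.
Token classify(Token t, const std::set<std::string>& known_templates) {
    if (t.kind == TokenKind::Identifier && known_templates.count(t.text))
        t.kind = TokenKind::TemplateName;
    return t;
}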

You are right, the theoretically clean distinction between lexer and parser is not always possible. I remember a project I worked on as a student. We were to implement a C compiler, and the grammar we used as a basis would treat typedef'ed names as types in some cases and as identifiers in others. So the lexer had to switch between these two modes. The way I implemented this back then was with special empty rules, which reconfigured the lexer depending on context. To accomplish this, it was vital to know that the parser would always use exactly one token of look-ahead, so any change to lexer behaviour had to occur at least one lexical token before the affected location. In the end, this worked quite well.
In the C++ case of >> you mention, I don't know what compilers actually do. willj quoted how the specification phrases this, but implementations are allowed to do things differently internally, as long as the visible result is the same. So here is how I'd try to tackle it: upon reading a >, the lexer would emit a GREATER token, but also switch to a state where each subsequent > without a space in between would be lexed as GREATER_REPEATED. Any other symbol would switch the state back to normal. Instead of state switches, you could also achieve this by lexing the regular expression >+ and emitting multiple tokens from that one rule. In the parser, you could then use rules like the following:
rightAngleBracket: GREATER | GREATER_REPEATED;
rightShift: GREATER GREATER_REPEATED;
With a bit of luck, you could make template argument rules use rightAngleBracket, while expressions would use rightShift. Depending on how much look-ahead your parser has, it might be necessary to introduce additional non-terminals to hold longer sequences of ambiguous content, until you encounter some context which allows you to finally decide between these cases.
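Here is a rough sketch of the >+ variant in ordinary code (the token names are hypothetical, not from any real lexer generator): a run of > characters yields one GREATER followed by GREATER_REPEATED tokens, which the grammar rules above can then recombine.

#include <cstddef>
#include <string>
#include <vector>

enum class Tok { Greater, GreaterRepeated };

// Lex a maximal run of '>' characters starting at pos: the first one
// becomes GREATER, each immediately following '>' (no whitespace in
// between) becomes GREATER_REPEATED. pos is advanced past the run.
std::vector<Tok> lexAngleRun(const std::string& src, std::size_t& pos) {
    std::vector<Tok> out;
    if (pos < src.size() && src[pos] == '>') {
        out.push_back(Tok::Greater);
        for (++pos; pos < src.size() && src[pos] == '>'; ++pos)
            out.push_back(Tok::GreaterRepeated);
    }
    return out;
}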


What does a parser for C++ do until it can differentiate between comparisons and template instantiations?

After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:
struct foo
{
    void method() { a<b>c; }
    // a b c may be declared here
};
Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?
For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?
Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach used in common compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens, and that is what must be saved in order to reparse the function body. I don't believe it would be feasible to reparse the source code itself.)
It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.
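As a sketch of that idea (an illustration, not code from gcc or clang): once the opening brace of a function body is seen, the body's tokens can be captured by counting braces alone and saved for a second parse after the class is complete.

#include <cstddef>
#include <vector>

enum class Tok { LBrace, RBrace, Less, Greater, Other };

// Save the token range of a function body, assuming toks[i] is the
// opening '{'. Only braces need to balance; '<' and '>' can be ignored
// at this stage. The caller re-parses the returned range later.
std::vector<Tok> captureBody(const std::vector<Tok>& toks, std::size_t& i) {
    std::vector<Tok> body;
    int depth = 0;
    do {
        if (toks[i] == Tok::LBrace) ++depth;
        else if (toks[i] == Tok::RBrace) --depth;
        body.push_back(toks[i++]);
    } while (depth > 0 && i < toks.size());
    return body;
}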
C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b, or c is true). Whether designing languages in this manner is a good idea is another question.
The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:
Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser
Clang knows when an identifier is a type by using a symbol table
C++ requires declarations to be visible throughout the class, even in code that appears before it
Clang gets around this by doing a full parsing/semantic analysis of the declaration but leaving the definition for later; in other words, the definition is lexed immediately but parsed only after all the declared types are available

Double closing angle brackets (>>) generate syntax error in SPECIFIC case

Eclipse (Luna, 4.4.2) tells me that I have a syntax error on the following line:
static_cast<Vec<int, DIM>>(a.mul(b));
I remembered that double closing angle brackets >> can lead to problems with some compilers, so I put a blank in between: > >. The syntax error disappears.
BUT I have many >> in my program where no syntax error is detected, such as:
Node<Element<DIM>> * e= a.get();
Why do I get an error in the specific case mentioned above? This is NOT a duplicate of "error: 'varName' was not declared in this scope", since I'm specifically asking why my compiler accepts a >> sometimes, but not always.
You are using a compiler that predates the C++11 standard. Under the older standard, the parser could not disambiguate a pair of closing angle brackets >> ending a nested template type specifier from operator>>, so you had to write a space between them.
Samples like >>> or >>* fall under a different case for the old parsers, which is why they work without an error message.
I have to admit I don't know exactly what was changed in the C++11 (the current) standard's definitions so that this situation can be clearly disambiguated by a C++11-compliant parser.
The "right angle bracket fix" is found in §14.2 [temp.names]/p3 (emphasis mine):
When parsing a template-argument-list, the first non-nested > is
taken as the ending delimiter rather than a greater-than operator.
Similarly, the first non-nested >> is treated as two consecutive but
distinct > tokens, the first of which is taken as the end of the
template-argument-list and completes the
template-id. [ Note: The second > token produced by this replacement rule may terminate an enclosing
template-id construct or it may be part of a different construct (e.g. a cast).—end note ]
If the static_cast is otherwise valid, then both pieces of code in the OP are perfectly valid in C++11 and perfectly invalid in C++03. If your IDE reports an error on one but not the other, then it's a bug with that IDE.
It's difficult (and also somewhat pointless) for us to speculate on the source of the bug. One potential cause is that the second > closes a different construct in each snippet (in the first case it closes a cast, in the second a template-argument-list), and the parser's implementation somehow missed the "second > is part of a different construct" case. But that's just wild speculation.
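For reference, here is a minimal self-contained illustration of the C++11 rule quoted above; a C++03 compiler rejects the first declaration, while any C++11 compiler accepts all three lines:

#include <vector>

int main() {
    std::vector<std::vector<int>> v;   // C++11: >> splits into two > tokens
    std::vector<std::vector<int> > w;  // C++03 spelling, still valid
    int x = 256 >> 4;                  // here >> is still a right shift
    (void)v; (void)w; (void)x;
}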

Problems with matches containing space for gtksourceview?

I'm working on improving syntax highlighting for Ada in gtksourceview (currently it is very outdated and very incomplete). One issue I'm having is that Ada is very positional, so matching many constructs requires matching those positions. I was able to do this in nano fairly easily.
So, let's consider a type declaration such as:
type Trit is range 0..2;
Keywords like "type", "is" and "range" are recognized (and were originally). However, type names were treated as keywords (a bad design decision, as Ada regularly defines new types, even for simple types like integers). What the use gets, is the types in Standard being colored, and all other types looking like normal text, defeating the purpose of highlighting. In some languages this might be a notable problem. However, the majority of type names occur after two regex patterns:
type\s+(\w|\.|_)+
:\s+(\w|\.|_)+
It might just be a matter of implementation (nano and gtksourceview seem to use different regex implementations). I thought the problem was recognizing the spaces. As it turns out, putting the type context above the keyword context results in types being highlighted, but then the "type" keyword and the ":" operator are not highlighted properly (they are highlighted as "type"). I was able to override this in nano, resulting in correct highlighting, but I cannot seem to find out how gtksourceview does this.
Here you can see the old gtksourceview definition in action, which doesn't work for a file with many custom types, and my nano definition in action side by side for comparison; matching by position is definitely possible and works.
Here is what happens when I put the type context below the keyword context.
Here is what happens when I put the type context above the keyword context.
In both cases the context is the same, just a simple pattern to get started.
<context id="type" style-ref="type">
<match>(type)\s+\w+</match>
</context>
You may want to consider generating the parser from the formal description of the syntax of Ada in annex P of the Language Reference Manual.
Unfortunately this doesn't answer your question of how to formulate the syntax for a GtkSourceView.
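One thing that may be worth trying (an untested sketch in the GtkSourceView language-definition format; the context ids here are made up) is to keep a single positional match but style its capture groups separately via sub-pattern contexts, so the keyword and the type name no longer compete for one style:

<context id="type-declaration">
  <match>\b(type)\s+((\w|\.|_)+)</match>
  <include>
    <context id="type-decl-keyword" sub-pattern="1" style-ref="keyword"/>
    <context id="type-decl-name" sub-pattern="2" style-ref="type"/>
  </include>
</context>

With this arrangement, the relative ordering of the type and keyword contexts should matter less, because neither has to override the other.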

Boost.Spirit adding #include feature into calculator example

Following the Boost.Spirit compiler examples, I am migrating my Flex/Bison-based calculator-like grammar to a Spirit-based one. I want to add an #include<another_input.inp> feature. I have defined the include_statement grammar successfully. Should I follow the way error handling is done, i.e. on_success(include_statement, annotation_function(...)): for each successful match of include_statement, get the new input file name and call phrase_parse() again? Or should I push/pop the input stack as in Flex/Bison?
Thanks.
Guessing, from the little information that is here, that you meant to ask whether you can reuse the same grammar instance or whether it would be better to instantiate a new one to parse the includes: it depends.
You can do both.
When the grammar is stateless (hint: it usually is if you can use it const) there's no difference. Otherwise, prefer to instantiate a separate instance.
However,
the point is somewhat moot since it appears you already decided to parse the includes after parsing the main document (if I get your comment right)
there's always the danger of global state; even if the grammar object is const, you could potentially modify external state (e.g. using phx::ref from a semantic action), so this could be an issue regardless of whether you use separate grammar instances.
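Putting the stateless-grammar point into code, a minimal sketch (parse_file is a made-up helper, not Spirit API; it assumes a Qi grammar that is usable as const): each include simply triggers another phrase_parse over the included file's contents with the same shared grammar instance.

#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <iterator>
#include <string>

namespace qi = boost::spirit::qi;

// Parse one file with a shared, stateless (const) grammar. An include
// handler could call this recursively for each included file name.
template <typename Grammar>
bool parse_file(const std::string& path, const Grammar& g) {
    std::ifstream in(path);
    std::string src((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
    std::string::const_iterator first = src.begin(), last = src.end();
    return qi::phrase_parse(first, last, g, qi::space) && first == last;
}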

What is the meaning of "qualifier"?

What is the meaning of "qualifier" and the difference between "qualifier" and "keyword"?
Take the volatile qualifier in C: we can say that volatile is a keyword, so what does "qualifier" mean?
A qualifier adds an extra "quality", such as specifying volatility or constness of a variable. They're similar to adjectives: "a fickle man", "a volatile int", "an incorruptible lady", "a const double". With or without a qualifier, the variable itself still occupies the same amount of memory, and each bit has the same interpretation or contribution to the state/value. Qualifiers just specify something about how it may be accessed or where it is stored.
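That "same amount of memory" claim can be checked directly; for example:

static_assert(sizeof(int) == sizeof(const int)
           && sizeof(int) == sizeof(volatile int),
              "qualifiers do not change an object's size");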
Keywords are predefined, reserved identifiers (arguably, see below) that the language itself assigns some meaning to, rather than leaving them free for you to use for your own purposes (i.e. naming your variables, types, namespaces, functions...).
Examples
volatile and const are both qualifiers and keywords
if, class, namespace are keywords but not qualifiers
std, main, iostream, x, my_counter are all identifiers but neither keywords nor qualifiers
There's a full list of keywords at http://www.cppreference.com/wiki/keywords/start. C++ doesn't currently have any qualifiers that aren't keywords (i.e. they're all "words" rather than some punctuation symbols).
Where do qualifiers appear relative to other type information?
A quick aside from "what does qualifier mean" into the syntax of using a qualifier - as Zaibis comments below:
...[qualifiers] only qualify what follows [when] there is nothing preceding. so if you want a const pointer to non-const object you had to write char * const var...
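A short illustration of that placement rule:

const char* p1 = "hi";        // pointer to const char: *p1 = 'x' won't compile
char* const p2 = 0;           // const pointer to char: ++p2 won't compile
const char* const p3 = "hi";  // const pointer to const char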
A bit (lot?) about identifiers
identifiers themselves are lexical tokens (distinct parts of the C++ source code) that:
begin with an alpha/letter character or underscore
continue with 0 or more alphanumerics or underscores
If it helps, you can think of identifiers as specified by the regexp "[A-Za-z_][A-Za-z_0-9]*". Examples are "egg", "string", "__f", "x0" but not "4e4" (a double literal), "0x0a" (that's a hex literal), "(f)" (that's three lexical tokens, the middle being the identifier "f").
But are keywords identifiers?
For C++, the terminology isn't used consistently. In general computing usage, keywords are a subset of identifiers, and some places/uses in the C++11 Standard clearly reflect that:
"The identifiers shown in Table 4 are reserved for use as keywords" (first sentence in 2.12 Keywords)
"Identifiers that are keywords or operators in C++..." (from 17.6.1.2 footnote 7)
(There are alternative forms of some operators - not, and, xor, or - though annoyingly Visual C++ disables them by default to avoid breaking old code that used them but not as operators.)
As Potatoswatter points out in a comment, in many other places the Standard defines lexical tokens identifier and keyword as mutually exclusive tokens in the Grammar:
"There are five kinds of tokens: identifiers, keywords, ..." (2.7 Tokens)
There's also an edge case where the determination's context sensitive:
If a keyword (2.12) or an alternative token (2.6) that satisfies the syntactic requirements of an identifier (2.11) is contained in an attribute-token, it is considered an identifier. (7.6.1. Attribute Syntax and Semantics 2)
Non-keyword identifiers you still shouldn't use
Some identifiers, like "std" or "string", have a specific usage specified in the C++ Standard - they are not keywords though. Generally, the compiler itself doesn't treat them any differently to your own code, and if you don't include any Standard-specified headers then the compiler probably won't even know about the Standard-mandated use of "std". You might be able to create your own function, variable or type called "std". Not a good idea though... while it's nice to understand the general division between keywords and the Standard library, implementations have freedom to blur the boundaries so you should just assume C++ features work when relevant headers are included and your usage matches documentation, and not do anything that might conflict.
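For instance, the following translation unit compiles with mainstream compilers precisely because no header has introduced the Standard meaning of the name (again, not a good idea in real code):

int std = 42;               // here "std" is just an ordinary identifier
int main() { return std; }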