spaces not needed in case labels? - c++

May be this is a silly question, but did anyone already know whether or not there would be a mandatory space between the case keyword and its constant expression in the switch statement?
The standard seems not say anything about...
Consider the following code:
switch(int_expression)
{
case1: /*anything*/ // no space before 1
caseZERO: /*anything*/ // no space before ZERO
// ZERO being defined as 0
// by the pre-processor
}
Both my reference compilers accept this code and once run it works nicely.
How does the preprocessor recognize that ZERO must be substituted?
Note also that if the control expression was a char type instead of a int type the following is also compiled but this time it no longer works
switch(char_expression)
{
case'a': /* anything */ // NO SPACES embedded
caseZERO: /* anything */ // NO SPACES again
// ZERO being defined as '0'
// by the pre-processor
}
THIS COMPILES but even if the value of the char_expression was '0'
no statement in caseZERO is executed.
Can anyone explain this?

These are general labels for use with goto and won't actually match the expression as you intend. The syntax is valid, but the semantics are different.
You don't specify whether this is a C compiler or a C++ compiler, so here's the documentation for goto in C as well.
Also, if you look in the actual grammar provided by the ISO C standard in Appendix A.2.3 (6.8.1) defines a labeled-statement as follows:
(6.8.1) labeled-statement:
identifier : statement
case constant-expression : statement
default : statement
Note specifically that the case is a single token which must be followed by at least one token separating character in the context of the grammar. A numeral as in your example of case1 does not count as a token separating character, so your example falls into the first branch of the grammar which is an identifier not a case statement, and thus can't be used as the target of a switch statement.
To answer your other question regarding ZERO being defined in a preprocessor macro, the substitution is not happening as you believe. Again, the preprocessor operates on tokens, and thus caseZERO is a single token which does not match the ZERO macro and so will not be substituted at all. Again, this is just defining a labeled-statement using the identifier branch where the identifier is the entire token caseZERO and not case0 or case'0' as you believe. The preprocessor does have means of doing "token pasting" using the ## operator, but that would require you to use case ## ZERO. However, this still would not have the behavior that you probably intend.

The type of the case expression is not relevant. It's the syntactic form that matters.
switch (expr) {
caseZERO: /*...*/;
case0: /*...*/;
case'a': /*...*/;
}
caseZERO and case0 are both valid identifiers. In this context, they are labels, but not case labels. If you had a statement
goto caseZERO;
or
goto case0;
in the same function, it would branch to the corresponding labeled statement.
(This is one of the many cases in C where a typo results in code that's still syntactically valid, but with a substantially different meaning.)
case'a', on the other hand, is two tokens, because the 'a' is a character constant. (But it should still be written as case 'a': for the benefit of the human reader.)
The rules for splitting source code into tokens require white space between an identifier, keyword, or numeric literal and another identifier, keyword, or numeric literal, because otherwise it would be ambiguous. They do not require white space in other contexts. That's why, for example, you can write:
x=y+func(42);
rather than
x = y + func ( 42 ) ;
Adding some white space will make the code more legible to human readers:
x = y + func(42);
but the compiler doesn't care.
(Another case where whitespace is important is in the definition of a function-like macro. The ( must immediately follow the macro name; otherwise it's treated as the first token of the expansion rather than introducing the parameter list.)

Related

why does C++ allow a declaration with no space between the type and a parenthesized variable name? [duplicate]

This question already has an answer here:
Is white space considered a token in C like languages?
(1 answer)
Closed 8 months ago.
A previous C++ question asked why int (x) = 0; is allowed. However, I noticed that even int(x) = 0; is allowed, i.e. without a space before the (x). I find the latter quite strange, because it causes things like this:
using Oit = std::ostream_iterator<int>;
Oit bar(std::cout);
*bar = 6; // * is optional
*Oit(bar) = 7; // * is NOT optional!
where the final line is because omitting the * makes the compiler think we are declaring bar again and initializing to 7.
Am I interpreting this correctly, that int(x) = 0; is indeed equivalent to int x = 0, and Oit(bar) = 7; is indeed equivalent to Oit bar = 7;? If yes, why specifically does C++ allow omitting the space before the parentheses in such a declaration + initialization?
(my guess is because the C++ compiler does not care about any space before a left paren, since it treats that parenthesized expression as it's own "token" [excuse me if I'm butchering the terminology], i.e. in all cases, qux(baz) is equivalent to qux (baz))
It is allowed in C++ because it is allowed in C and requiring the space would be an unnecessary C-compatibility breaking change. Even setting that aside, it would be surprising to have int (x) and int(x) behave differently, since generally (with few minor exceptions) C++ is agnostic to additional white-space as long as tokens are properly separated. And ( (outside a string/character literal) is always a token on its own. It can't be part of a token starting with int(.
In C int(x) has no other potential meaning for which it could be confused, so there is no reason to require white-space separation at all. C also is generally agnostic to white-space, so it would be surprising there as well to have different behavior with and without it.
One requirement when defining the syntax of a language is that elements of the language can be separated. According to the C++ syntax rules, a space separates things. But also according to the C++ syntax rules, parentheses also separate things.
When C++ is compiled, the first step is the parsing. And one of the first steps of the parsing is separating all the elements of the language. Often this step is called tokenizing or lexing. But this is just the technical background. The user does not have to know this. He or she only has to know that things in C++ must be clearly separted from each others, so that there is a sequence "*", "Oit", "(", "bar", ")", "=", "7", ";".
As explained, the rule that the parenthesis always separates is established on a very low level of the compiler. The compiler determines even before knowing what the purpose of the parenthesis is, that a parenthesis separates things. And therefore an extra space would be redundant.
When you ever use parser generators, you will see that most of them just ignore spaces. That means, when the lexer has produced the list of tokens, the spaces do not exist any more. See above in the list. There are no spaces any more. So you have no chance to specify something that explicitly requires a space.

GCC's implementation of angle-brackets includes. Why does it have to be as described below?

This document in its section 2.6 Computed Includes has the following paragraph:
If the line expands to a token stream beginning with a < token and
including a >  token, then the tokens between the <  and the
first > are combined to form the filename to be included. Any
whitespace between tokens is reduced to a single space; then any space
after the initial < is retained, but a trailing space before the
closing > is ignored. CPP searches for the file according to the rules
for angle-bracket includes.
I know this is implementation defined, but why does it have to be this way for GCC? I'm referring specifically to the highlighted sentence above.
EDIT
I have just noticed that the third paragraph before the one quoted above says the following:
You must be careful when you define the macro. #define saves tokens,
not text. The preprocessor has no way of knowing that the macro will
be used as the argument of #include, so it generates ordinary
tokens, not a header name. This is unlikely to cause problems if you
use double-quote includes, which are close enough to string constants.
If you use angle brackets, however, you may have trouble.
Does anyone know what kind of trouble is being pointed out here?
I guess the implementor chose the simplest way when they implemented this functionality, without giving it much thought.
It seems that the initial implementation landed in 2000-07-03 (two decades ago!). The relevant part looks like (source):
for (;;)
{
t = cpp_get_token (pfile);
if (t->type == CPP_GREATER || t->type == CPP_EOF)
break;
CPP_RESERVE (pfile, TOKEN_LEN (t));
if (t->flags & PREV_WHITE)
CPP_PUTC_Q (pfile, ' ');
pfile->limit = spell_token (pfile, t, pfile->limit);
}
Notably, it breaks out when it sees the CPP_GREATER token (i.e. >), before reserving memory for the token. This makes sense, since there's no need to allocate memory when the token will not be written to the buffer.
Then, only after memory is reserved, the preprocessor checks whether the token has preceding whitespace (t->flags & PREV_WHITE) and when it does, writes a whitespace character to the buffer.
As a result, in < foo / bar >, only the whitespaces before foo (that is, after the initial <), /, and bar are kept.

What does the parenthesis operator does on its own in C++

During writing some code I had a typo that lead to unexpected compilation results and caused me to play and test what would be acceptable by the compiler (VS 2010).
I wrote an expression consisting of only the parenthesis operator with a number in it (empty parenthesis give a compilation error):
(444);
When I ran the code in debug mode, it seems that the line is simply skipped by the program. What is the meaning of the parenthesis operator when it appears by itself?
If I can answer informally,
(444);
is a statement. It can be written wherever the language allows you to write a statement, such as in a function. It consists of an expression 444, enclosed in parentheses (which is also an expression) followed by the statement terminator ;.
Of course, any sane compiler operating in accordance with the as-if rule, will remove it during compilation.
One place where at least one statement is required is in a switch block (even if program control never reaches that point):
switch (1){
case 0:
; // Removing this statement causes a compilation error
}
(444); is a statement consisting of a parenthesized expression (444) and a statement terminator ;
(444) consists of parentheses () and a prvalue expression 444
A parenthesized expression (E) is a primary expression whose type, value, and value category are identical to those of E. The parenthesized expression can be used in exactly the same contexts as those where E can be used, and with the same meaning, except as otherwise indicated.
So in this particular case, parentheses have no additional significance,
so (444); becomes 444; which is then optimized out by the compiler.

What are the definitions for valid and invalid pp-tokens?

I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.
If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.
The order of preprocessor operation in general is not defined (*) in C90 (see The token pasting operator). Now in some cases it happens (said so in different sources, including the MISRA Committee, and the referenced page) that the order of multiple ##/#-Operators influences the occurrence of undefined behavior. But I have a really hard time to understand the examples of these sources and pin down the common rule.
So my questions are:
What are the rules for valid pp-tokens?
Are there difference between the different C and C++ Standards?
My current problem: Is the following legal with all 2 operator orders?(**)
#define test(A) test_## A ## _THING
int test(0001) = 2;
Comments:
(*) I don't use "is undefined" because this has nothing to do with undefined behavior yet IMHO, but rather unspecified behavior. More than one ## or # operator being applied do not in general render the program to be erroneous. There is obviously an order — we just can't predict which — so the order is unspecified.
(**) This is no actual application for the numbering. But the pattern is equivalent.
What are the rules for valid pp-tokens?
These are spelled out in the respective standards; C11 §6.4 and C++11 §2.4. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.
With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)
In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprise multiple preprocessing-tokens.
Are there differences between C and C++?
Yes, there are slight differences:
C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.
C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)
C allows hexadecimal floating point literals (eg. 1.1p-3), which are not valid preprocessing-tokens in C++.
C++ allows apostrophes to be used in integer literals as separators (1'000'000'000). In C, this would probably result in unmatched apostrophes.
There are minor differences in the handling of universal character names (eg. \u0234).
In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there are no exceptions to the longest-match rule; <:: is always tokenised using the longest-match rule, so the first token will always be <:.
Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?
Yes, that is legal in both languages.
test_ ## 0001 => test_0001 (identifier)
test_0001 ## _THING => test_0001_THING (identifier)
0001 ## _THING => 0001_THING (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
What are examples of invalid token concatenation?
Suppose we have
#define concat3(a, b, c) a ## b ## c
Now, the following are invalid regardless of concatenation order:
concat3(., ., .)
.. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.
concat3(27,e,-7)
Here, -7 is two tokens, so it cannot be concatenated.
And here is a case in which concatenation order matters:
concat3(27e, -, 7)
If this is concatenated left-to-right, it will result in 27e- ## 7, which is the concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token.
What exactly is a pp-number?
In general terms, a pp-number is a superset of tokens which might be converted into (single) numeric literals or might be invalid. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from the periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:
It starts with a decimal digit or a period (.) followed by a decimal digit.
The rest of the token is letters, numbers and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages; and also P and p in C since C99.
In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
Once preprocessing is terminated, all pp-numbers will be converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.
#define test(A) test_## A ## _THING
int test(0001) = 2;
This is legal with both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessor-tokens. The former is an identifier, while the latter is a pp-number; pp-numbers are not checked for suffix correctness until a later stage of compilation; think e.g. 0001u an unsigned octal literal.
A few examples to show that the order of evaluation does matter:
#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c) a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3); // OK LTR, invalid RTL
#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:); // invalid LTR, OK RTL
If your compiler uses a consistent order of evaluation (either LTR or RTL) and presents an error on generation of an invalid pp-token, then precisely one of these lines will generate an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.
The second example is rather contrived; because of the way the grammar is constructed it's very difficult to find a pp-token that is valid when build RTL but not when built LTR.
There are no significant differences between C and C++ in this regard; the two standards have identical language (up to section headers) describing the process of macro replacement. The only way the language could influence the process would be in the valid preprocessing-tokens: C++ (especially recently) has more forms of valid preprocessing-tokens, such as user-defined string literals.

Is D's grammar really context-free?

I've posted this on the D newsgroup some months ago, but for some reason, the answer never really convinced me, so I thought I'd ask it here.
The grammar of D is apparently context-free.
The grammar of C++, however, isn't (even without macros). (Please read this carefully!)
Now granted, I know nothing (officially) about compilers, lexers, and parsers. All I know is from what I've learned on the web.
And here is what (I believe) I have understood regarding context, in not-so-technical lingo:
The grammar of a language is context-free if and only if you can always understand the meaning (though not necessarily the exact behavior) of a given piece of its code without needing to "look" anywhere else.
Or, in even less rigor:
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
So, for example, C++ fails the context-free test because the meaning of confusing<sizeof(x)>::q < 3 > (2) depends on the value of q.
So far, so good.
Now my question is: Can the same thing be said of D?
In D, hashtables can be created through a Value[Key] declaration, for example
int[string] peoplesAges; // Maps names to ages
Static arrays can be defined in a similar syntax:
int[3] ages; // Array of 3 elements
And templates can be used to make them confusing:
template Test1(T...)
{
alias int[T[0]] Test;
}
template Test2(U...)
{
alias int[U] Test2; // LGTM
}
Test1!(5) foo;
Test1!(int) bar;
Test2!(int) baz; // Guess what? It's invalid code.
This means that I cannot tell the meaning of T[0] or U just by looking at it (i.e. it could be a number, it could be a data type, or it could be a tuple of God-knows-what). I can't even tell if the expression is grammatically valid (since int[U] certainly isn't -- you can't have a hashtable with tuples as keys or values).
Any parsing tree that I attempt to make for Test would fail to make any sense (since it would need to know whether the node contains a data type versus a literal or an identifier) unless it delays the result until the value of T is known (making it context-dependent).
Given this, is D actually context-free, or am I misunderstanding the concept?
Why/why not?
Update:
I just thought I'd comment: It's really interesting to see the answers, since:
Some answers claim that C++ and D can't be context-free
Some answers claim that C++ and D are both context-free
Some answers support the claim that C++ is context-sensitive while D isn't
No one has yet claimed that C++ is context-free while D is context-sensitive :-)
I can't tell if I'm learning or getting more confused, but either way, I'm kind of glad I asked this... thanks for taking the time to answer, everyone!
Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols to be generated by two non-terminals (so for the same strings of symbols to appear in two different contexts with a different meaning) and has nothing to do with type checking.
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad, in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
What people mean by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool support that kind of division"; allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is an relatively easy way to do the cut for C++ and follow quite will the C++ standard, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars, when you have such tools, it is convenient).
I don't know enough about D to know if there is or not a convenient cut of the language rules in a context-free grammar with semantic checks, but what you show is far from proving the case there isn't.
The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b or count.
Let's try and build a concrete example. Consider this simple COBOL statement:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE are not the same valid expressions that can follow VALUE. I could rewrite the second production in my grammar as follows:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression would be a different thing according to whether it was found after 'PICTURE' or after 'VALUE'. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
As you can see this is very different from your understanding. Consider:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a,b and c, in any of the languages for which this is a valid statement, however this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This a simplified version of your C++ example:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename to refer to certain types within templates or to use template when you call member function templates.
There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
First, the concept of grammars are quite intuitive. Imagine a set of rules:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt with this grammar.
Ok, back to main line.
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
(I'm making these syntaxes up)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G and not just G. What does this mean?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable, you generate one statement of the form variable = operand operator operand without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
Next, you need semantics. This is were context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context sensitive grammar (because parsing it is too difficult), but rather bit pieces of code that the parser calls when parsing the input (called action routines. Although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
executable -> variable equal expression
you have to have something like:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable in declaration matches the one being calculated.
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
Type checking
Number of arguments to function
default value to function
if member exists in obj in code: obj.member
Almost anything that's not like: missing ; or }
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
So in summary:
Lexer uses a regular grammar to tokenize input
Parser uses a context-free grammar to make sure the program is in correct structure
Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
For example, one language that I don't remember, used array subscription and function call both with parentheses and therefore it required the parser to go look up the type (context-sensitive related stuff) of the variable and determine which rule (function_call or array_substitution) to take.
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
A familiar example would be natural languages. For example in English, you say:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence and clause are nonterminals here. With this grammar you can create these sentences:
I go there because I want to
or
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb to whatever verb without "looking" at the rest of the sentence.
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you isolate any part of the code and it still can say that it is structurally correct, then it is context-free.
There is a construct in D's lexer:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.
The grammar cannot be context-free if I need I can't tell the type of
an expression just by looking at it.
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.
To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, deceleration before use rules can't be enforced in a context free syntax so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.
These answers are making my head hurt.
First of all, the complications with low level languages and figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
In C++ (order may be off, but that shouldn't invalidate my point):
it has to process macros and other preprocessor stuffs
it has to interpret templates
it finally interprets your code.
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
The reason people will try and defend a language (stating it is context-free) is, because the only exceptions that adds context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.