boost::spirit::qi difference parser behavior - c++

I'm trying to understand the behavior of Qi's Difference Parsers.
With something like this:
ruleA =
ruleAa
| ruleAb
| ruleAc
;
ruleB =
ruleA - ruleAc
;
I was imagining that the parser would match ruleB iff the input matches ruleAa or ruleAb. In other words, ruleB would subtract one alternative (ruleAc) from ruleA. Is this incorrect?
In my code, I'm doing something like the above and it compiles but does not seem to behave as I expected. My actual use-case involves other factors that are hard to unwind here.
In essence, what I'm trying to do is this:
I have a rule like ruleA, which contains a set of alternatives.
This rule gets used in a few different places in my grammar. But in one particular use, I need to avoid evoking just one of the alternatives. It seems like the difference parser is intended for that purpose, but maybe I'm misunderstanding?
Thanks in advance!

I was imagining that the parser would match ruleB iff the input matches ruleAa or ruleAb. In other words, ruleB would subtract one alternative (ruleAc) from ruleA. Is this incorrect?
(A|B|C) - C is only equivalent to (A|B) iff C never matches anything that (A|B) would match.
A a simple example, assume:
A = int_;
B = +space;
C = char_;
Because C always matches where A or B would match, it is clear that doing (A|B) - C always results in no match. Same for (A|B|C)-C.
In short any A or B that also matches C no longer matches with something - C.
This rule gets used in a few different places in my grammar. But in one particular use, I need to avoid evoking just one of the alternatives. It seems like the difference parser is intended for that purpose
As you've seen above, this is not the case. The difference parser doesn't /modify/ the left-hand-side parser at all. (It just discards some matches independent of the left-hand-side).
The best thing I can think of is to /just use/ (A|B) in one place, and (A|B|C) in the other. It's really that simple: Say What You Mean.

Related

Complicated Search and Replace using RegEx

I'm trying to convert a bunch of custom "recipes" from an old proprietary format to something that is ultimately compatible with C#. And I think that the easiest way to do this would be to use regular expressions. But I'm having trouble figuring out the expression. The piece that I need to convert with this RegEx is the IF statements. Here are a few examples of the original recipes...
IF(A = B,C,D)
IF(AA = BB,IF(E=F,G,H),DD)
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
The first one is straightforward... If A = B then C else D.
The second one is similar, except that the IF statements are nested.
And the third one includes additional ROIND function calls in the results.
I've stumbled across regex101.com and have managed to put together the following pattern which is getting close. It works for the first example, but not for the other two: (.*?)IF[^\S\r\n]*\((.*?),(.*?),(.*?)\)
Ultimately, what I want to do is use a regular expression to turn the three examples above into:
if (A == B) { C } else { D }
if (AA == BB) { if (E == F) { G } else { H } } else { DD }
if (S1 <> R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }
Note that the whitespace in the results is not particularly important. I just formatted it for readability. Also, the "ROUND" functions will be replaced separately with C# Math.Round() calls. No need to worry about those, here. (All I should need to do to them is add, "Math." and fix the capitalization.)
I'll keep plugging away at this, but if anyone out there has the RegEx experience to figure this out, I would appreciate it.
EDIT: With some additional effort, I've expounded upon my first expression and got it into the following... (.*?)IF[^\S\r\n]*\((.*?),(([^\(]*)|(.*?\(.*?\))),(([^\(]*)|(.*?\(.*?\)))\) And with the following replace expression... $1if($2) {$3} else {$6} I'm almost there. It's just the nested IF statements that are left. (And although I'd prefer to do this with a single pass, if a recursive expression is not going to work, I could rig something up to run the results of the expression through it a second time to deal with the nested IF statements. It's not ideal, but if it's the best I have, I could live with it.
The problem with using regex for parsing arbitrary recursive grammar, is that regex are not particularly suitable for recursion. There is a limited support for recursion in some regex implementation, but it's tricky to make it work for anything slightly more complicated than simple balanced parentheses.
That being said, for your particular case, although at the first sight it appears as recursive grammar, it might be possible to cheat.
In
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
if it is guaranteed that both S1<>R1 and R4 don't contain comma symbol, then you can use the following regex:
IF\(([^,]*),(.*),([^,]+)\)
Try it here: https://regexr.com/67r56
How it works: the first matching group greedily matches everything from the beginning of the string, until it encounters the first comma, then the second group greedily matches everything to the end, and starts backtracking, until the very last comma of the string is "released" from the second group. After that the third group matches the "released tail" of the string.
However, as I mentioned in the comments, if S1, R1 or R4 are expressions themself, this regex trick won't work, and you'd need to use a proper recursive parser. Fortunately, there are plenty of parser/combinator libraries for user defined grammars (or you might even find one that already works for your grammar). When your expression is parsed into AST, it's fairly easy to transform it into the desired form.
Alternatively, you can look into writing your own simple parser. It should be fairly straightforward, as you only care about nested parentheses and commas.

Does order not matter in regular expressions?

I was looking at the question posed in this stackoverflow link (Regular expression for odd number of a's) for which it is asked to find the regular expression for strings that have odd number of a over Σ = {a,b}.
The answer given by the top comment which works is b*(ab*ab*)*ab*.
I am quite confused - a was placed just before the last b*, does this ordering actually matter? Why can't it be b*a(ab*ab*)*b* instead (where a is placed after the first b*), or any other permutation of it?
Another thing I am confused about is why it is (ab*ab*)* and not (b*ab*ab*)*. Isn't b*ab*ab* the more accurate definition of 'having exactly 2 a'?
Why can't it be b*a(ab*ab*)*b* instead?
b*a(ab*ab*)*b* does not work because it would require the string to have two consecutive as before the first non-leading b, wouldn't it? For example, abaa would not be matched by your proposed regex when it should. Use the regex debugger on a site like Regex101 to see this for yourself.
On the other hand, moving the whole ab* part to the start (b*ab*(ab*ab*)*) works as well.
why it is (ab*ab*)* and not (b*ab*ab*)*?
(b*ab*ab*)* does work, but the first b* is quite redundant because whatever b there is left, will be matched by the last b* in the group. There is also a b* before the group, which causes the b* to not be able to match anything, hence it is redundant.
There are infinitely many equivalent regular expressions which generate a given (infinite) regular language. A particular expression might be preferable in some cases and by certain authors: one might prefer a minimal expression, or one which shows structure or symmetry, or even one that simplifies the reasoning in a proof by induction.
Your particular suggestion to move the a is insufficient since, as noted above, that ensures the substring aa will appear in any string with more than one a. However, abab could be changed to baba to make that placement work. Choosing babab* would work with either placement. You could even go for an expression like bab + bababab + (babab*)a(babab*) which might be nice to work with depending on your application. Something like b*(abab)ab* has the advantage of being minimal (if it's not strictly minimal, it must be pretty close).

Regular expression in C++ for mathematical expressions

I have this trouble: I must verify the correctness of many mathematical expressions especially check for consecutive operators + - * /.
For example:
6+(69-9)+3
is ok while
6++8-(52--*3)
no.
I am not using the library <regex> since it is only compatible with C++11.
Is there a alternative method to solve this problem? Thanks.
You can use a regular expression to verify everything about a mathematical expression except the check that parentheses are balanced. That is, the regular expression will only ensure that open and close parentheses appear at the point in the expression they should appear, but not their correct relationship with other parentheses.
So you could check both that the expression matches a regex and that the parentheses are balanced. Checking for balanced parentheses is really simple if there is only one type of parenthesis:
bool check_balanced(const char* expr, char open, char close) {
int parens = 0;
for (const char* p = expr; *p; ++p) {
if (*p == open) ++parens;
else if (*p == close && parens-- == 0) return false;
}
return parens == 0;
}
To get the regular expression, note that mathematical expressions without function calls can be summarized as:
BEFORE* VALUE AFTER* (BETWEEN BEFORE* VALUE AFTER*)*
where:
BEFORE is sub-regex which matches an open parenthesis or a prefix unary operator (if you have prefix unary operators; the question is not clear).
AFTER is a sub-regex which matches a close parenthesis or, in the case that you have them, a postfix unary operator.
BETWEEN is a sub-regex which matches a binary operator.
VALUE is a sub-regex which matches a value.
For example, for ordinary four-operator arithmetic on integers you would have:
BEFORE: [-+(]
AFTER: [)]
BETWEEN: [-+*/]
VALUE: [[:digit:]]+
and putting all that together you might end up with the regex:
^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$
If you have a Posix C library, you will have the <regex.h> header, which gives you regcomp and regexec. There's sample code at the bottom of the referenced page in the Posix standard, so I won't bother repeating it here. Make sure you supply REG_EXTENDED in the last argument to regcomp; REG_EXTENDED|REG_NOSUB, as in the example code, is probably even better since you don't need captures and not asking for them will speed things up.
You can loop over each charin your expression.
If you encounter a + you can check whether it is follow by another +, /, *...
Additionally you can group operators together to prevent code duplication.
int i = 0
while(!EOF) {
switch(expression[i]) {
case '+':
case '*': //Do your syntax checks here
}
i++;
}
Well, in general case, you can't solve this with regex. Arithmethic expressions "language" can't be described with regular grammar. It's context-free grammar. So if what you want is to check correctness of an arbitrary mathemathical expression then you'll have to write a parser.
However, if you only need to make sure that your string doesn't have consecutive +-*/ operators then regex is enough. You can write something like this [-+*/]{2,}. It will match substrings with 2 or more consecutive symbols from +-*/ set.
Or something like this ([-+*/]\s*){2,} if you also want to handle situations with spaces like 5+ - * 123
Well, you will have to define some rules if possible. It's not possible to completely parse mathamatical language with Regex, but given some lenience it may work.
The problem is that often the way we write math can be interpreted as an error, but it's really not. For instance:
5--3 can be 5-(-3)
So in this case, you have two choices:
Ensure that the input is parenthesized well enough that no two operators meet
If you find something like --, treat it as a special case and investigate it further
If the formulas are in fact in your favor (have well defined parenthesis), then you can just check for repeats. For instance:
--
+-
+*
-+
etc.
If you have a match, it means you have a poorly formatted equation and you can throw it out (or whatever you want to do).
You can check for this, using the following regex. You can add more constraints to the [..][..]. I'm giving you the basics here:
[+\-\*\\/][+\-\*\\/]
which will work for the following examples (and more):
6++8-(52--*3)
6+\8-(52--*3)
6+/8-(52--*3)
An alternative, probably a better one, is just write a parser. it will step by step process the equation to check it's validity. A parser will, if well written, 100% accurate. A Regex approach leaves you to a lot of constraints.
There is no real way to do this with a regex because mathematical expressions inherently aren't regular. Heck, even balancing parens isn't regular. Typically this will be done with a parser.
A basic approach to writing a recursive-descent parser (IMO the most basic parser to write) is:
Write a grammar for a mathematical expression. (These can be found online)
Tokenize the input into lexemes. (This will be done with a regex, typically).
Match the expressions based on the next lexeme you see.
Recurse based on your grammar
A quick Google search can provide many example recursive-descent parsers written in C++.

removing redundancy from regex with multiple possible delimiters

I have a regex in which the same match criteria can apply to multiple delimiters. [], (), and <> are all valid. For example purposes it looks like this:
\[.\]|\(.\)|<.>
Is there some way to remove the redundancy from the above regex? The match criteria inside the delimiters is always the same, but the delimiters themselves may be different.
I'm guessing you're asking because
[[(<].[])>]
isn't exact enough, for obvious reasons.
It's always dangerous to answer, "No, there is no way," because it's hard to be sure one has checked every possible way. One must often come up with a solid proof to answer in such cases.
I'm not sure this is a strong-enough proof, or even a "proof" at all, but consider this (pseudo-)information-theory perspective:
The PCRE engine itself has no knowledge of any relation between the pairs of characters, [], (), and <>. Thus, the expression itself must contain that information, i.e. require at least the six characters []()<> to be present.
Not only that, but for the same reason, the expression itself must define at least two pairings (leaving the third to be implied). I'm not sure how to prove that two alternation operators (|) is the best you can do, but I mean, even if there were a more compact way, you're going to save one character at most, since at least one bit is required to say, "Pairings exist!"
The escaping of meta-characters can only be compacted by the fact that []() can appear within character classes without being escaped, but firstly, that isn't really a "removal of redundancy" as much as it is "a lucky circumstance in syntax", and secondly, you still have to add two characters for the definition of said character class: [].
Therefore, it is my belief that even from a theoretical perspective, if my presumptions about what a regex engine cannot know are true, then one can save at most three characters from the regex you've already provided: \[.\]|\(.\)|<.>.
I eagerly look forward to being corrected by the regex gurus!
If you really are using the PCRE library (via PHP, for example) you can use a DEFINE group to create a subroutine, like so:
'~(?(DEFINE)(?<content>\w+))(?:<(?&content)>|\[(?&content)\]|\((?&content)\))~'
...or more readably:
(?(DEFINE)(?<content>\w+))
(?:
<(?&content)>
|
\[(?&content)\]
|
\((?&content)\)
)
Here's a demo in PHP. It should work in Perl, too.

Is D's grammar really context-free?

I've posted this on the D newsgroup some months ago, but for some reason, the answer never really convinced me, so I thought I'd ask it here.
The grammar of D is apparently context-free.
The grammar of C++, however, isn't (even without macros). (Please read this carefully!)
Now granted, I know nothing (officially) about compilers, lexers, and parsers. All I know is from what I've learned on the web.
And here is what (I believe) I have understood regarding context, in not-so-technical lingo:
The grammar of a language is context-free if and only if you can always understand the meaning (though not necessarily the exact behavior) of a given piece of its code without needing to "look" anywhere else.
Or, in even less rigor:
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
So, for example, C++ fails the context-free test because the meaning of confusing<sizeof(x)>::q < 3 > (2) depends on the value of q.
So far, so good.
Now my question is: Can the same thing be said of D?
In D, hashtables can be created through a Value[Key] declaration, for example
int[string] peoplesAges; // Maps names to ages
Static arrays can be defined in a similar syntax:
int[3] ages; // Array of 3 elements
And templates can be used to make them confusing:
template Test1(T...)
{
alias int[T[0]] Test;
}
template Test2(U...)
{
alias int[U] Test2; // LGTM
}
Test1!(5) foo;
Test1!(int) bar;
Test2!(int) baz; // Guess what? It's invalid code.
This means that I cannot tell the meaning of T[0] or U just by looking at it (i.e. it could be a number, it could be a data type, or it could be a tuple of God-knows-what). I can't even tell if the expression is grammatically valid (since int[U] certainly isn't -- you can't have a hashtable with tuples as keys or values).
Any parsing tree that I attempt to make for Test would fail to make any sense (since it would need to know whether the node contains a data type versus a literal or an identifier) unless it delays the result until the value of T is known (making it context-dependent).
Given this, is D actually context-free, or am I misunderstanding the concept?
Why/why not?
Update:
I just thought I'd comment: It's really interesting to see the answers, since:
Some answers claim that C++ and D can't be context-free
Some answers claim that C++ and D are both context-free
Some answers support the claim that C++ is context-sensitive while D isn't
No one has yet claimed that C++ is context-free while D is context-sensitive :-)
I can't tell if I'm learning or getting more confused, but either way, I'm kind of glad I asked this... thanks for taking the time to answer, everyone!
Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols to be generated by two non-terminals (so for the same strings of symbols to appear in two different contexts with a different meaning) and has nothing to do with type checking.
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad, in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
What people mean by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool support that kind of division"; allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is an relatively easy way to do the cut for C++ and follow quite will the C++ standard, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars, when you have such tools, it is convenient).
I don't know enough about D to know if there is or not a convenient cut of the language rules in a context-free grammar with semantic checks, but what you show is far from proving the case there isn't.
The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b or count.
Let's try and build a concrete example. Consider this simple COBOL statement:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE are not the same valid expressions that can follow VALUE. I could rewrite the second production in my grammar as follows:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression would be a different thing according to whether it was found after 'PICTURE' or after 'VALUE'. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
As you can see this is very different from your understanding. Consider:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a,b and c, in any of the languages for which this is a valid statement, however this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This a simplified version of your C++ example:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename to refer to certain types within templates or to use template when you call member function templates.
There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
First, the concept of grammars are quite intuitive. Imagine a set of rules:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt with this grammar.
Ok, back to main line.
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
(I'm making these syntaxes up)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G and not just G. What does this mean?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable, you generate one statement of the form variable = operand operator operand without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
Next, you need semantics. This is were context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context sensitive grammar (because parsing it is too difficult), but rather bit pieces of code that the parser calls when parsing the input (called action routines. Although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
executable -> variable equal expression
you have to have something like:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable in declaration matches the one being calculated.
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
Type checking
Number of arguments to function
default value to function
if member exists in obj in code: obj.member
Almost anything that's not like: missing ; or }
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
So in summary:
Lexer uses a regular grammar to tokenize input
Parser uses a context-free grammar to make sure the program is in correct structure
Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
For example, one language that I don't remember, used array subscription and function call both with parentheses and therefore it required the parser to go look up the type (context-sensitive related stuff) of the variable and determine which rule (function_call or array_substitution) to take.
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
A familiar example would be natural languages. For example in English, you say:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence and clause are nonterminals here. With this grammar you can create these sentences:
I go there because I want to
or
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb to whatever verb without "looking" at the rest of the sentence.
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you isolate any part of the code and it still can say that it is structurally correct, then it is context-free.
There is a construct in D's lexer:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.
The grammar cannot be context-free if I need I can't tell the type of
an expression just by looking at it.
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.
To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, deceleration before use rules can't be enforced in a context free syntax so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.
These answers are making my head hurt.
First of all, the complications with low level languages and figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
In C++ (order may be off, but that shouldn't invalidate my point):
it has to process macros and other preprocessor stuffs
it has to interpret templates
it finally interprets your code.
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
The reason people will try and defend a language (stating it is context-free) is, because the only exceptions that adds context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.