Is D's grammar really context-free? - c++

I've posted this on the D newsgroup some months ago, but for some reason, the answer never really convinced me, so I thought I'd ask it here.
The grammar of D is apparently context-free.
The grammar of C++, however, isn't (even without macros). (Please read this carefully!)
Now granted, I know nothing (officially) about compilers, lexers, and parsers. All I know is from what I've learned on the web.
And here is what (I believe) I have understood regarding context, in not-so-technical lingo:
The grammar of a language is context-free if and only if you can always understand the meaning (though not necessarily the exact behavior) of a given piece of its code without needing to "look" anywhere else.
Or, in even less rigor:
The grammar cannot be context-free if I can't tell the type of an expression just by looking at it.
So, for example, C++ fails the context-free test because the meaning of confusing<sizeof(x)>::q < 3 > (2) depends on the value of q.
So far, so good.
Now my question is: Can the same thing be said of D?
In D, hashtables can be created through a Value[Key] declaration, for example
int[string] peoplesAges; // Maps names to ages
Static arrays can be defined in a similar syntax:
int[3] ages; // Array of 3 elements
And templates can be used to make them confusing:
template Test1(T...)
{
alias int[T[0]] Test;
}
template Test2(U...)
{
alias int[U] Test2; // LGTM
}
Test1!(5) foo;
Test1!(int) bar;
Test2!(int) baz; // Guess what? It's invalid code.
This means that I cannot tell the meaning of T[0] or U just by looking at it (i.e. it could be a number, it could be a data type, or it could be a tuple of God-knows-what). I can't even tell if the expression is grammatically valid (since int[U] certainly isn't -- you can't have a hashtable with tuples as keys or values).
Any parsing tree that I attempt to make for Test would fail to make any sense (since it would need to know whether the node contains a data type versus a literal or an identifier) unless it delays the result until the value of T is known (making it context-dependent).
Given this, is D actually context-free, or am I misunderstanding the concept?
Why/why not?
Update:
I just thought I'd comment: It's really interesting to see the answers, since:
Some answers claim that C++ and D can't be context-free
Some answers claim that C++ and D are both context-free
Some answers support the claim that C++ is context-sensitive while D isn't
No one has yet claimed that C++ is context-free while D is context-sensitive :-)
I can't tell if I'm learning or getting more confused, but either way, I'm kind of glad I asked this... thanks for taking the time to answer, everyone!

Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols from being generated by two non-terminals (so the same string of symbols can appear in two different contexts with different meanings) and has nothing to do with type checking.
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad; in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
What people mean by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool supports that kind of division"). Allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is a relatively easy way to do the cut for C++, and it follows the C++ standard quite well, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars; when you have tools that do allow them, it is convenient.
I don't know enough about D to know whether there is a convenient cut of the language rules into a context-free grammar with semantic checks, but what you show is far from proving that there isn't.

The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b or count.
Let's try and build a concrete example. Consider this simple COBOL statement:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE are not the same valid expressions that can follow VALUE. I could rewrite the second production in my grammar as follows:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression would be a different thing according to whether it was found after 'PICTURE' or after 'VALUE'. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
As you can see this is very different from your understanding. Consider:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a, b and c, in any of the languages for which this is a valid statement; however, this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This is a simplified version of your C++ example:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename to refer to certain types within templates or to use template when you call member function templates.

There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
First, the concept of grammars is quite intuitive. Imagine a set of rules:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt with this grammar.
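If it helps to see the substitution mechanically, here is a minimal sketch in Python that replays the derivation above by plain string rewriting. The names RULES, apply_rule and derive are made up for this illustration, and the empty string stands in for lambda.

```python
# The rules from above as (left side, right side) pairs.
# Uppercase letters are non-terminals; "" plays the role of lambda.
RULES = [
    ("S", "aT"),
    ("T", "bGt"),
    ("T", "Yd"),
    ("bG", "aYb"),
    ("Y", "c"),
    ("Y", ""),
]

def apply_rule(form, rule_index):
    """Rewrite the leftmost occurrence of the chosen rule's left side."""
    left, right = RULES[rule_index]
    position = form.find(left)
    if position < 0:
        raise ValueError(f"{left!r} does not occur in {form!r}")
    return form[:position] + right + form[position + len(left):]

def derive(start, rule_indices):
    """Apply the chosen rules in order and return every intermediate form."""
    forms = [start]
    for index in rule_indices:
        forms.append(apply_rule(forms[-1], index))
    return forms

steps = derive("S", [0, 1, 3, 5])
print(" -> ".join(steps))
```

Note that rule number 3 (b G -> a Y b) can only fire when G has a b in front of it, which is exactly the kind of dependence on surroundings that a context-free rule never has.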
Ok, back to main line.
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
(I'm making these syntaxes up)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
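As a rough sketch of what such a lexer could look like, here are the token rules above turned into Python regular expressions. Two things are my own assumptions and not part of the made-up syntax: the keywords are listed before variable so that they win the tie, and a word boundary keeps something like "integer" from being split into int plus a variable.

```python
import re

# Order matters: keywords come first because they also match "variable".
TOKEN_SPEC = [
    ("int", r"int\b"),
    ("string", r"string\b"),
    ("number", r"[1-9][0-9]*"),
    ("variable", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("equal", r"="),
    ("plus", r"\+"),
    ("multiply", r"\*"),
    ("whitespace", r"[ \n\t\r]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    position = 0
    while position < len(text):
        match = MASTER.match(text, position)
        if match is None:
            raise SyntaxError(f"unexpected character {text[position]!r} at {position}")
        if match.lastgroup != "whitespace":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens

print(tokenize("int x\nx = 1*y"))
```

The tokenizer happily accepts "x = 1*y" whether or not x and y were ever declared, which is the point: it knows about token shapes and nothing else.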
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G and not just G. What does this mean?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable, you generate one statement of the form variable = operand operator operand without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
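To make that concrete, here is a small hand-written recognizer for the grammar above, sketched in Python. It works on token kinds only (the names variable, number, equal and so on are the ones from the lexer rules), and the function names are invented for the example.

```python
OPERANDS = ("variable", "number")

def parse_statement(kinds, position):
    """Return the position just after one statement, or None if none matches."""
    head = kinds[position:position + 2]
    if head in (["int", "variable"], ["string", "variable"]):
        return position + 2                      # declaration
    body = kinds[position + 2:position + 5]
    if head == ["variable", "equal"] and len(body) == 3:
        left, operator, right = body
        if operator == "multiply" and left in OPERANDS and right in OPERANDS:
            return position + 5                  # integer_type
        if operator == "plus" and left == right == "variable":
            return position + 5                  # string_type
    return None

def accepts(kinds):
    """program -> statement program | lambda"""
    kinds = list(kinds)
    position = 0
    while position < len(kinds):
        position = parse_statement(kinds, position)
        if position is None:
            return False
    return True

# x = 1*y / int x / string y / z = x+y, written as token kinds
program = ("variable equal number multiply variable "
           "int variable string variable "
           "variable equal variable plus variable").split()
print(accepts(program))
```

Nothing in parse_statement ever looks at a previous statement or at the spelling of a variable, so the use-before-declaration and the int-plus-string mix-up go through untouched.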
Next, you need semantics. This is where context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context-sensitive grammar (because parsing it is too difficult), but rather bits and pieces of code that the parser calls when parsing the input (called action routines, although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
executable -> variable equal expression
you have to have something like:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable in declaration matches the one being calculated.
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
Type checking
Number of arguments to function
default value to function
if member exists in obj in code: obj.member
Almost anything that's not like: missing ; or }
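Here is a sketch of what such an action-routine style check might look like in Python for the toy language. The statement tuples are a representation I made up for the example (the parser would normally hand you a tree), with Python ints standing for number literals and strings for variable names.

```python
def check(statements):
    """Collect declare-before-use and type errors in program order."""
    declared = {}   # variable name -> "int" or "string"
    errors = []
    for statement in statements:
        if statement[0] == "declare":
            _, type_name, name = statement
            declared[name] = type_name
            continue
        _, target, left, operator, right = statement
        required = "int" if operator == "*" else "string"
        for operand in (target, left, right):
            if isinstance(operand, int):
                operand_type = "int"            # number literal
            elif operand in declared:
                operand_type = declared[operand]
            else:
                errors.append(f"{operand}: used before declaration")
                continue
            if operand_type != required:
                errors.append(f"{operand}: expected {required}, found {operand_type}")
    return errors

program = [
    ("assign", "x", 1, "*", "y"),
    ("declare", "int", "x"),
    ("declare", "string", "y"),
    ("assign", "z", "x", "+", "y"),
]
for message in check(program):
    print(message)
```

The declared dictionary is the "context": whether a statement is acceptable depends on what came before it, which is what the grammar alone could not express.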
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
So in summary:
Lexer uses a regular grammar to tokenize input
Parser uses a context-free grammar to make sure the program is in correct structure
Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
For example, one language whose name I don't remember used parentheses for both array subscripting and function calls, and therefore required the parser to look up the type (context-sensitive stuff) of the variable to determine which rule (function_call or array_subscript) to take.
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
A familiar example would be natural languages. For example in English, you say:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence and clause are nonterminals here. With this grammar you can create these sentences:
I go there because I want to
or
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb to whatever verb without "looking" at the rest of the sentence.
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you can isolate any part of the code and still say whether it is structurally correct, then it is context-free.

There is a construct in D's lexer:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
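A rough Python sketch of how a hand-written lexer can cope with this: it remembers the opening identifier and searches for that exact identifier again. The function name is made up, and treating the last newline as part of the string body is my reading of the D documentation rather than something stated above.

```python
import re

def lex_delimited_string(source):
    """Lex a q"IDENT ... IDENT" literal at the start of source.

    Returns (body, rest). The closing delimiter is not fixed by the
    grammar; it is whatever identifier the literal opened with.
    """
    opening = re.match(r'q"([A-Za-z_][A-Za-z_0-9]*)\n', source)
    if opening is None:
        raise SyntaxError("not an identifier-delimited string")
    closing = "\n" + opening.group(1) + '"'
    # Start at the opening newline so that an empty body is accepted.
    end = source.find(closing, opening.end() - 1)
    if end < 0:
        raise SyntaxError(f"missing closing delimiter {opening.group(1)!r}")
    body = source[opening.end():end + 1]   # assumed: final newline belongs to the body
    rest = source[end + len(closing):]
    return body, rest

print(lex_delimited_string('q"EOS\nhello\nworld\nEOS" rest'))
```

The remembered identifier is the piece of state that a fixed set of context-free (or regular) rules cannot carry for arbitrary identifiers.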
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.

The grammar cannot be context-free if I can't tell the type of
an expression just by looking at it.
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.

To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, declaration-before-use rules can't be enforced in a context-free syntax, so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.

These answers are making my head hurt.
First of all, the complication with low-level languages, when figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
In C++ (order may be off, but that shouldn't invalidate my point):
it has to process macros and other preprocessor stuff
it has to interpret templates
it finally interprets your code.
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
The reason people will try to defend a language (stating it is context-free) is that the only exceptions that add context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.

Related

Rules & Actions for Parser Generator, and

I am trying to wrap my head around an assignment question, therefore I would very highly appreciate any help in the right direction (and not necessarily a complete answer). I am being asked to write the grammar specification for this parser. The specification for the grammar that I must implement can be found here:
http://anoopsarkar.github.io/compilers-class/decafspec.html
Although the documentation is there, I do not understand a few things, such as how to write (in my .y file) things such as
{ identifier },+
I understand that this would mean a comma-separated list of 1 (or more) occurrences of an identifier, however when I write it as such, the compiler displays an error of unrecognized symbols '+' and ',', being mistaken as whitespace. I tried '{' identifier "},+", but I haven't the slightest clue whether that is correct or not.
I have written the lexical analyzer portion (as it was from the previous segment of the assignment) which returns tokens (T_ID, T_PLUS, etc.) accordingly, however there is this new notion that I must assign 'yylval' to be the value of the token itself. To my understanding, this is only necessary if I am in need of the actual value of the token, therefore I would need the value of an identifier token T_ID, but not necessarily the value of T_PLUS, being '+'. This is done by creating a %union in the parser generator file, which I have done, and have provided the tokens that I currently believe would require the literal token value with the proper yylval assignment.
Here is my lexical analysis code (I could not get it to format properly, I apologize): https://pastebin.com/XMZwvWCK
Here is my parser file decafast.y: https://pastebin.com/2jvaBFQh
And here is the final piece of code supplied to me, the C++ code to build an abstract syntax tree at the end:
https://pastebin.com/ELy53VrW?fbclid=IwAR2cFT_-pGKlVZ2liC-zAe3Fw0BWDlGjrrayqEGV4JuJq1_7nKoe9-TLTlA
To finalize my question, I do not know if I am creating my grammar rules correctly. I have tried my best to follow the specification in the above website, but I can't help but feel that what I am writing is completely wrong. My compiler is spitting out nothing but "warning: rule useless in grammar" for almost every (if not every) rule.
If anyone could help me out and point me in the right direction on how to make any progress, I would highly, highly appreciate it.
The decaf specification is written in (an) Extended Backus Naur Form (EBNF), which includes a number of convenience operators for repetition, optionality and grouping. These are not part of the bison/yacc syntax, which is pretty well limited to BNF. (Bison/yacc do allow the alternation operator |, but since there is no way to group subpatterns, alternation can only be used at the top-level, to combine two productions for the same non-terminal.)
The short section at the beginning of the specification which describes EBNF includes a grammar for the particular variety of EBNF that is being used. (Since this grammar is itself recursively written in the same EBNF, there is a need to apply a bit of inductive reasoning.) When it says, for example,
CommaList = "{" Expression "}+," .
it is not saying that "}+," is the peculiar spelling of a comma-repetition operator. What it is saying is that when you see something in the Decaf grammar surrounded by { and }+,, that should be interpreted as describing a comma-separated list.
For example, the Decaf grammar includes:
FieldDecl = var { identifier }+, Type ";" .
That means that a FieldDecl can be (amongst other possibilities) the token var followed by a comma-separated list of identifier tokens and then followed by a Type and finally a semicolon.
As I said, bison/yacc don't implement the EBNF operators, so you have to find an equivalent yourself. Since BNF doesn't allow any form of grouping -- and a list is a grouped subexpression -- we need to rewrite the subexpression of a production as a new non-terminal. Also, I suppose we need to use the tokens defined in spec (although bison allows a more readable syntax).
So to yacc-ify this EBNF production, we first introduce the new non-terminal and replace the token names:
FieldDecl: T_VAR IdentifierList Type T_SEMICOLON
Which leaves the definition of IdentifierList. Repetition in BNF is always produced with recursion, following a very simple model which uses two productions:
the base, which is the simplest possible repetition (usually either nothing or a single list item), and
the recursion, which describes a longer possibility by extending a shorter one.
In this case, the list must have at least one item, and we extend by adding a comma and another item:
IdentifierList
: T_ID /* base case */
| IdentifierList T_COMMA T_ID /* Recursive extension */
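If it helps to see the same base-plus-extension shape outside of yacc, here is a small Python sketch that recognises the list by hand. The token tuples and the function name are hypothetical (T_INTTYPE in the usage line is just a stand-in for whatever token starts Type), and bison of course generates its own table-driven code rather than anything like this.

```python
def parse_identifier_list(tokens, position=0):
    """Parse T_ID (T_COMMA T_ID)* and return (names, next_position)."""
    def expect_identifier(at):
        if at >= len(tokens) or tokens[at][0] != "T_ID":
            raise SyntaxError(f"expected T_ID at token {at}")
        return tokens[at][1]

    names = [expect_identifier(position)]               # base case: one T_ID
    position += 1
    while position < len(tokens) and tokens[position][0] == "T_COMMA":
        names.append(expect_identifier(position + 1))   # extension: T_COMMA T_ID
        position += 2
    return names, position

tokens = [("T_ID", "a"), ("T_COMMA", ","), ("T_ID", "b"),
          ("T_INTTYPE", "int"), ("T_SEMICOLON", ";")]
print(parse_identifier_list(tokens))
```

Each pass through the loop corresponds to one application of the recursive production, growing the already-parsed list by one comma and one identifier.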
The point of this exercise is to develop your skills in thinking grammatically: that is, factoring out the syntax and semantics of the language. So you should try to understand the grammars presented, both for Decaf and for the author's version of EBNF, and avoid blindly copying code (including grammars). Good luck!

Is it correct to say that there is no implied ordering in the presentation of grammar options in the C++ Standard?

I'll try to explain my question with an example. Consider the following grammar production in the C++ standard:
literal:
   integer-literal
   character-literal
   floating-point-literal
   string-literal
   boolean-literal
   pointer-literal
   user-defined-literal
Once the parser identifies a literal as an integer-literal, I always thought that the parser would just stop there. But I was told that this is not true. The parser will continue parsing to verify whether the literal could also be matched with a user-defined-literal, for example.
Is this correct?
Edit
I decided to include this edit as my interpretation of the Standard, in response to @rici's excellent answer below, although with a result that is the opposite of the one advocated by the OP.
One can read the following in [stmt.ambig]/1 and /3 (emphases are mine):
[stmt.ambig]/1
There is an ambiguity in the grammar involving
expression-statements and declarations: An expression-statement with a
function-style explicit type conversion as its leftmost subexpression
can be indistinguishable from a declaration where the first declarator
starts with a (. In those cases the statement is a declaration.
That is, this paragraph states how ambiguities in the grammar should be treated. There are several other ambiguities mentioned in the C++ Standard, but only three that I know are ambiguities related to the grammar, [stmt.ambig], [dcl.ambig.res]/1, a direct consequence of [stmt.ambig] and [expr.unary.op]/10, which explicitly states the term ambiguity in the grammar.
[stmt.ambig]/3:
The disambiguation is purely syntactic; that is, the meaning of the
names occurring in such a statement, beyond whether they are
type-names or not, is not generally used in or changed by the
disambiguation. Class templates are instantiated as necessary to
determine if a qualified name is a type-name. Disambiguation
precedes parsing, and a statement disambiguated as a declaration may
be an ill-formed declaration. If, during parsing, a name in a template
parameter is bound differently than it would be bound during a trial
parse, the program is ill-formed. No diagnostic is required. [ Note:
This can occur only when the name is declared earlier in the
declaration. — end note ]
Well, if disambiguation precedes parsing there is nothing that could prevent a decent compiler from optimizing parsing by just considering that the alternatives present in each definition of the grammar are indeed ordered. With that in mind, the first sentence in [lex.ext]/1 below could be eliminated.
[lex.ext]/1:
If a token matches both user-defined-literal and another literal kind,
it is treated as the latter. [ Example: 123_km is a
user-defined-literal, but 12LL is an integer-literal. — end example ]
The syntactic non-terminal preceding the ud-suffix in a
user-defined-literal is taken to be the longest sequence of characters
that could match that non-terminal.
Note also that this paragraph doesn't mention ambiguity in the grammar, which for me at least, is an indication that the ambiguity doesn't exist.
There is no implicit ordering of productions in the C++ presentation grammar.
There are ambiguities in that grammar, which are dealt with on a case-by-case basis by text in the standard. Note that the text of the standard is normative; the grammar does not stand alone, and it does not override the text. The two need to be read together.
The standard itself points out that the grammar as summarized in Appendix A:
… is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (8.9, 9.2, 11.8) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs. (Appendix A, paragraph 1)
That's not a complete list of the ambiguities resolved in the text of the standard, because there are also rules about lexical ambiguities. (See below.)
Almost all of these ambiguity resolution clauses are of the form "if both P and Q applies, choose Q", and thus would be unnecessary were there an implicit ordering of grammar alternatives, since the correct parse could be guaranteed simply by putting the alternatives in the correct order. So the fact that the standard feels the need to dedicate a number of clauses to ambiguity resolution is prima facie evidence that alternatives are not implicitly ordered. [Note 1]
The C++ standard does not explicitly name the grammar formalism being used, but it does credit the antecedents which allows us to construct a historical argument. The formalism used by the C++ standard was inherited from the C standard and the description in Kernighan & Ritchie's original book on the (then newly-minted) C language. K&R wrote their grammar using the Yacc parser generator, and the original C grammar is basically a Yacc grammar file. Yacc uses the LALR(1) algorithm to construct a parser from a context-free grammar (CFG), and its grammar files are a concrete representation of that grammar written in what has come to be known as BNF (although there is some historical ambiguity about what the letters in BNF actually stand for). BNF does not have any implicit ordering of rules, and the formalism does not allow any way to write an explicit ordering or any other disambiguation rule. (A BNF grammar must be unambiguous in order to be mechanically parsed; if it is ambiguous, the LALR(1) algorithm will fail to generate a parser.)
Yacc does go a bit outside of the box. It has some automatic disambiguation rules, and one mechanism to provide explicit disambiguation (operator precedence). But Yacc's disambiguation has nothing to do with the ordering of alternatives either.
In short, ordered alternatives were not really a feature of any grammar formalism until 2002 when Bryan Ford proposed packrat parsing, and subsequently formalised a class of grammars which he called "Parsing Expression Grammars" (PEGs). The PEG algorithm does implicitly order alternatives, by insisting that the right-hand alternative in an alternation only be attempted if the left-hand alternative failed to match. For this reason, the PEG alternation operator (or "ordered alternation" operator) is generally written as / instead of |, avoiding confusion with the traditional unordered alternation syntax.
A key feature of the PEG algorithm is that it is always deterministic. Every PEG grammar can be deterministically applied to a source text without ambiguity. (That doesn't mean that the grammar will give you the parse you wanted, of course. It just means that it will never give you a list of parses and let you select the one you want.) So grammars written in PEG cannot be accompanied by textual rules which disambiguate, because there are no ambiguities.
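To illustrate the difference, here is a toy sketch in Python with hand-rolled combinators (the names literal, ordered_choice and all_matches are invented for this example and do not come from any PEG library). The ordered version commits to the first alternative that succeeds; the unordered version reports every alternative that matches, which is where ambiguity can show up.

```python
def literal(text):
    """Parser that matches text exactly and returns the end position, or None."""
    def parse(source, position):
        if source.startswith(text, position):
            return position + len(text)
        return None
    return parse

def ordered_choice(*alternatives):
    """PEG-style '/': try alternatives left to right, commit to the first success."""
    def parse(source, position):
        for alternative in alternatives:
            end = alternative(source, position)
            if end is not None:
                return end
        return None
    return parse

def all_matches(*alternatives):
    """Unordered '|': report the end position of every alternative that matches."""
    def parse(source, position):
        ends = [alternative(source, position) for alternative in alternatives]
        return [end for end in ends if end is not None]
    return parse

peg = ordered_choice(literal("a"), literal("ab"))
cfg = all_matches(literal("a"), literal("ab"))
print(peg("ab", 0), cfg("ab", 0))
```

With the ordered operator the second alternative "ab" is never even tried on this input, so swapping the two alternatives changes the result; with the unordered one, the order in which they are written makes no difference to the set of matches.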
I mention this because the existence and popularity of PEG have to some extent altered the perception of the meaning of the alternation operator. Before PEG, we probably wouldn't be having this kind of discussion at all. But using PEG as a guide to interpreting the C++ grammar formalism is ahistoric and unjustifiable; the roots of the C++ grammar go back to at least 1978, at least a quarter of a century before PEG.
Lexical ambiguities, and the clauses which resolve them
[lex.pptoken] (§5.4) paragraph 3 lays down the fundamental rules for token recognition, which is a little more complicated than the traditional "maximal munch" principle which always recognises the longest possible token starting immediately after the previously recognised token. It includes two exceptions:
The sequence <:: is treated as starting with the token < rather than the longer token <: unless it is the start of <::> (treated as <:, :>) or <::: (treated as <:, ::). That might all make more sense if you mentally replace <: with [ and :> with ], which is the intended syntactic equivalence.
A raw string literal is terminated by the first matching delimiter sequence. This rule could in theory be written in a context-free grammar only because there is an explicit limit on the length of termination sequences, which means that the theoretical CFG would have on the order of 88^16 rules, one for each possible delimiter sequence. In practice, this rule cannot be written as such, and it is described textually, along with the 16-character limit on the length of the d-char-sequence.
[lex.header] (§5.8) avoids the ambiguity between header-names and string-literals (as well as certain token sequences starting with <) by requiring header-name to only be recognised in certain contexts, including an #include preprocessing directive. (The section does not actually say that the string-literal should not be recognised, but I think that the implication is clear.)
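The first of these exceptions is small enough to sketch. The Python below implements longest-match over a tiny, hand-picked subset of punctuators plus the <:: special case; the list and the function names are assumptions made for the illustration, not anything taken from a real compiler.

```python
# A deliberately tiny subset of the punctuators, enough to show the rule.
PUNCTUATORS = ["<:", ":>", "::", "<", ">", ":"]

def next_punctuator(source, position):
    """Longest match, except that '<::' not followed by ':' or '>' starts with '<'."""
    if (source.startswith("<::", position)
            and source[position + 3:position + 4] not in (":", ">")):
        return "<"
    for candidate in sorted(PUNCTUATORS, key=len, reverse=True):
        if source.startswith(candidate, position):
            return candidate
    raise SyntaxError(f"no punctuator at position {position}")

def split_punctuators(source):
    """Split a string made only of the punctuators above into tokens."""
    tokens = []
    position = 0
    while position < len(source):
        token = next_punctuator(source, position)
        tokens.append(token)
        position += len(token)
    return tokens

for text in ("<::", "<::>", "<:::"):
    print(text, split_punctuators(text))
```

Without the special case, plain maximal munch would turn the start of something like a template argument list opening with a globally qualified name into <: followed by :, which is why the exception exists.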
[lex.ext] (§5.13.8) paragraph 1 resolves the ambiguities involved with user-defined-literals, by requiring that:
the user-defined-literal rule is only recognised if the token cannot be recognised as some other kind of literal, and
the decomposition of the user-defined-literal into a literal followed by a ud-suffix follows the longest-token rule, described above.
Note that this rule is not really a tokenisation rule, because it is applied after the source text has been divided into tokens. Tokenisation is done in translation phase 3, after which the tokens are passed through the preprocessing directives (phase 4), rewriting of escape sequences and UCNs (phase 5), and concatenation of string literals (phase 6). Each token which emerges from phase 6 must then be reinterpreted as a token in the syntactic grammar, and it is at that point that literal tokens will be classified. So it's not necessary that §5.13.8 clarify what the extent of the token being categorised is; the extent is already known and the converted token must use exactly all of the characters in the preprocessing token. Thus it's quite different from the other ambiguities in this list, but I left it here because it is so present in the original question and in the various comment threads.
Notes:
Curiously, in almost all of the ambiguity resolution clauses, the preferred alternative is the one which appears later in the list of alternatives. For example, §8.9 explicitly prefers declarations to expressions, but the grammar for statement lists expression-statement long before declaration-statement. Having said that, correctly parsing C++ requires a more sophisticated algorithm than just "try to parse a declaration and if that fails, then try to parse as an expression," because there are programs which must be parsed as a declaration with a syntax error (see the example at [stmt.ambig]/3).
No ordering is either implied or necessary.
All seven kinds of literal are distinct. No token that meets the definition of any of them can meet the definition of any other. For example, 42 is an integer-literal and cannot be a floating-point-literal.
How a compiler determines what a token is is an implementation detail that the standard doesn't address, and doesn't need to.
If there were an ambiguity, so that for example the same token could be either an integer-literal or a user-defined-literal, either the language would have to have a rule to disambiguate it, or it would be a bug in the grammar.
UPDATE: There is in fact such an ambiguity. As discussed in comments, 42ULL satisfies the syntax of either an integer-literal or a user-defined-literal. This ambiguity is resolved, not by the ordering of the grammar productions, but by an explicit statement:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
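To make the 42ULL case concrete, here is a rough sketch of how a classifier could apply that statement. It is not how any particular compiler does it, and it rests on several simplifying assumptions: only decimal digits are handled (no hex, octal, binary, digit separators or floating-point forms), the integer suffixes are the pre-C++23 ones, and the names LiteralKind and classify_decimal are invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class LiteralKind { Integer, UserDefined, NotALiteral };

// Integer suffixes up to C++20 (C++23 adds z/Z, left out of this sketch).
bool is_integer_suffix(const std::string& s) {
    static const char* const kSuffixes[] = {
        "",   "u",   "U",   "l",   "L",   "ll",  "LL",  "ul",
        "uL", "Ul",  "UL",  "ull", "uLL", "Ull", "ULL", "lu",
        "lU", "Lu",  "LU",  "llu", "llU", "LLu", "LLU"};
    for (const char* suffix : kSuffixes)
        if (s == suffix) return true;
    return false;
}

// A ud-suffix is syntactically just an identifier.
bool is_identifier(const std::string& s) {
    if (s.empty()) return false;
    if (!std::isalpha(static_cast<unsigned char>(s[0])) && s[0] != '_') return false;
    for (char c : s)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_') return false;
    return true;
}

LiteralKind classify_decimal(const std::string& token) {
    std::size_t i = 0;
    while (i < token.size() && std::isdigit(static_cast<unsigned char>(token[i]))) ++i;
    if (i == 0) return LiteralKind::NotALiteral;

    const std::string suffix = token.substr(i);
    // The explicit rule: if the token also matches another literal kind,
    // that kind wins, so the integer-literal check comes first.
    if (is_integer_suffix(suffix)) return LiteralKind::Integer;
    if (is_identifier(suffix)) return LiteralKind::UserDefined;
    return LiteralKind::NotALiteral;
}
```

The order of the two checks in classify_decimal is an implementation choice that encodes the textual rule; it says nothing about the order of alternatives in the grammar.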
The section on syntactic notation in the standard only says this about what it means:
In the syntax notation used in this document, syntactic categories are indicated by italic type, and literal words and characters in constant width type. Alternatives are listed on separate lines except in a few cases where a long set of alternatives is marked by the phrase “one of”. If the text of an alternative is too long to fit on a line, the text is continued on subsequent lines indented from the first one. An optional terminal or non-terminal symbol is indicated by the subscript “opt”, so
{ expressionopt }
indicates an optional expression enclosed in braces.
Note that the statement considers the terms in grammars to be "alternatives", rather than a list or even an ordered list. There is no statement about ordering of the "alternatives" at all.
As such, this strongly suggests that there is no ordering at all.
Indeed, the presence throughout the standard of specific rules to disambiguate cases where multiple terms match also suggests that the alternatives are not written as a prioritized list. If the alternatives were some kind of ordered list, this statement would be redundant:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.

How to test if a string is a valid C++(ish) expression?

I am writing a program in C++ that needs to be able to test if a string (probably std::string) is a valid C++ expression. Variables can be checked if they have been declared (bool variableDeclared(std::string identifier)) and their type can also be checked (std::string variableType(std::string identifier)). The variableType function returns a string based on how it would be declared in C++ ("bool", "double", "char", etc).
The expression doesn't need to be evaluated but only tested to see if it is valid. The function only needs to support character literals, string literals, number literals, brackets, simple operators (+, -, *, /, ! (logic not), &&, ||, >, <, ==), and variables of type double, std::string (no function calls needed), bool and char. It is also not required to support string concatenation.
The desired result would be a function that is something like bool validExpression(std::string expression). It is also preferable that it allows me to modify the operations (for example I could change "==" to "equal-to").
How would I implement this? Is there a library that could do something like this, a regex statement or is it simply a matter of a long function with lots of if statements?
Formally, your situation is: you have a grammar which describes the language of expressions which you want to validate, and a word for which you want to determine whether it belongs in that language. This is a job for a parser of that language.
You could hand-cook something like a recursive-descent LL(1) parser, or use a tool to generate a parser. A well-known example of such a tool is Bison for generating LALR(1) parsers. Wikipedia has a long parser generator list.
Technical terms are used above mainly to provide entry points for googling.
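As a rough idea of what the hand-cooked route looks like, here is a minimal recursive-descent sketch. It only checks syntax for a subset of what the question asks for (numbers, identifiers, parentheses, prefix ! and -, and the listed binary operators); character and string literals and all type checks are left out. Operator precedence is ignored because it does not affect validity. The grammar and the Validator helper are assumptions made for the example; only the validExpression name comes from the question.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

namespace {

bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
bool is_name_start(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
bool is_name_char(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }

// expression := unary (binary-op unary)*
// unary      := ('!' | '-') unary | primary
// primary    := number | identifier | '(' expression ')'
struct Validator {
    const std::string& src;
    std::size_t pos;

    explicit Validator(const std::string& s) : src(s), pos(0) {}

    void skip_spaces() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    }

    // Consumes `text` if it appears at the current position.
    bool accept(const std::string& text) {
        skip_spaces();
        if (src.compare(pos, text.size(), text) != 0) return false;
        pos += text.size();
        return true;
    }

    bool binary_operator() {
        static const char* const kOps[] = {"&&", "||", "==", "+", "-", "*", "/", "<", ">"};
        for (const char* op : kOps)
            if (accept(op)) return true;
        return false;
    }

    bool primary() {
        if (accept("(")) return expression() && accept(")");
        if (pos < src.size() && is_digit(src[pos])) {
            while (pos < src.size() && is_digit(src[pos])) ++pos;
            if (pos + 1 < src.size() && src[pos] == '.' && is_digit(src[pos + 1])) {
                ++pos;
                while (pos < src.size() && is_digit(src[pos])) ++pos;
            }
            return true;
        }
        if (pos < src.size() && is_name_start(src[pos])) {
            // A declared-variable check would be plugged in here.
            while (pos < src.size() && is_name_char(src[pos])) ++pos;
            return true;
        }
        return false;
    }

    bool unary() {
        if (accept("!") || accept("-")) return unary();
        return primary();
    }

    bool expression() {
        if (!unary()) return false;
        while (binary_operator())
            if (!unary()) return false;
        return true;
    }
};

}  // namespace

bool validExpression(const std::string& expression) {
    Validator v(expression);
    if (!v.expression()) return false;
    v.skip_spaces();
    return v.pos == expression.size();  // reject trailing garbage
}
```

Changing "==" to "equal-to" would then amount to editing the operator table, which is one reason to keep the operators in data rather than in a chain of if statements.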
You would start from defining your language more or less formally. (A language is a set of strings). A good way to define a language is to specify its context-free grammar. Describe additional conditions (like the requirement that variables must be declared, and of the right type) informally in prose.
The next step would be building a parser for your grammar specified at the previous step. There are several tools for building parsers from grammars automatically, from yacc/bison to boost::spirit.
After building and checking the parser, implement the informally-specified rules and plug them into your parser code/data.
Normally the next step, building an evaluator, would probably be the easiest part of writing a simple interpreter, but you say you don't need one.
Describing your language as "just like C++ only with certain bits taken out" could be a preliminary step to the sequence outlined above. It is however not recommended to start out from C++ if you can help it. C++ is an extremely hard language to specify formally, and its parsers tend to be rather hairy, due to its convoluted declaration syntax.
You can run a compiler as a sub-process of your application. All you have to do is pass the right arguments and parse the response properly.

How to allow ok spaces and ban bad ones with boost spirit

Motivating examples
good:
SELECT a, b, c d,e FROM t1
bad:
SE L ECT a, b, c d,e FR OM t1
SELECTa, b, c d,eFROMt1
So as you can see, the problem here is that some spaces are OK (between SELECT and a,b,c, for example), some are bad (SE L ECT), and some are necessary (after/before a keyword).
So my question is what idioms to use here: if I use a space skipper with phrase_parse, it will allow bad spaces, and if I want to allow good spaces without a skipper, the parsers become littered with *char_(' ')
You need to mark your keywords as qi::lexeme[].
Besides, you probably want something like boost::spirit::repository::qi::distinct to avoid parsing SELECT2 as SELECT followed by 2.
See e.g.
Boost spirit skipper issues
boost::spirit::qi keywords and identifiers
What you're looking for is, well, parsing.
It's not about accepting/rejecting "good" or "bad" spaces. It is about trying to recognize what's entered, and rejecting it if you can't.
In this case, let's start with a (thoroughly simplified) grammar for the statement in question:
select_statement ::= 'select' field_list 'from' table
So, you read in the first token. If it's SE or SELECTa, you reject the statement as invalid, because neither of those fits your grammar. Almost any decent parser generator (including, but certainly not limited to, Spirit) makes this fairly trivial--you specify what is acceptable, and what to do if the input is not acceptable, and it deals with invoking that for input that doesn't fit the specified grammar.
As for how you do the tokenization to start with, it's typically pretty simple, and usually can be based on regular expressions (e.g., many languages have been implemented using lex and derivatives like Flex, which use regexen to specify tokenization).
For something like this, you directly specify the keywords for your language, so you'd have something that says when it matches 'select', it should return that as a token. Then you have something more general for an identifier that typically runs something like [_a-zA-Z][_a-zA-Z0-9]* ("an identifier starts with an underscore or letter, followed by an arbitrary number of underscores, letters, or digits"). In the cases above, this would be entirely sufficient to find and return the "SE" and "SELECTa" as the first tokens in the "bad" examples.
Your parser would then detect that the first thing it received was an identifier instead of a keyword, at which point it would (presumably) be rejected.
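To show the idea without any parser generator, here is a small hand-written sketch in plain C++ (not Spirit). The tokenizer applies the identifier rule above, and the checker follows the simplified grammar, with two assumptions of mine: field_list is a comma-separated list of names, each optionally followed by an alias (to accept "c d" from the good example), and table is a single name. All function names are made up for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits the input into words ([_a-zA-Z][_a-zA-Z0-9]*) and single
// non-space characters; whitespace only separates tokens.
std::vector<std::string> tokenize(const std::string& input) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_'))
                ++i;
            tokens.push_back(input.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, input[i++]));
        }
    }
    return tokens;
}

bool is_word(const std::string& t) {
    return !t.empty() && (std::isalpha(static_cast<unsigned char>(t[0])) || t[0] == '_');
}

// Case-insensitive comparison against a lowercase keyword.
bool is_keyword(const std::string& token, const std::string& keyword) {
    if (token.size() != keyword.size()) return false;
    for (std::size_t i = 0; i < token.size(); ++i)
        if (std::tolower(static_cast<unsigned char>(token[i])) != keyword[i]) return false;
    return true;
}

// select_statement ::= 'select' field_list 'from' table
bool is_select_statement(const std::string& input) {
    const std::vector<std::string> t = tokenize(input);
    std::size_t pos = 0;
    auto at_name = [&] {
        return pos < t.size() && is_word(t[pos]) && !is_keyword(t[pos], "from");
    };

    if (t.empty() || !is_keyword(t[pos], "select")) return false;
    ++pos;
    for (;;) {
        if (!at_name()) return false;
        ++pos;
        if (at_name()) ++pos;  // optional alias, e.g. "c d"
        if (pos < t.size() && t[pos] == ",") { ++pos; continue; }
        break;
    }
    if (pos >= t.size() || !is_keyword(t[pos], "from")) return false;
    ++pos;
    if (!at_name()) return false;
    ++pos;
    return pos == t.size();
}
```

Nothing here mentions "good" or "bad" spaces: "SE" and "SELECTa" simply come out of the tokenizer as identifiers, and the statement is rejected because it does not start with the select keyword.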

Counterpart of regular expressions for parsing nested structures

Regular expressions are a standard tool used for parsing strings across many languages. However, their scope of applicability is limited. Regular expressions can only match a list. There is no way to describe arbitrarily deep nested structures using regular expressions. Question: what is a technology/framework as widely used and as standard as regular expressions are that can match tree structures (produce an AST)?
Regular expressions describe a finite-state automaton.
Since the late 1960's, the "bread and butter" of parsing (though not necessarily the "state of the art") has been push-down automata generated by parser generators according to "LR" algorithms like LALR(1).
The connection to regular expressions is this: the parsing machine does in fact use rules very similar to regular expressions in order to recognize viable prefixes. The "shift" state transitions among the "core LR(0) items" constitute a finite automaton, and can be described by a regular expression. The recursion is handled thanks to the semantic action of pushing symbols onto a stack when doing the "shifts", and removing them ("reducing"). Reductions rewrite a portion of the stack, and perform a "goto" to another state. This type of goto, together with the stack, is absent in the regular expression automaton.
Parsing Expression Grammars are also related to regular expressions. Regular expressions themselves can be endowed with recursion. Firstly, we can take pieces of regular expressions and give them names, and then construct bigger regular expressions by writing expressions which invoke these names. (Such a feature is found in the lex tool, where you can define a named expression like letters [A-Za-z]+ and refer to it later as {letters}.) Now suppose you allow circular references, like letters [A-Za-z]{letters}?. You now have recursion; the only problem is to adjust the model in order to implement it.
Implementations of so-called "regular expressions" in various modern languages and environments in fact support recursion. Perl-compatible regular expressions (PCRE) support it, for instance.
Expressions that feature recursion or backreferencing are not handled by the classic NFA compilation route (possibly converted to a DFA); they cannot be.
How the above letters recursion can be handled is with actual recursion. The ? operator can be implemented as a function which tries to match its respective argument object. If it succeeds, then it consumes whatever it has matched, otherwise it consumes nothing. That is to say, the regular expression can be converted to a syntax tree, and interpreted "as is" rather than compiled to a state machine (or trivially compiled to functions corresponding to the nodes of the tree), and such interpretation can naturally handle recursion. The interpretation then constitutes, effectively, a syntax-directed recursive-descent parser. (Note how I avoided left recursion in defining letters to make that example compatible with this approach).
Example: parenthesis-matching regex:
par-match := ({par-match})|
This gets compiled to a tree:
          branch-op      <-- "par-match" name points at this node
          /       \
   catenate-op   <empty>
     /     \
   "("   catenate-op
           /     \
   {par-match}   ")"
This can then be converted to a recursive descent parser, or interpreted directly.
Pattern matching starts by invoking the top-level "branch-op". This operator simply tries all of the alternatives. Suppose the input is empty. Then the left alternative will fail: it demands an open parenthesis. So then the right alternative will succeed: empty matches empty. (The operators either "fail" or indicate "success" and consume input.)
But suppose your input is (()). The left catenate-op will in turn invoke its left subtree, which matches and consumes the left parenthesis, leaving ()). It will then invoke its right subtree, another catenate-op. This catenate-op matches its left subtree, which triggers recursion into the top level via the named par-match references. That recursion will match and consume (), leaving ). The catenate-op then invokes its right subtree which matches ). Control returns up to branch-op. (Though the left side of branch-op matched something, branch-op must still try the other alternative; more than one branch can match, and some can match longer than others.)
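The walk through the tree can be written down as an ordinary recursive function. The following is a minimal sketch under one simplification: it commits to the left alternative when that succeeds and only then falls back to the empty one, rather than collecting every alternative. That is enough for this particular pattern, because an input that starts with ( cannot also be matched by taking the inner par-match as empty. The names par_match and matches_entirely are invented for the example.

```cpp
#include <cstddef>
#include <string>

// par-match := ({par-match})|
// Returns how many characters are matched starting at `pos`. The match
// never fails outright because the right alternative is empty.
std::size_t par_match(const std::string& s, std::size_t pos) {
    // Left alternative: "(" followed by {par-match} followed by ")".
    if (pos < s.size() && s[pos] == '(') {
        const std::size_t inner = par_match(s, pos + 1);  // recursion via the name
        const std::size_t close = pos + 1 + inner;
        if (close < s.size() && s[close] == ')') return inner + 2;
    }
    // Right alternative: <empty>, which matches and consumes nothing.
    return 0;
}

// The whole input matches if the pattern consumes all of it.
bool matches_entirely(const std::string& s) {
    return par_match(s, 0) == s.size();
}
```

For the input (()), the outer call consumes (, the recursive call matches () and reports 2 characters, and the outer call then finds the closing ) and reports 4. The call stack plays the role that the explicit stack plays in a push-down automaton.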
This is closely related to how Parsing Expression Grammars work.
Practically speaking, the recursive definition could be encoded into the regex syntax somehow. Say we invent some new operator like (?name:definition) which means "match definition, which is allowed to contain invocations of itself via name". The invocation syntax could be (*name), so that we can write the par-match example as (?par-match:\((*par-match)\)|). The combinations (? and (* are invalid under "classic" regex syntax and so we can use them for extension.
As a final note, regexes correspond to grammars. That is the fundamental connection between regexes and parsing. That is to say, regexes correspond to a particular subset of grammars which describe only regular languages. An example of a grammar which describes a regular language:
S -> A | B
B -> b
A -> A a | c
Although there is A -> A ... recursion, this is still regular, and corresponds to the regex ca*|b, which is just a more compact way to denote the same language. The grammar lets us notate languages that aren't regular and for which we can't write a regex, but as we have seen, we can extend the regex notation and semantics to express some of these things. Regular expressions aren't separate from grammars. The two aren't counterparts, but rather one is a special case or subset of the other.
Parser generators like Yacc, Bison, and derivatives are what you're after. They aren't as widespread as regular expressions because they generate actual C code. There are translations, like Jison for example, which implement the Yacc/Bison syntax using JavaScript. I know there are similar tools for other languages.
I get the impression Parsing expression grammar systems are up and coming though.