C++ Array Definition with Lower and Upper Bound? - c++

My daughter's 12th standard C++ textbook says that
the notation of an array can (also) be given as follows: Array name
[lower bound L, upper bound U]
This was a surprise to me. I know Pascal has this notation, but C++? I had never seen it before. I wrote a quick program in her prescribed compiler (the ancient Turbo C++ 4.5), and it does not support this. I did not find this syntax in Stanley Lippman's book either, and an internet search did not turn it up. Or maybe I didn't search correctly?
So, is it a valid declaration?

This is not valid. From the draft C++ standard, section 8.3.4 Arrays, the declaration must be of this form:
D1 [ constant-expression(opt) ] attribute-specifier-seq(opt)
and we can see from section 5.19 Constant expressions that the grammar for a constant expression is:
constant-expression:
conditional-expression
This grammar does not allow us to get to the comma operator either to do something like this:
int a[ 1, 2 ] ;
^
as others have implied, since there is no path from conditional-expression to the comma operator. However, if you add parentheses you can reach the comma operator, because conditional-expression eventually derives primary-expression, which includes a parenthesized expression ( ). So the following would be valid:
int a[ (1, 2) ] ;
^ ^
Note, in C++03 you were explicitly forbidden from using the comma operator in a constant expression.
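For what it's worth, the parenthesized form really does compile (in C++11 and later, where the comma operator is permitted in constant expressions) and produces an array of two elements:

```cpp
#include <cassert>

// The unparenthesized form `int a[1, 2];` does not parse: the array bound
// is a constant-expression, i.e. a conditional-expression, which has no
// route to the comma operator. Parentheses reach primary-expression, so
// the comma operator becomes available and the bound is its right operand.
int a[(1, 2)];  // the comma expression evaluates to 2, so `a` has 2 elements

static_assert(sizeof(a) / sizeof(a[0]) == 2, "bound is 2, the comma result");
```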

No, it is not valid, unless someone has overloaded the comma operator, and possibly [] as well, which is very unlikely. (Boost Spirit does both, but for very different reasons.)
Without any overloading at all, Array[x, y] is syntactically invalid, since the size must be a constant-expression, and a constant-expression cannot contain the comma operator; a comma expression is only reachable from expression, not from conditional-expression.
Burn the book and put Stroustrup in her Christmas stocking!
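On the Pascal point: if lower/upper bounds are genuinely wanted, they can be emulated in C++. A minimal sketch, where `BoundedArray` is an illustrative name rather than any standard facility:

```cpp
#include <cassert>

// C++ has no `arr[L..U]` declaration syntax, but Pascal-style lower and
// upper bounds can be emulated with a small wrapper that offsets the index.
template <typename T, int Lower, int Upper>
struct BoundedArray {
    static_assert(Lower <= Upper, "lower bound must not exceed upper bound");
    T data[Upper - Lower + 1] = {};

    T& operator[](int i) {
        assert(i >= Lower && i <= Upper);  // bounds check, as Pascal would do
        return data[i - Lower];
    }
};
```

Usage would look like `BoundedArray<int, 1980, 1989> decade; decade[1984] = 42;`.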

Related

warning: top-level comma expression in array subscript changed meaning in C++23 [-Wcomma-subscript]

I have overloaded the 2D subscript operator in one of my classes. And for that I use the -std=c++23 option to compile the program.
Now when calling this operator, GCC complains:
warning: top-level comma expression in array subscript changed meaning in C++23 [-Wcomma-subscript]
331 | m_characterMatrix[ x1, y1 ] = ch.value( );
| ~~~~~~~~~~~~~~~~~^
So what is this warning for? Should I take it seriously?
The warning is there because the compiler's assumption is that you might have been expecting the pre-C++23 behaviour - that is, the "traditional" comma operator evaluation.
(While common sense would clearly indicate that you meant to use your overload and there is no problem, computer programs don't possess common sense.)
You can disable that warning with -Wno-comma-subscript.
(since C++20) Using an unparenthesized comma expression as the second (right) argument of a subscript operator is deprecated. For example, a[b, c] is deprecated and a[(b, c)] is not.
(since C++23) An unparenthesized comma expression cannot be the second (right) argument of a subscript operator. For example, a[b, c] is either ill-formed or equivalent to a.operator[](b, c). Parentheses are needed to use a comma expression as the subscript, e.g., a[(b, c)].
Via: https://en.cppreference.com/w/cpp/language/operator_other
So yes, I think you should add parentheses inside the [ ] if you want the old (comma operator) behaviour.
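To see what the old behaviour actually was: before C++23, `a[x1, y1]` used the built-in comma operator and meant `a[y1]`. The parenthesized form keeps that meaning in every standard. This sketch uses a plain built-in array rather than the asker's class:

```cpp
#include <cassert>

// Pre-C++23, `a[x1, y1]` meant `a[(x1, y1)]`: evaluate x1, discard it,
// index with y1. Writing the parentheses explicitly preserves exactly
// that meaning under -std=c++23 as well.
int comma_subscript_demo() {
    int a[5] = {10, 11, 12, 13, 14};
    int x1 = 1, y1 = 3;
    return a[(x1, y1)];  // comma: x1 is evaluated and discarded, index is y1
}
```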

Is it correct to say that there is no implied ordering in the presentation of grammar options in the C++ Standard?

I'll try to explain my question with an example. Consider the following grammar production in the C++ standard:
literal:
   integer-literal
   character-literal
   floating-point-literal
   string-literal
   boolean-literal
   pointer-literal
   user-defined-literal
Once the parser identifies a literal as an integer-literal, I always thought that the parser would just stop there. But I was told that this is not true. The parser will continue parsing to verify whether the literal could also be matched with a user-defined-literal, for example.
Is this correct?
Edit
I decided to include this edit as my interpretation of the Standard, in response to @rici's excellent answer below, although with a result that is the opposite of the one advocated by the OP.
One can read the following in [stmt.ambig]/1 and /3 (emphases are mine):
[stmt.ambig]/1
There is an ambiguity in the grammar involving
expression-statements and declarations: An expression-statement with a
function-style explicit type conversion as its leftmost subexpression
can be indistinguishable from a declaration where the first declarator
starts with a (. In those cases the statement is a declaration.
That is, this paragraph states how ambiguities in the grammar should be treated. There are several other ambiguities mentioned in the C++ Standard, but only three that I know of relate to the grammar: [stmt.ambig]; [dcl.ambig.res]/1, a direct consequence of [stmt.ambig]; and [expr.unary.op]/10, which explicitly uses the term ambiguity in the grammar.
[stmt.ambig]/3:
The disambiguation is purely syntactic; that is, the meaning of the
names occurring in such a statement, beyond whether they are
type-names or not, is not generally used in or changed by the
disambiguation. Class templates are instantiated as necessary to
determine if a qualified name is a type-name. Disambiguation
precedes parsing, and a statement disambiguated as a declaration may
be an ill-formed declaration. If, during parsing, a name in a template
parameter is bound differently than it would be bound during a trial
parse, the program is ill-formed. No diagnostic is required. [ Note:
This can occur only when the name is declared earlier in the
declaration. — end note ]
Well, if disambiguation precedes parsing, there is nothing to prevent a decent compiler from optimizing parsing by simply treating the alternatives in each production of the grammar as ordered. With that in mind, the first sentence in [lex.ext]/1 below could be eliminated.
[lex.ext]/1:
If a token matches both user-defined-literal and another literal kind,
it is treated as the latter. [ Example: 123_km is a
user-defined-literal, but 12LL is an integer-literal. — end example ]
The syntactic non-terminal preceding the ud-suffix in a
user-defined-literal is taken to be the longest sequence of characters
that could match that non-terminal.
Note also that this paragraph doesn't mention ambiguity in the grammar, which for me at least, is an indication that the ambiguity doesn't exist.
There is no implicit ordering of productions in the C++ presentation grammar.
There are ambiguities in that grammar, which are dealt with on a case-by-case basis by text in the standard. Note that the text of the standard is normative; the grammar does not stand alone, and it does not override the text. The two need to be read together.
The standard itself points out that the grammar as summarized in Appendix A:
… is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (8.9, 9.2, 11.8) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs. (Appendix A, paragraph 1)
That's not a complete list of the ambiguities resolved in the text of the standard, because there are also rules about lexical ambiguities. (See below.)
Almost all of these ambiguity resolution clauses are of the form "if both P and Q apply, choose Q", and thus would be unnecessary were there an implicit ordering of grammar alternatives, since the correct parse could be guaranteed simply by putting the alternatives in the correct order. So the fact that the standard feels the need to dedicate a number of clauses to ambiguity resolution is prima facie evidence that alternatives are not implicitly ordered. [Note 1]
The C++ standard does not explicitly name the grammar formalism being used, but it does credit the antecedents which allows us to construct a historical argument. The formalism used by the C++ standard was inherited from the C standard and the description in Kernighan & Ritchie's original book on the (then newly-minted) C language. K&R wrote their grammar using the Yacc parser generator, and the original C grammar is basically a Yacc grammar file. Yacc uses the LALR(1) algorithm to construct a parser from a context-free grammar (CFG), and its grammar files are a concrete representation of that grammar written in what has come to be known as BNF (although there is some historical ambiguity about what the letters in BNF actually stand for). BNF does not have any implicit ordering of rules, and the formalism does not allow any way to write an explicit ordering or any other disambiguation rule. (A BNF grammar must be unambiguous in order to be mechanically parsed; if it is ambiguous, the LALR(1) algorithm will fail to generate a parser.)
Yacc does go a bit outside of the box. It has some automatic disambiguation rules, and one mechanism to provide explicit disambiguation (operator precedence). But Yacc's disambiguation has nothing to do with the ordering of alternatives either.
In short, ordered alternatives were not really a feature of any grammar formalism until 2002 when Bryan Ford proposed packrat parsing, and subsequently formalised a class of grammars which he called "Parsing Expression Grammars" (PEGs). The PEG algorithm does implicitly order alternatives, by insisting that the right-hand alternative in an alternation only be attempted if the left-hand alternative failed to match. For this reason, the PEG alternation operator (or "ordered alternation" operator) is generally written as / instead of |, avoiding confusion with the traditional unordered alternation syntax.
A key feature of the PEG algorithm is that it is always deterministic. Every PEG grammar can be deterministically applied to a source text without ambiguity. (That doesn't mean that the grammar will give you the parse you wanted, of course. It just means that it will never give you a list of parses and let you select the one you want.) So grammars written in PEG cannot be accompanied by textual rules which disambiguate, because there are no ambiguities.
I mention this because the existence and popularity of PEG have to some extent altered the perception of the meaning of the alternation operator. Before PEG, we probably wouldn't be having this kind of discussion at all. But using PEG as a guide to interpreting the C++ grammar formalism is ahistoric and unjustifiable; the roots of the C++ grammar go back to at least 1978, at least a quarter of a century before PEG.
Lexical ambiguities, and the clauses which resolve them
[lex.pptoken] (§5.4) paragraph 3 lays down the fundamental rules for token recognition, which is a little more complicated than the traditional "maximal munch" principle which always recognises the longest possible token starting immediately after the previously recognised token. It includes two exceptions:
The sequence <:: is treated as starting with the token < rather than the longer token <: unless it is the start of <::> (treated as <:, :>) or <::: (treated as <:, ::). That might all make more sense if you mentally replace <: with [ and :> with ], which is the intended syntactic equivalence.
A raw string literal is terminated by the first matching delimiter sequence. This rule could in theory be written in a context-free grammar only because there is an explicit limit on the length of termination sequences, which means that the theoretical CFG would have on the order of 88^16 rules, one for each possible delimiter sequence. In practice, this rule cannot be written as such, and it is described textually, along with the 16-character limit on the length of the d-char-sequence.
[lex.header] (§5.8) avoids the ambiguity between header-names and string-literals (as well as certain token sequences starting with <) by requiring header-name to only be recognised in certain contexts, including an #include preprocessing directive. (The section does not actually say that the string-literal should not be recognised, but I think that the implication is clear.)
[lex.ext] (§5.13.8) paragraph 1 resolves the ambiguities involved with user-defined-literals, by requiring that:
the user-defined-literal rule is only recognised if the token cannot be recognised as some other kind of literal, and
the decomposition of the user-defined-literal into a literal followed by a ud-suffix follows the longest-token rule, described above.
Note that this rule is not really a tokenisation rule, because it is applied after the source text has been divided into tokens. Tokenisation is done in translation phase 3, after which the tokens are passed through the preprocessing directives (phase 4), rewriting of escape sequences and UCNs (phase 5), and concatenation of string literals (phase 6). Each token which emerges from phase 6 must then be reinterpreted as a token in the syntactic grammar, and it is at that point that literal tokens will be classified. So it's not necessary that §5.13.8 clarify what the extent of the token being categorised is; the extent is already known and the converted token must use exactly all of the characters in the preprocessing token. Thus it's quite different from the other ambiguities in this list, but I left it here because it is so present in the original question and in the various comment threads.
Notes:
Curiously, in almost all of the ambiguity resolution clauses, the preferred alternative is the one which appears later in the list of alternatives. For example, §8.9 explicitly prefers declarations to expressions, but the grammar for statement lists expression-statement long before declaration-statement. Having said that, correctly parsing C++ requires a more sophisticated algorithm than just "try to parse a declaration and if that fails, then try to parse as an expression," because there are programs which must be parsed as a declaration with a syntax error (see the example at [stmt.ambig]/3).
No ordering is either implied or necessary.
All seven kinds of literal are distinct. No token that meets the definition of any of them can meet the definition of any other. For example, 42 is an integer-literal and cannot be a floating-point-literal.
How a compiler determines what a token is is an implementation detail that the standard doesn't address, and doesn't need to.
If there were an ambiguity, so that for example the same token could be either an integer-literal or a user-defined-literal, either the language would have to have a rule to disambiguate it, or it would be a bug in the grammar.
UPDATE: There is in fact such an ambiguity. As discussed in comments, 42ULL satisfies the syntax of either an integer-literal or a user-defined-literal. This ambiguity is resolved, not by the ordering of the grammar productions, but by an explicit statement:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
The section on syntactic notation in the standard only says this about what it means:
In the syntax notation used in this document, syntactic categories are indicated by italic type, and literal words and characters in constant width type. Alternatives are listed on separate lines except in a few cases where a long set of alternatives is marked by the phrase “one of”. If the text of an alternative is too long to fit on a line, the text is continued on subsequent lines indented from the first one. An optional terminal or non-terminal symbol is indicated by the subscript “opt”, so
{ expressionopt }
indicates an optional expression enclosed in braces.
Note that the statement considers the terms in grammars to be "alternatives", rather than a list or even an ordered list. There is no statement about ordering of the "alternatives" at all.
As such, this strongly suggests that there is no ordering at all.
Indeed, the presence throughout the standard of specific rules to disambiguate cases where multiple terms match also suggests that the alternatives are not written as a prioritized list. If the alternatives were some kind of ordered list, this statement would be redundant:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
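That rule can be seen in action. In the sketch below, `_km` is an illustrative suffix (program-visible UDL suffixes must begin with an underscore); `12LL` also matches the user-defined-literal grammar, but the quoted rule classifies it as an integer-literal:

```cpp
// 123_km matches only user-defined-literal, so the suffix function runs.
// 12LL matches both user-defined-literal and integer-literal; [lex.ext]/1
// says it is treated as the latter, so no suffix lookup happens.
constexpr unsigned long long operator""_km(unsigned long long km) {
    return km * 1000;  // kilometres to metres, purely for illustration
}

static_assert(123_km == 123000, "the _km suffix function was applied");
static_assert(12LL == 12, "plain integer-literal; no UDL machinery involved");
```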

Operator precedence, inconsistent documentations

I'm refreshing my memory about operator precedence, because I try to be a smart guy and avoid parentheses as much as possible. I'm reading the following two links:
cpp reference
MS docs
One problem I have is that those two "reliable" docs are not telling the same thing, so I no longer know which to trust.
For one example, Cppreference says throw keyword is in same group as the conditional operator. Microsoft's docs say the conditional operator is higher than throw. There are other differences.
Which site is correct, or are both sites wrong in different ways?
TL;DR: The Microsoft docs can be interpreted to be less correct, depending on how you look at them.
The first thing you have to understand is that C++ as a language does not have "operator precedence" rules. C++ has a grammar; it is the grammar that defines what a particular piece of C++ syntax means. It is the C++ grammar that tells you that 5 + 3 * 4 should be considered equivalent to 5 + (3 * 4) rather than (5 + 3) * 4.
Therefore, any "operator precedence" rules that you see are merely a textual, human-readable explanation of the C++ grammar around expression parsing. As such, one can imagine that two different ways of describing the behavior of the same grammar could exist.
Consider the specific example of throw vs. the ?: operator. The Microsoft site says that ?: has higher precedence than throw, while the Cppreference site says that they have the same precedence.
First, let's look at a hypothetical C++ expression:
throw val ? one : two
By Microsoft's rules, the ?: operator has higher precedence, so would be parsed as throw (val ? one : two). By Cppreference's rules, the two operators have equal precedence. However, since they have right-to-left associativity, the ?: gets first dibs on the sub-expressions. So we have throw (val ? one : two).
So both of them resolve to the same result.
But what does the C++ grammar say? Well, here's a relevant fragment of the grammar:
throw-expression:
throw assignment-expression(opt)
assignment-expression:
conditional-expression
logical-or-expression assignment-operator initializer-clause
throw-expression
This is parsed as a throw-expression, which contains an assignment-expression, which contains a conditional-expression, which is where our ?: lies. In short, the parser parses it as throw (val ? one : two).
So both pages are the same, and both of them are correct.
Now consider:
val ? throw one : two
How does this get parsed? Well, the thing to remember is that ?: is a ternary operator; unlike most others, it has three terms. That is, the conditional-expression itself is not finished being specified until the : <something> gets parsed.
So the precedence of throw vs ?: is irrelevant in this case. The throw one is within the ternary operator because the expression is literally within the ternary operator. The two operators are not competing.
Lastly, how about:
val ? one : throw two
Microsoft gives ?: higher precedence. By Microsoft's documentation, precedence "specifies the order of operations in expressions that contain more than one operator". So the ?: happens first.
Here's the rub though. throw by itself is actually a grammatically legal expression (it's only valid C++ within a catch clause, but the grammar is legal everywhere). As such, val ? one : throw could be a legitimate expression, which is what the Microsoft docs' rules would appear to say.
Of course, (val ? one : throw) two is not a legitimate expression, because () two isn't legal C++ grammar. So one could interpret Microsoft's rules to say that this should be a compile error.
But it's not. C++'s grammar states:
conditional-expression:
logical-or-expression
logical-or-expression ? expression : assignment-expression
throw two is the full assignment-expression used as the third operand of the given expression. So this should be parsed as val ? one : (throw two).
And what of Cppreference? Well, by giving them right-to-left associativity, the throw two is grouped with itself. So it should be considered val ? one : (throw two).
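A quick check of that last parse: the third operand of ?: is a full assignment-expression, so a throw-expression is allowed there, and the throw fires only when the condition selects it (the function name is illustrative):

```cpp
#include <stdexcept>

// Parsed as `val ? 1 : (throw std::runtime_error(...))`. The conditional
// expression's type is int, taken from the non-throwing operand; the
// throw-expression operand contributes no type of its own.
int one_or_throw(bool val) {
    return val ? 1 : throw std::runtime_error("no value");
}
```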

C++14 standard Annex A interpretation

What are "superset of valid C++ constructs" from Annex A ?
Also, any guide which will help you read this grammar in Annex A ?
Annex A quote (do not block-quote the following, as it messes up the angle brackets):
This summary of C++ syntax is intended to be an aid to comprehension. It is not an exact statement
of the language. In particular, the grammar described here accepts a superset of valid C++ constructs.
Disambiguation rules (6.8, 7.1, 10.2) must be applied to distinguish expressions from declarations. Further,
access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless
constructs.
Here is one short example that is valid according to the grammar, but not according to the full language rules:
int a[];
struct s;
void main(foo bar)
{
return (sizeof a) + sizeof (s);
}
The primary issue is that the grammar is expressed using context-free productions, but C++ syntactic parse is highly contextual.
If S is a set of elements, a superset is another set X such that each element s in S is also an element of X, but there may be elements x in X that are not elements of S.
As an example, {1,2,3} is a set of 3 numbers. {1,2,3,4} is a superset of the first set -- it contains the elements in {1,2,3}, but also an extra element 4.
So the grammar listed in Annex A will match C++, but will also match things that are not valid C++.
It then goes on to list some issues you have to solve "outside of the grammar" -- the disambiguation rules, access control, ambiguity, and type rules.
The quote implies, lightly, that this is a complete set of things you must consider to distinguish valid C++ from things matched by the grammar, but does not explicitly say so. I am uncertain if this light implication is actually intended or not.

Accessing arrays by index[array] in C and C++

There is this little trick question that some interviewers like to ask for whatever reason:
int arr[] = {1, 2, 3};
2[arr] = 5; // does this line compile?
assert(arr[2] == 5); // does this assertion fail?
From what I can understand, a[b] gets converted to *(a + b), and since addition is commutative, their order doesn't really matter, so 2[a] is really *(2 + a) and that works fine.
Is this guaranteed to work by C and/or C++'s specs?
Yes. 6.5.2.1 paragraph 1 (C99 standard) describes the arguments to the [] operator:
One of the expressions shall have type "pointer to object type", the other expression shall have integer type, and the result has type "type".
6.5.2.1 paragraph 2 (emphasis added):
A postfix expression followed by an expression in square brackets [] is a subscripted
designation of an element of an array object. The definition of the subscript operator []
is that E1[E2] is identical to (*((E1)+(E2))). Because of the conversion rules that
apply to the binary + operator, if E1 is an array object (equivalently, a pointer to the
initial element of an array object) and E2 is an integer, E1[E2] designates the E2-th
element of E1 (counting from zero).
It says nothing requiring the order of the arguments to [] to be sane.
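The identity is easy to check directly (this uses a plain built-in array, so no operator overloading is in play):

```cpp
// E1[E2] is defined as *((E1)+(E2)), and the built-in + is commutative,
// so 2[arr] and arr[2] designate the same element.
int index_swap_demo() {
    int arr[] = {1, 2, 3};
    2[arr] = 5;              // exactly *(2 + arr) = 5
    return arr[2];           // same object: now 5
}
```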
In general, 2[a] is identical to a[2], and this is guaranteed to be equivalent in both C and C++ (assuming no operator overloading), because, as you mentioned, each translates into *(2+a) or *(a+2), respectively. Because the plus operator is commutative, the two forms are equivalent.
Although the forms are equivalent, please for the sake of all that's holy (and future maintenance programmers), prefer the "a[2]" form over the other.
P.S., If you do get asked this at an interview, please do exact revenge on behalf of the C/C++ community and make sure that you ask the interviewer to list all trigraph sequences as a precondition to you giving your answer. Perhaps this will disenchant him/her from asking such (worthless, with regard to actually programming anything) questions in the future. In the odd event that the interviewer actually knows all nine of the trigraph sequences, you can always make another attempt to stomp them with a question about the destruction order of virtual base classes - a question that is just as mind bogglingly irrelevant for everyday programming.