Why is +++x split into ++(+x) instead of +(++x) in C++?

When I type the code below
int x = 1;
+++x;
it is parsed as ++(+x), which is of course ill-formed because ++ is applied to an rvalue.
I am curious why it cannot be parsed as +(++x), which would be correct code.
Does this depend on the IDE or the compiler?
Can this be found in the C++ Standard, or is it just undefined behaviour?
Thanks a lot for answering this question, and forgive my poor English.

From C++20 (draft N4860) [lex.pptoken]/3.3
— Otherwise, the next preprocessing token is the longest sequence of characters that could constitute
a preprocessing token, even if that would cause further lexical analysis to fail, ...
and [lex.pptoken]/6
[Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types,
violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct
expression. —end example]
So it is a rule of the language: the first two + characters are grouped into a ++ token first, which leaves the remaining + attached to the variable.
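As a small illustration of the rule (a sketch only; the function names are made up for this example), adding whitespace is what lets you choose the other tokenization:

```cpp
#include <cassert>

// Whitespace separates the tokens, so this is + (++x).
int plus_of_preincrement() {
    int x = 1;
    int r = + ++x;      // ++x makes x equal to 2; unary + leaves the value unchanged
    // int bad = +++x;  // lexed as ++ + x, i.e. ++(+x): ill-formed, because +x is not an lvalue
    return r;
}

// Without whitespace the lexer takes the longest token first: x ++ + y.
int postincrement_then_add() {
    int x = 1;
    int y = 10;
    int r = x+++y;      // (x++) + y, so 1 + 10; x is 2 afterwards
    assert(x == 2);
    return r;
}
```

Calling these from main should give 2 and 11 respectively.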
Funnily, this reminds me of an old problem: std::vector<std::vector<int>> a used to fail to compile because >> would be lexed as one token instead of two (since a token is the longest possible sequence of characters). This is addressed by [temp.names]/3:
When a name is considered to be a template-name, and it is followed by a <, the < is always taken as the
delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list,
the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly,
the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the
end of the template-argument-list and completes the template-id. [Note: The second > token produced by this
replacement rule may terminate an enclosing template-id construct or it may be part of a different construct
(e.g., a cast). —end note]
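A minimal sketch of both uses of >>, assuming a C++11 or later compiler (the function names are invented for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inside a template-argument-list, >> is split into two closing > tokens.
std::size_t nested_vector_rows() {
    std::vector<std::vector<int>> a(3, std::vector<int>(2, 0));
    assert(a[0].size() == 2);
    return a.size();
}

// Outside a template-argument-list, >> is still the right-shift operator.
int shift_right_by_four(int value) {
    return value >> 4;
}
```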

This is a consequence of the maximum munch tokenization principle:
A C++ implementation must collect as many consecutive characters as possible into a token.
From lex.pptoken#3.3:
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name is only formed within a #include directive.
And since ++ is the longest valid token, the lexer forms it first, so the expression is treated as if it were written ++ +x.
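To make the principle concrete, here is a toy lexer (purely an illustration, not how any real compiler is written). It assumes the only tokens are ++, + and single-character identifiers, and the name munch is made up:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy maximal-munch lexer: at each position, prefer the two-character
// token "++" over any one-character token. Spaces only separate tokens.
std::vector<std::string> munch(const std::string& source) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        if (source[i] == ' ') {
            ++i;                                   // whitespace is skipped, never emitted
        } else if (source.compare(i, 2, "++") == 0) {
            tokens.push_back("++");                // longest candidate wins
            i += 2;
        } else {
            tokens.push_back(std::string(1, source[i]));
            ++i;
        }
    }
    return tokens;
}
```

With this sketch, munch("+++x") yields ++, +, x, while munch("+ ++x") yields +, ++, x, and munch("x+++++y") reproduces the x ++ ++ + y split from the standard's example.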

Why does C++ allow a declaration with no space between the type and a parenthesized variable name?

A previous C++ question asked why int (x) = 0; is allowed. However, I noticed that even int(x) = 0; is allowed, i.e. without a space before the (x). I find the latter quite strange, because it causes things like this:
using Oit = std::ostream_iterator<int>;
Oit bar(std::cout);
*bar = 6; // * is optional
*Oit(bar) = 7; // * is NOT optional!
where the final line is because omitting the * makes the compiler think we are declaring bar again and initializing to 7.
Am I interpreting this correctly, that int(x) = 0; is indeed equivalent to int x = 0, and Oit(bar) = 7; is indeed equivalent to Oit bar = 7;? If yes, why specifically does C++ allow omitting the space before the parentheses in such a declaration + initialization?
(my guess is that the C++ compiler does not care about any space before a left paren, since it treats that parenthesized expression as its own "token" [excuse me if I'm butchering the terminology], i.e. in all cases, qux(baz) is equivalent to qux (baz))
It is allowed in C++ because it is allowed in C and requiring the space would be an unnecessary C-compatibility breaking change. Even setting that aside, it would be surprising to have int (x) and int(x) behave differently, since generally (with few minor exceptions) C++ is agnostic to additional white-space as long as tokens are properly separated. And ( (outside a string/character literal) is always a token on its own. It can't be part of a token starting with int(.
In C int(x) has no other potential meaning for which it could be confused, so there is no reason to require white-space separation at all. C also is generally agnostic to white-space, so it would be surprising there as well to have different behavior with and without it.
One requirement when defining the syntax of a language is that elements of the language can be separated. According to the C++ syntax rules, a space separates things. But also according to the C++ syntax rules, parentheses also separate things.
When C++ is compiled, the first step is parsing, and one of the first steps of parsing is separating all the elements of the language. This step is often called tokenizing or lexing. But this is just the technical background; the user does not have to know it. The user only has to know that things in C++ must be clearly separated from each other, so that there is a sequence "*", "Oit", "(", "bar", ")", "=", "7", ";".
As explained, the rule that the parenthesis always separates is established on a very low level of the compiler. The compiler determines even before knowing what the purpose of the parenthesis is, that a parenthesis separates things. And therefore an extra space would be redundant.
If you ever use parser generators, you will see that most of them just ignore spaces. That means that once the lexer has produced the list of tokens, the spaces no longer exist (see the list above: it contains no spaces). So you have no chance to specify something that explicitly requires a space.
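A short sketch of the point, using int instead of the iterator type from the question (the function names are hypothetical):

```cpp
#include <cassert>

// Both statements below are declarations; the space before ( makes no difference
// because ( is always a token of its own.
int parenthesized_declarators() {
    int(x) = 5;    // same as: int x = 5;
    int (y) = 6;   // same as: int y = 6;
    return x + y;
}

// Where a declaration cannot start, int(...) is a function-style cast.
int function_style_cast(double d) {
    return 1 + int(d);   // converts d to int, truncating toward zero
}
```

Some compilers may warn about the redundant parentheses in the declarations, but the code is well-formed.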

Is it correct to say that there is no implied ordering in the presentation of grammar options in the C++ Standard?

I'll try to explain my question with an example. Consider the following grammar production in the C++ standard:
literal:
   integer-literal
   character-literal
   floating-point-literal
   string-literal
   boolean-literal
   pointer-literal
   user-defined-literal
Once the parser identifies a literal as an integer-literal, I always thought that the parser would just stop there. But I was told that this is not true. The parser will continue parsing to verify whether the literal could also be matched with a user-defined-literal, for example.
Is this correct?
Edit
I decided to include this edit as my interpretation of the Standard, in response to @rici's excellent answer below, although with a result that is the opposite of the one advocated by the OP.
One can read the following in [stmt.ambig]/1 and /3 (emphases are mine):
[stmt.ambig]/1
There is an ambiguity in the grammar involving
expression-statements and declarations: An expression-statement with a
function-style explicit type conversion as its leftmost subexpression
can be indistinguishable from a declaration where the first declarator
starts with a (. In those cases the statement is a declaration.
That is, this paragraph states how ambiguities in the grammar should be treated. There are several other ambiguities mentioned in the C++ Standard, but I know of only three that are related to the grammar: [stmt.ambig]; [dcl.ambig.res]/1, a direct consequence of [stmt.ambig]; and [expr.unary.op]/10, which explicitly uses the term ambiguity in the grammar.
[stmt.ambig]/3:
The disambiguation is purely syntactic; that is, the meaning of the
names occurring in such a statement, beyond whether they are
type-names or not, is not generally used in or changed by the
disambiguation. Class templates are instantiated as necessary to
determine if a qualified name is a type-name. Disambiguation
precedes parsing, and a statement disambiguated as a declaration may
be an ill-formed declaration. If, during parsing, a name in a template
parameter is bound differently than it would be bound during a trial
parse, the program is ill-formed. No diagnostic is required. [ Note:
This can occur only when the name is declared earlier in the
declaration. — end note ]
Well, if disambiguation precedes parsing, there is nothing that could prevent a decent compiler from optimizing parsing by just considering that the alternatives present in each definition of the grammar are indeed ordered. With that in mind, the first sentence in [lex.ext]/1 below could be eliminated.
[lex.ext]/1:
If a token matches both user-defined-literal and another literal kind,
it is treated as the latter. [ Example: 123_km is a
user-defined-literal, but 12LL is an integer-literal. — end example ]
The syntactic non-terminal preceding the ud-suffix in a
user-defined-literal is taken to be the longest sequence of characters
that could match that non-terminal.
Note also that this paragraph doesn't mention ambiguity in the grammar, which for me at least, is an indication that the ambiguity doesn't exist.
There is no implicit ordering of productions in the C++ presentation grammar.
There are ambiguities in that grammar, which are dealt with on a case-by-case basis by text in the standard. Note that the text of the standard is normative; the grammar does not stand alone, and it does not override the text. The two need to be read together.
The standard itself points out that the grammar as summarised in Appendix A:
… is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (8.9, 9.2, 11.8) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs. (Appendix A, paragraph 1)
That's not a complete list of the ambiguities resolved in the text of the standard, because there are also rules about lexical ambiguities. (See below.)
Almost all of these ambiguity resolution clauses are of the form "if both P and Q applies, choose Q", and thus would be unnecessary were there an implicit ordering of grammar alternatives, since the correct parse could be guaranteed simply by putting the alternatives in the correct order. So the fact that the standard feels the need to dedicate a number of clauses to ambiguity resolution is prima facie evidence that alternatives are not implicitly ordered. [Note 1]
The C++ standard does not explicitly name the grammar formalism being used, but it does credit the antecedents which allows us to construct a historical argument. The formalism used by the C++ standard was inherited from the C standard and the description in Kernighan & Ritchie's original book on the (then newly-minted) C language. K&R wrote their grammar using the Yacc parser generator, and the original C grammar is basically a Yacc grammar file. Yacc uses the LALR(1) algorithm to construct a parser from a context-free grammar (CFG), and its grammar files are a concrete representation of that grammar written in what has come to be known as BNF (although there is some historical ambiguity about what the letters in BNF actually stand for). BNF does not have any implicit ordering of rules, and the formalism does not allow any way to write an explicit ordering or any other disambiguation rule. (A BNF grammar must be unambiguous in order to be mechanically parsed; if it is ambiguous, the LALR(1) algorithm will fail to generate a parser.)
Yacc does go a bit outside of the box. It has some automatic disambiguation rules, and one mechanism to provide explicit disambiguation (operator precedence). But Yacc's disambiguation has nothing to do with the ordering of alternatives either.
In short, ordered alternatives were not really a feature of any grammar formalism until 2002 when Bryan Ford proposed packrat parsing, and subsequently formalised a class of grammars which he called "Parsing Expression Grammars" (PEGs). The PEG algorithm does implicitly order alternatives, by insisting that the right-hand alternative in an alternation only be attempted if the left-hand alternative failed to match. For this reason, the PEG alternation operator (or "ordered alternation" operator) is generally written as / instead of |, avoiding confusion with the traditional unordered alternation syntax.
A key feature of the PEG algorithm is that it is always deterministic. Every PEG grammar can be deterministically applied to a source text without ambiguity. (That doesn't mean that the grammar will give you the parse you wanted, of course. It just means that it will never give you a list of parses and let you select the one you want.) So grammars written in PEG cannot be accompanied by textual rules which disambiguate, because there are no ambiguities.
I mention this because the existence and popularity of PEG have to some extent altered the perception of the meaning of the alternation operator. Before PEG, we probably wouldn't be having this kind of discussion at all. But using PEG as a guide to interpreting the C++ grammar formalism is ahistoric and unjustifiable; the roots of the C++ grammar go back to at least 1978, at least a quarter of a century before PEG.
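The difference between ordered choice and an order-independent rule can be sketched as follows. This is a deliberately tiny model that only handles alternatives that are literal strings; it is not a PEG engine, and all names in it are invented for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const std::size_t no_match = static_cast<std::size_t>(-1);

// PEG-style ordered choice: the first alternative that matches a prefix wins,
// so the result depends on the order in which the alternatives are listed.
std::size_t ordered_choice(const std::string& input,
                           const std::vector<std::string>& alternatives) {
    for (const std::string& alt : alternatives) {
        if (input.compare(0, alt.size(), alt) == 0) {
            return alt.size();
        }
    }
    return no_match;
}

// Order-independent rule: the longest matching alternative wins,
// no matter where it appears in the list.
std::size_t longest_match(const std::string& input,
                          const std::vector<std::string>& alternatives) {
    std::size_t best = no_match;
    for (const std::string& alt : alternatives) {
        if (input.compare(0, alt.size(), alt) == 0 &&
            (best == no_match || alt.size() > best)) {
            best = alt.size();
        }
    }
    return best;
}
```

For the input "ab", ordered_choice consumes 1 character with the alternatives {"a", "ab"} but 2 with {"ab", "a"}, whereas longest_match consumes 2 in both cases.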
Lexical ambiguities, and the clauses which resolve them
[lex.pptoken] (§5.4) paragraph 3 lays down the fundamental rules for token recognition, which are a little more complicated than the traditional "maximal munch" principle, which always recognises the longest possible token starting immediately after the previously recognised token. It includes two exceptions:
The sequence <:: is treated as starting with the token < rather than the longer token <: unless it is the start of <::> (treated as <:, :>) or <::: (treated as <:, ::). That might all make more sense if you mentally replace <: with [ and :> with ], which is the intended syntactic equivalence.
A raw string literal is terminated by the first matching delimiter sequence. This rule could in theory be written in a context-free grammar only because there is an explicit limit on the length of termination sequences, which means that the theoretical CFG would have on the order of 88^16 rules, one for each possible delimiter sequence. In practice, this rule cannot be written as such, and it is described textually, along with the 16-character limit on the length of the d-char-sequence.
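Both exceptions can be seen in a few lines, assuming a C++11 or later compiler (the namespace demo and the function names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace demo { struct Item { int value; }; }

// "<::" is lexed here as "<" followed by "::", not as the digraph "<:"
// (an alternative spelling of "[") followed by ":".
std::size_t count_items() {
    std::vector<::demo::Item> items(2);
    return items.size();
}

// The delimiter is xy, so the literal ends at the first )xy" and the
// earlier )" is ordinary content. The resulting string is: a)"b
std::string raw_string_body() {
    return R"xy(a)"b)xy";
}
```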
[lex.header] (§5.8) avoids the ambiguity between header-names and string-literals (as well as certain token sequences starting with <) by requiring header-name to only be recognised in certain contexts, including an #include preprocessing directive. (The section does not actually say that the string-literal should not be recognised, but I think that the implication is clear.)
[lex.ext] (§5.13.8) paragraph 1 resolves the ambiguities involved with user-defined-literals, by requiring that:
the user-defined-literal rule is only recognised if the token cannot be recognised as some other kind of literal, and
the decomposition of the user-defined-literal into a literal followed by a ud-suffix follows the longest-token rule, described above.
Note that this rule is not really a tokenisation rule, because it is applied after the source text has been divided into tokens. Tokenisation is done in translation phase 3, after which the tokens are passed through the preprocessing directives (phase 4), rewriting of escape sequences and UCNs (phase 5), and concatenation of string literals (phase 6). Each token which emerges from phase 6 must then be reinterpreted as a token in the syntactic grammar, and it is at that point that literal tokens will be classified. So it's not necessary that §5.13.8 clarify what the extent of the token being categorised is; the extent is already known and the converted token must use exactly all of the characters in the preprocessing token. Thus it's quite different from the other ambiguities in this list, but I left it here because it is so present in the original question and in the various comment threads.
Notes:
Curiously, in almost all of the ambiguity resolution clauses, the preferred alternative is the one which appears later in the list of alternatives. For example, §8.9 explicitly prefers declarations to expressions, but the grammar for statement lists expression-statement long before declaration-statement. Having said that, correctly parsing C++ requires a more sophisticated algorithm than just "try to parse a declaration and if that fails, then try to parse as an expression," because there are programs which must be parsed as a declaration with a syntax error (see the example at [stmt.ambig]/3).
No ordering is either implied or necessary.
All seven kinds of literal are distinct. No token that meets the definition of any of them can meet the definition of any other. For example, 42 is an integer-literal and cannot be a floating-point-literal.
How a compiler determines what a token is is an implementation detail that the standard doesn't address, and doesn't need to.
If there were an ambiguity, so that for example the same token could be either an integer-literal or a user-defined-literal, either the language would have to have a rule to disambiguate it, or it would be a bug in the grammar.
UPDATE: There is in fact such an ambiguity. As discussed in comments, 42ULL satisfies the syntax of either an integer-literal or a user-defined-literal. This ambiguity is resolved, not by the ordering of the grammar productions, but by an explicit statement:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
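A compile-time check of that rule might look like the following sketch. The _km suffix mirrors the standard's example, but its definition here (kilometres to metres) is an assumption made purely for illustration:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical literal operator: converts kilometres to metres.
constexpr unsigned long long operator""_km(unsigned long long kilometres) {
    return kilometres * 1000ULL;
}

// 42ULL and 12LL match integer-literal, so they are never treated as
// user-defined literals, even though ULL and LL look like suffixes.
static_assert(std::is_same<decltype(42ULL), unsigned long long>::value,
              "42ULL is an integer-literal");
static_assert(std::is_same<decltype(12LL), long long>::value,
              "12LL is an integer-literal");

// 123_km matches only user-defined-literal, so operator""_km is called.
unsigned long long metres_in_123_km() {
    return 123_km;
}

void check_literals() {
    assert(metres_in_123_km() == 123000ULL);
}
```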
The section on syntactic notation in the standard only says this about what it means:
In the syntax notation used in this document, syntactic categories are indicated by italic type, and literal words and characters in constant width type. Alternatives are listed on separate lines except in a few cases where a long set of alternatives is marked by the phrase “one of”. If the text of an alternative is too long to fit on a line, the text is continued on subsequent lines indented from the first one. An optional terminal or non-terminal symbol is indicated by the subscript “opt”, so
{ expression_opt }
indicates an optional expression enclosed in braces.
Note that the statement considers the terms in grammars to be "alternatives", rather than a list or even an ordered list. There is no statement about ordering of the "alternatives" at all.
As such, this strongly suggests that there is no ordering at all.
Indeed, the presence throughout the standard of specific rules to disambiguate cases where multiple terms match also suggests that the alternatives are not written as a prioritized list. If the alternatives were some kind of ordered list, this statement would be redundant:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.

C++ Array Definition with Lower and Upper Bound?

My daughter's 12th standard C++ textbook says that
the notation of an array can (also) be given as follows: Array name
[lower bound L, upper bound U]
This was a surprise for me. I know Pascal has this notation, but C++? Had never seen this earlier. I wrote a quick program in her prescribed compiler (the ancient Turbo C++ 4.5), and that does not support it. Did not find this syntax in Stanley Lippman's book either. Internet search did not throw up this. Or maybe I didn't search correctly?
So, is it a valid declaration?
This is not valid. According to section 8.3.4 Arrays of the draft C++ standard, the declaration must be of this form:
D1 [ constant-expression_opt ] attribute-specifier-seq_opt
and we can see from section 5.19 Constant expressions that the grammar for constant-expression is:
constant-expression:
conditional-expression
This grammar does not allow us to get to the comma operator either to do something like this:
int a[ 1, 2 ] ;
        ^
as others have implied, since there is no path to the comma operator from conditional-expression. However, if you add parentheses we can get to the comma operator, since conditional-expression allows us to get to primary-expression, which gets us (), so the following would be valid:
int a[ (1, 2) ] ;
       ^    ^
Note, in C++03 you were explicitly forbidden from using the comma operator in a constant expression.
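As a quick check of the parenthesized form, here is a sketch that assumes C++11 or later, where the comma operator is permitted in a constant expression (element_count is a made-up helper):

```cpp
#include <cassert>
#include <cstddef>

// int a[ 1, 2 ];   // ill-formed: an unparenthesized comma cannot appear in the array bound

int a[ (1, 2) ];     // the parenthesized comma expression has the value 2, so a has 2 elements

std::size_t element_count() {
    return sizeof(a) / sizeof(a[0]);
}
```

A compiler will probably warn that the left operand of the comma has no effect, which is a fair comment on the style. In any case this declares a plain array of two elements, not an array with a lower and an upper bound.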
No it's not true, unless someone has overloaded the comma operator and possibly [] as well which is very unlikely. (Boost Spirit does both but for very different reasons).
Without any overloading at all, Array[x, y] is syntactically invalid since the size must be a constant-expression and these cannot contain an unparenthesized comma operator; to do so would make it an expression rather than a conditional-expression.
Burn the book and put Stroustrup in her Christmas stocking!

What's the difference between the comma operator and the comma separator?

In C++, the comma token (i.e., ,) is either interpreted as a comma operator or as a comma separator.
However, while searching the web I realized that it's not quite clear in which cases the , token is interpreted as the binary comma operator and in which it is interpreted as a separator between statements.
Moreover, when multiple statements/expressions on one line are separated by , (e.g., a = 1, b = 2, c = 3;), it is unclear in which order they are evaluated.
Questions:
In which cases is a comma , token interpreted as an operator, and in which as a separator?
When one line contains multiple statements/expressions separated by commas, what is the order of evaluation in the comma-operator case and in the comma-separator case?
When a separator is appropriate -- in arguments to a function call or macro, or separating values in an initializer list (thanks for the reminder, @haccks) -- comma will be taken as a separator. In other expressions, it is taken as an operator. For example,
my_function(a,b,c,d);
is a call passing four arguments to a function, whereas
result=(a,b,c,d);
will be understood as the comma operator. It is possible, though ugly, to intermix the two by writing something like
my_function(a,(b,c),d);
The comma operator is normally evaluated left-to-right.
The original use of this operation in C was to allow a macro to perform several operations before returning a value. Since a macro instantiation looks like a function call, users generally expect it to be usable anywhere a function call could be; having the macro expand to multiple statements would defeat that. Hence, C introduced the , operator to permit chaining several expressions together into a single expression while discarding the results of all but the last.
As @haccks pointed out, the exact rules for how the compiler determines which meaning of , was intended come out of the language grammar, and have previously been discussed at How does the compiler know that the comma in a function call is not a comma operator?
You cannot use comma to separate statements. The , in a = 1, b = 2; is the comma operator, whose arguments are two assignment expressions. The order of evaluation of the arguments of the comma operator is left-to-right, so it's clear what the evaluation order is in that case.
In the context of the arguments to a function-call, those arguments cannot be comma-expressions, so the top-level commas must be syntactic (i.e. separating arguments). In that case, the evaluation order is not specified. (Of course, the arguments might be parenthesized expressions, and the parenthesized expression might be a comma expression.)
This is expressed clearly in the grammar in the C++ standard. The relevant productions are expression, which can be:
assignment-expression
or
expression , assignment-expression
and expression-list, which is the same as an initializer-list, which is a ,-separated list of initializer-clause, where an initializer-clause is either:
assignment-expression
or
braced-init-list
The , in the second expression production is the comma-operator.
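The two roles can be seen side by side in a short sketch (sum_of_three and the other function names are hypothetical helpers, not anything from the standard):

```cpp
#include <cassert>

int sum_of_three(int x, int y, int z) {
    return x + y + z;
}

// Inside parentheses the commas are operators: operands are evaluated
// left to right and the value of the last one is the result.
int comma_operator_value() {
    int a = 1, b = 2, c = 3, d = 4;   // these commas separate declarators
    int result = (a, b, c, d);        // comma operator: yields d
    return result;
}

// Two separators and one operator: the arguments are a, (b, c) and d.
int mixed_call() {
    int a = 1, b = 2, c = 3, d = 4;
    return sum_of_three(a, (b, c), d);   // 1 + 3 + 4
}
```

Compilers typically warn about the discarded operands, which is harmless here.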

Explain the difference:

int i=1,2,3,4; // Compile error
// The value of i is 1
int i = (1,2,3,4,5);
// The value of i is 5
What is the difference between these definitions of i in C and how do they work?
Edit: The first one is a compiler error. How does the second work?
= takes precedence over , [1]. So the first statement is a declaration and initialisation of i:
int i = 1;
… followed by lots of comma-separated expressions that do nothing.
The second code, on the other hand, consists of one declaration followed by one initialisation expression (the parentheses take precedence so the respective precedence of , and = are no longer relevant).
Then again, that’s purely academic since the first code is valid in neither C nor C++. I don’t know which compiler you’re using that it accepts this code. Mine (rightly) complains
error: expected unqualified-id before numeric constant
[1] Precedence rules in C++ apply regardless of how an operator is used. = and , in the code of OP do not refer to operator= or operator,. Nevertheless, they are operators as far as C++ is concerned (§2.13 of the standard), and the precedence of the tokens = and , does not depend on their usage – it so happens that , always has a lower precedence than =, regardless of semantics.
You have run into an interesting edge case of the comma operator (,).
Basically, it evaluates its left operand, discards the result, and yields the value of its right operand.
The problem with the first line of code is operator precedence. Because the = operator has greater precedence than the , operator, you get the result of the first statement in the comma chain (1).
Correction (thanks @jrok!): the first line of code neither compiles nor uses the comma as an operator; the comma there acts as a separator between declarators, which allows you to define multiple variables of the same type at a time.
In the second one, all of the first values are discarded and you are given the final result in the chain of items (5).
Not sure about C++, but at least for C the first one is invalid syntax so you can't really talk about a declaration since it doesn't compile. The second one is just the comma operator misused, with the result 5.
So, bluntly, the difference is that the first isn't C while the second is.
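For completeness, a sketch of the variants that do compile (function names invented for the example). The middle one shows where the precedence of = over , actually matters, namely in an expression statement rather than in a declaration:

```cpp
#include <cassert>

// Parenthesized initializer: the comma operator yields its last operand.
int last_of_chain() {
    int i = (1, 2, 3, 4, 5);
    return i;
}

// Expression statement: parsed as (i = 1), 2, 3 because = binds tighter than ,
int assignment_binds_tighter() {
    int i = 0;
    i = 1, 2, 3;
    return i;
}

// In a declaration the comma separates declarators, so each one needs a name.
int two_declarators() {
    int i = 1, j = 2;
    return i + j;
}
```

These return 5, 1 and 3 respectively; expect warnings about unused values in the first two.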