Many sources online (for example, https://en.cppreference.com/w/cpp/preprocessor/conditional#Condition_evaluation) say that the controlling expression of #if need only be an integral constant expression.
The following are all integral constant expressions without any identifiers in them:
#include <compare>
#if (1 <=> 2) > 0
#error 1 > 2
#endif
#if (([]{}()), 0)
#error 0
#endif
#if 1.2 < 0.0
#error 1.2 < 0.0
#endif
#if ""[0]
#error null terminator is true
#endif
#if *""
#error null terminator is true
#endif
Yet they fail to compile with clang or gcc, so there obviously are some limitations.
The grammar for the #if directive is given in [cpp.pre] in the standard as:
if-group:
# if constant-expression new-line group_opt
All of the previous expressions fit the grammar of constant-expression.
It goes on later to say (in [cpp.cond]):
Paragraph 1:
The expression that controls conditional inclusion shall be an integral constant expression except that identifiers (including those lexically identical to keywords) are interpreted as described below
Paragraph 8:
Each preprocessing token that remains (in the list of preprocessing tokens that will become the controlling expression) after all macro replacements have occurred shall be in the lexical form of a token.
All of the preprocessing tokens seem to be in the form of [lex.token]:
token:
identifier
keyword
literal
operator-or-punctuator
<=>, >, [, ], {, }, (, ), and * are each an operator-or-punctuator
1, 2, 0, 1.2, 0.0, and "" are all literals
So what part of the standard rules out these forms of expressions? And what subset of integral constant expressions are allowed?
I think that all of these examples are intended to be ill-formed, although as you demonstrate the current standard wording doesn't have that effect.
This seems to be tracked as active CWG issue 1436. The proposed resolution would disqualify string literals, floating point literals and also <=> from #if conditions. (Although <=> was added to the language after the issue description was written.) I suppose it is also meant to disallow lambdas, but that may not be covered by the proposed wording.
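For contrast, here is a minimal sketch of conditions that stay inside the subset compilers actually implement: integer and character literals combined with the usual integer operators (SOME_MACRO is a hypothetical name used only for illustration):
#if (1 + 2) * 3 > 'A'   // arithmetic on integer and character literals is accepted everywhere
#error unreachable
#endif
#if defined(SOME_MACRO) && (SOME_MACRO > 0)
#endif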
Per C++11 (and newer) this code is valid:
#if 1.0 > 2.0 ? 1 : 0
#endif
However, most (if not all) C++ compilers reject it:
$ echo "#if 1.0 > 2.0 ? 1 : 0" | g++ -xc++ - -std=c++11 -pedantic -c
<stdin>:1:5: error: floating constant in preprocessor expression
<stdin>:1:11: error: floating constant in preprocessor expression
N4849 has this:
The expression that controls conditional inclusion shall be an integral constant expression except that identifiers (including those lexically identical to keywords) are interpreted as described below and it may contain zero or more defined-macro-expressions and/or has-include-expressions and/or has-attribute-expressions as unary operator expressions.
and this:
An integral constant expression is an expression of integral or unscoped enumeration type, implicitly converted to a prvalue, where the converted expression is a core constant expression.
The expression 1.0 > 2.0 ? 1 : 0 is an integral constant expression.
So where does the C++ standard prohibit using a floating-point literal (for example) in the expression that controls conditional inclusion?
Answer from Richard Smith:
This is an error in the standard wording. See http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1436 for details and a proposed fix -- though that fix is known to be wrong too (it permits lambda-expressions).
My book defines an expression as "a programming statement that has a value" and a literal as "a piece of data that is written directly into a program's source code", but I'm still having some trouble distinguishing between the two. For example, is 3+3 a literal AND an expression, or just an expression? Why?
int number = 2+2;
Is this whole statement an expression, or just the right-hand side? Why? The whole statement seems to have a value of 4, so surely it is an expression?
In my mind, an expression usually involves operators and a literal involves a single piece of data like 4, "Hello", 'A', etc. I also understand that a literal can be an expression because of unary operators such as - or +. Am I correct in thinking this?
An expression is a sequence of operators and operands that specifies a computation. An expression can result in a value and can cause side effects.
A literal is one of the following:
integer literal
character literal
floating point literal
string literal
boolean literal
pointer literal
I won't try to give the formal definition of each of these, but each is basically just a value.
There's one more type of literal that's somewhat special though:
user-defined literal
Although user-defined literals are literals, the value of the literal is defined in terms of the result of evaluating an expression.
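For instance, here is a minimal sketch of a user-defined literal whose value is computed by evaluating an expression (the _km suffix is invented here purely for illustration):
constexpr long double operator""_km(long double v) { return v * 1000.0L; }  // hypothetical suffix
constexpr long double d = 1.5_km;  // the literal 1.5_km evaluates the call above, yielding 1500.0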
References:
Expressions: [expr]
Literals: [lex.literal]
(For those unfamiliar with it, the tag in square brackets is the notation used to specify sections in the C++ standard).
A literal is something like the number 7 for example. When converted to assembly code, the literal 7 remains quite visible in the code:
MOV R1, 7 ; move the number 7 as a value into register R1
An expression is something that needs to be evaluated. Generally, you'll find something along the lines of C=A+B; where A+B is an expression.
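Continuing that sketch in the same hypothetical assembly, the expression A+B becomes an explicit evaluation step before the result is stored:
LDR R1, A      ; load operand A
LDR R2, B      ; load operand B
ADD R3, R1, R2 ; evaluate the expression A+B
STR R3, C      ; store the result into C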
An expression is a sequence of operators and their operands, that specifies a computation. Expression evaluation may produce a result (e.g., evaluation of 2+2 produces the result 4) and may generate side-effects (e.g. evaluation of std::printf("%d",4) prints the character '4' on the standard output).
http://en.cppreference.com/w/cpp/language/expressions
http://en.cppreference.com/w/cpp/language/expressions#Literals
I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.
If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.
The order of evaluation of preprocessor operators is in general not defined (*) in C90 (see The token pasting operator). Several sources (including the MISRA Committee and the referenced page) state that in some cases the order in which multiple ##/# operators are applied influences whether undefined behavior occurs. But I have a really hard time understanding the examples in these sources and pinning down the common rule.
So my questions are:
What are the rules for valid pp-tokens?
Are there differences between the different C and C++ Standards?
My current problem: Is the following legal under both operator orders? (**)
#define test(A) test_## A ## _THING
int test(0001) = 2;
Comments:
(*) I don't say "is undefined" because IMHO this has nothing to do with undefined behavior yet, but rather with unspecified behavior. Applying more than one ## or # operator does not in general render the program erroneous. There is obviously an order (we just can't predict which), so the order is unspecified.
(**) This is not the actual application of the numbering, but the pattern is equivalent.
What are the rules for valid pp-tokens?
These are spelled out in the respective standards; C11 §6.4 and C++11 §2.4. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.
With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)
In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprise multiple preprocessing-tokens.
Are there differences between C and C++?
Yes, there are slight differences:
C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.
C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)
C allows hexadecimal floating point literals (eg. 0x1.1p-3), which do not form a single preprocessing-token in C++ before C++17 (the C++11 pp-number grammar does not allow a sign after p).
C++ allows apostrophes to be used in integer literals as separators (1'000'000'000). In C before C23 (which adopted digit separators too), this would probably result in unmatched apostrophes.
There are minor differences in the handling of universal character names (eg. \u0234).
In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there are no exceptions to the longest-match rule, so the first token of <:: will always be <:.
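A minimal sketch of that last difference (the names S and X are illustrative):
struct X {};
template<class T> struct S {};
S<::X> s;  // C++11: the exception applies, tokenised as <, ::, X
           // C and C++03: longest match gives <: (digraph for [), then :, a syntax error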
Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?
Yes, that is legal in both languages.
Left-to-right:
test_ ## 0001 => test_0001 (identifier)
test_0001 ## _THING => test_0001_THING (identifier)
Right-to-left:
0001 ## _THING => 0001_THING (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
What are examples of invalid token concatenation?
Suppose we have
#define concat3(a, b, c) a ## b ## c
Now, the following are invalid regardless of concatenation order:
concat3(., ., .)
.. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.
concat3(27,e,-7)
Here, -7 is two tokens, so it cannot be concatenated.
And here is a case in which concatenation order matters:
concat3(27e, -, 7)
If this is concatenated left-to-right, it will result in 27e- ## 7, which is the concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token.
What exactly is a pp-number?
In general terms, a pp-number is a superset of tokens which might be converted into (single) numeric literals or might be invalid. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from the periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:
It starts with a decimal digit or a period (.) followed by a decimal digit.
The rest of the token is letters, numbers and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages, and also P or p in C since C99 (and in C++ since C++17).
In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
Once preprocessing is complete, all pp-numbers are converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.
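A few samples to make this concrete (a sketch; the classifications follow the informal rules above):
123 (pp-number, valid integer constant)
.5e+3 (pp-number, valid floating constant: 500.0)
0001_THING (pp-number; a valid token in C++ only if a matching literal operator exists)
3.14.15 (pp-number, but matches no numeric literal: ill-formed if it survives preprocessing)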
#define test(A) test_## A ## _THING
int test(0001) = 2;
This is legal with both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessing-tokens. The former is an identifier, while the latter is a pp-number; pp-numbers are not checked for suffix correctness until a later stage of compilation; think of e.g. 0001u, an unsigned octal literal.
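In fact, in C++11 and later a pp-number like 0001_THING can even become a valid token, assuming a matching literal operator is in scope (a sketch; the _THING operator is invented here):
constexpr unsigned long long operator""_THING(unsigned long long v) { return v; }  // hypothetical
auto x = 0001_THING;  // one pp-number, resolved as a user-defined literal in a later phase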
A few examples to show that the order of evaluation does matter:
#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c) a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3); // OK LTR, invalid RTL
#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:); // invalid LTR, OK RTL
If your compiler uses a consistent order of evaluation (either LTR or RTL) and presents an error on generation of an invalid pp-token, then precisely one of these lines will generate an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.
The second example is rather contrived; because of the way the grammar is constructed it's very difficult to find a pp-token that is valid when built RTL but not when built LTR.
There are no significant differences between C and C++ in this regard; the two standards have identical language (up to section headers) describing the process of macro replacement. The only way the language could influence the process would be in the valid preprocessing-tokens: C++ (especially recently) has more forms of valid preprocessing-tokens, such as user-defined string literals.
According to the language specification, the lexical elements are defined like this:
token:
keyword
identifier
constant
string-literal
operator
punctuator
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
operator
punctuator
each non-white-space character that cannot be one of the above
Why is there a distinction between a number and a character on the preprocessing token level, whereas on the token level, there are only constants? I don't see the benefit in this distinction.
The names of the non-terminals in the C grammars are not normative; they simply exist for purposes of description. It is only important that the behaviour be correctly described. The grammar itself is not sufficient to describe the language; it needs to be read along with the text, which imposes further restrictions on well-formed programs.
There is not a one-to-one relationship between preprocessor tokens and program tokens. There is overlap: a preprocessor identifier might be a keyword, or it might be one of the various definable symbol types (including some constants and typedef-names). A pp-number might be an integer or floating constant, but it might also be invalid. The lexical productions are not all mutually exclusive, and the actual assignment of a lexical category to a substring of the program requires procedures described in the standard text, not just the formal grammar.
Character constants pass directly from the preprocessor into the program syntax without modification (although they are then subsumed into the constant category). The fact that there is even a single rule specific to preprocessor numbers (such as the requirement that they be convertible into a real numeric constant if they survive the preprocessor) is sufficient reason to keep the category.
Also, what would it add to include character-constant in the definition of pp-number? You still need both productions in order to describe the language.
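A minimal sketch of that last point: token pasting can form a perfectly good pp-number that only fails when it must become a real constant:
#define CAT(a, b) a##b
int ok = CAT(0x, 1);     // pastes to the pp-number 0x1, a valid hexadecimal constant
// int bad = CAT(08, 9); // pastes to the pp-number 089: a valid paste, but 8 and 9 are not
//                       // octal digits, so no numeric constant matches and it is ill-formed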