C++ macro expansion in include directive

In the current draft of the C++ standard (N4830, August 2019) there are the following paragraphs:
[cpp.include] p.2:
A preprocessing directive of the form #include < h-char-sequence > new-line [...]
[cpp.include] p.3:
A preprocessing directive of the form # include " q-char-sequence " new-line [...]
[cpp.include] p.4:
A preprocessing directive of the form # include pp-tokens new-line
(that does not match one of the two previous forms) is permitted. The preprocessing tokens after include in the directive are processed just as in normal text (i.e., each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens). If the directive resulting after all replacements does not match one of the two previous forms, the behavior is undefined. The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters is combined into a single header name preprocessing token is implementation-defined.
The sentence "If the directive resulting after all replacements does not match one of the two previous forms, the behavior is undefined" states that after the macro expansion is performed, the directive must match one of the two forms presented in the previous two paragraphs ([cpp.include]/2 and [cpp.include]/3). I will refer to this operation as check.
The following sentence, "The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters is combined into a single header name preprocessing token is implementation-defined", implies that there is an implementation-defined process that transforms a sequence of preprocessing tokens delimited by < > or by " " into a single preprocessing token (a header-name). I will refer to this operation as process.
My first question is whether process is only applied in the situation described in [cpp.include] p.4. I believe so, because tokenization is performed before the processing of preprocessing directives; therefore, in the first two forms of the include directive there is exactly one preprocessing token (a header-name) after "#include", not "a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters".
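For concreteness, the third form arises from code like the following (a sketch; the macro name HEADER is my own):
#define HEADER <iostream>   // the replacement list is three pp-tokens: <, iostream, >
#include HEADER             // third form: expanded, then combined into a header-name
(No header-name is formed in the #define line, since header-names are only formed within a #include directive.)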
My second question is how check is performed. Is it performed before process? If so, is a sequence of preprocessing tokens separated by white-space sequences (where a white-space sequence can have zero or more white-space characters) compared to a sequence of characters by going lower in the "parsing hierarchy", so that each preprocessing token in the first sequence is reverted to the characters forming it?
For example, can the sequence of preprocessing tokens <io[space][space]stream[space]> be considered to match < h-char-sequence > once it is converted to the sequence of characters [<][i][o][space][space][s][t][r][e][a][m][space][>] (where [.] denotes an individual character)?
Assuming that check is done first and it succeeds, process should then be applied to the sequence of preprocessing tokens in order to turn it into a single preprocessing token (a header-name).
Are the details of these two operations and their ordering correct?
My last question is whether the last sentence in [cpp.include] p.4 is partly wrong. This is because it says: "[...] a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters". I do not think that the third form of the include directive can result (even after macro expansion) in a sequence of preprocessing tokens between a pair of " characters, because the " character alone is not a preprocessing token ([lex.pptoken] p.2) so I don't see any sequence of preprocessing tokens that can expand into "a sequence of preprocessing tokens between a pair of " characters".
Thank you.

Related

Is there any relationship between the wording in [cpp.pre]/7 and its example?

[cpp.pre]/7:
The preprocessing tokens within a preprocessing directive are not
subject to macro expansion unless otherwise stated.
[Example 2: In:
#define EMPTY
EMPTY # include <file.h>
the sequence of preprocessing tokens on the second line is not a
preprocessing directive, because it does not begin with a # at the
start of translation phase 4, even though it will do so after the
macro EMPTY has been replaced. — end example]

Why can't a paired comment be placed inside a string in C++?

Normally, anything inside /* and */ is considered a comment.
But the statement,
std::cout << "not-a-comment /* comment */";
prints not-a-comment /* comment */ instead of not-a-comment.
Why does this happen? Are there any other places in C++ where I can't use comments?
This is a consequence of the maximum munch principle, a lexing rule that the C++ language follows. When processing a source file, translation is divided into (logical) phases. During phase 3, we get preprocessing tokens:
[lex.phases]
1.3 The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or in a partial
comment. Each comment is replaced by one space character. New-line
characters are retained.
Replacing comments with white space is done in that same phase. Now, a string literal is a pp-token:
[lex.pptoken]
preprocessing-token:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
As are other literals. And the maximum munch principle tells us that:
3 If the input stream has been parsed into preprocessing tokens up to a
given character:
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that
would cause further lexical analysis to fail, except that a
header-name is only formed within a #include directive.
So because the preprocessor found the opening ", it must keep looking for the longest sequence of characters that will make a valid pp-token (in this case, a string literal). This sequence ends at the closing ". That's why it can't stop and handle the comment: it is obligated to consume everything up to the closing quotation mark.
Following these rules, you can pinpoint the places where comments will not be handled by the preprocessor as comments.
Why does this happen?
Because the comment becomes part of the string literal (everything between the "" double quotes).
Are there any other places in C++ where I can't use comments?
Yes, the same applies for character literals (using '' single quotes).
You can think of it as single and double quotes having higher precedence than the comment delimiters /* */.
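A minimal sketch showing both cases:
#include <iostream>

int main()
{
    // Everything between the quotes is one string-literal pp-token,
    // so the /* ... */ inside is not treated as a comment:
    std::cout << "not-a-comment /* comment */" << '\n';

    // Likewise, quotes in a character literal are consumed first:
    char c = '/';                            // a plain '/' character
    std::cout << c << c << "still code\n";   // prints //still code
    return 0;
}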

Concatenation and the standard

According to this page, "A ## operator between any two successive identifiers in the replacement-list runs parameter replacement on the two identifiers". That is, the preprocessor operator ## acts on identifiers. Microsoft's page says that "each occurrence of the token-pasting operator in token-string is removed, and the tokens preceding and following it are concatenated". That is, the preprocessor operator ## acts on tokens.
I have looked for a definition of an identifier and/or token and the most I have found is this link: "An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters. A valid identifier must begin with a non-digit character".
According to that definition, the following macros should not work (on two counts):
#define PROB1(x) x##0000
#define PROB2(x,y) x##y
int PROB1(z) = PROB2( 1, 2 * 3 );
Does the standard have some rigorous definitions regarding ## and the objects it acts on? Or, is it mostly 'try and see if it works' (a.k.a. implementation defined)?
The standard is extremely precise, both about what can be concatenated, and about what a valid token is.
The en.cppreference.com page is imprecise; what are concatenated are preprocessing tokens, not identifiers. The Microsoft page is much closer to the standard, although it omits some details and fails to distinguish "preprocessing token" from "token", which are slightly different concepts.
What the standard actually says (§16.3.3/3):
For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an
argument) is deleted and the preceding preprocessing token is concatenated with the following preprocessing token.…
For reference, "preprocessing token" is defined in §2.4 to be one of the following:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
Most of the time, the tokens to be combined are identifiers (and numbers), but it is quite possible to generate a multicharacter token by concatenating individual characters. (Given the last item in the list of possible preprocessing tokens, any single non-white-space character is a preprocessing token, even if it is not a letter, digit, or standard punctuation symbol.)
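For instance, a minimal sketch (the macro name PASTE is my own):
#define PASTE(a, b) a ## b

int v = PASTE(1, 2) PASTE(<, <) 3;  // 1 ## 2 gives the pp-number 12,
                                    // < ## < gives the single token <<;
                                    // the line expands to: int v = 12 << 3;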
The result of a concatenation must be a preprocessing token:
If the result is not a valid preprocessing token, the behavior is undefined. The resulting token is available for further macro replacement.
Note that the replacement of a function-like macro's argument names with the actual arguments may result in the argument name being replaced by 0 tokens or more than one token. If that argument is used on either side of a concatenation operator:
In the case that the actual argument had zero tokens, nothing is concatenated (see the sketch after this list). (The Microsoft page implies that the concatenation operator will concatenate whatever tokens end up preceding and following it.)
In the case that the actual argument has more than one token, the one which is concatenated is the one which precedes or follows the concatenation operator.
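For the first case, a minimal sketch (the macro name JOIN is my own; empty macro arguments are permitted):
#define JOIN(a, b) a ## b

int JOIN(value, ) = 42;   // b has zero tokens, so nothing is pasted;
                          // the expansion is simply: int value = 42;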
As an example of the last case, remember that -42 is two preprocessing tokens (and two tokens, after preprocessing): - and 42. Consequently, although you can concatenate the pp-number 42E with the pp-number 3, resulting in the pp-number (and valid token) 42E3, you cannot create the token 42E-3 from 42E and -3, because only the - would be concatenated, resulting in the two pp-number tokens 42E- and 3. (The first of these is a valid preprocessing token, but it cannot be converted into a valid token, so a tokenization error will be reported.)
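The same point, as a sketch (the macro name CAT is my own):
#define CAT(a, b) a ## b

double ok = CAT(42E, 3);       // pastes 42E and 3 into the valid token 42E3
// double bad = CAT(42E, -3);  // pastes 42E and - only, leaving 42E- 3: a tokenization error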
In a sequence of concatenations:
#define concat3(a,b,c) a ## b ## c
the order of concatenations is not defined. So it is unspecified whether concat3(42E,-,3) is valid; if the first two tokens are concatenated first, all is well, but if the second two are concatenated first, the result is not a valid preprocessing token. On the other hand, concat3(.,.,.) must be an error, because .. is not a valid token, and so neither a##b nor b##c can be processed. So it is impossible to produce the token ... with concatenation.

Unclear #define syntax in C++ using the `\` sign

#define is_module_error(_module_,_error_) \
((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
#define is_general_error(_error_) is_module_error(general,_error_)
#define is_network_error(_error_) is_module_error(network,_error_)
Can someone please explain to me what the first define means?
How is it evaluated?
I don't understand what the \ sign means here.
The backslash is the line continuation symbol used in preprocessor directives. It tells the preprocessor to merge the following line with the current one. In other words, it escapes the hard newline at the end of the line.
In the specific example, it tells the preprocessor that
#define is_module_error(_module_,_error_) \
((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
should be interpreted as:
#define is_module_error(_module_,_error_) ((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
The relevant quote from the C99 draft standard (N1256) is the following:
6.10 Preprocessing directives
[...]
Description
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character. The last token in the sequence is the first new-line character that follows the first token in the sequence. A new-line character ends
the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
Emphasis on the relevant sentence is mine.
If you are also unsure of what the ## symbol means, it is the token-pasting operator. From the already cited C99 document (emphasis mine):
6.10.3.3 The ## operator
[...]
Semantics
If, in the replacement list of a function-like macro, a parameter is immediately preceded
or followed by a ## preprocessing token, the parameter is replaced by the corresponding
argument’s preprocessing token sequence; however, if an argument consists of no preprocessing tokens, the parameter is replaced by a placemarker preprocessing token instead.
In the case at hand this means that, for example, wherever the preprocessor finds the following macro "call":
is_module_error(dangerous_module,blow_up_error)
it will replace it with this code fragment:
((dangerous_module_errors<blow_up_error)&&(blow_up_error<dangerous_module_errors_end))
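As a minimal, compilable sketch (the enum and its error codes are hypothetical, invented to match the macro's naming convention):
// Hypothetical layout: <module>_errors and <module>_errors_end bracket
// the error codes that belong to the module.
enum {
    network_errors = 100,   // marker: first network error
    connection_lost,        // 101
    timeout,                // 102
    network_errors_end      // marker: one past the last network error
};

#define is_module_error(_module_,_error_) \
((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
#define is_network_error(_error_) is_module_error(network,_error_)

static_assert(is_network_error(timeout), "timeout lies in the network range");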

Why is this C or C++ macro not expanded by the preprocessor?

Can someone point out the problem in this code when compiled with gcc 4.1.0?
#define X 10
int main()
{
    double a = 1e-X;
    return 0;
}
I am getting the error: exponent has no digits.
When I replace X with 10, it works fine. Also I checked with g++ -E command to see the file with preprocessors applied, it has not replaced X with 10.
I was under the impression that preprocessor replaces every macro defined in the file with the replacement text with applying any intelligence. Am I wrong?
I know this is a really silly question but I am confused and I would rather be silly than confused :).
Any comments/suggestions?
The preprocessor is not a text processor; it works on the level of tokens. In your code, after the define, every occurrence of the token X would be replaced by the token 10. However, there is no token X in the rest of your code.
1e-X is syntactically invalid and cannot be turned into a token, which is basically what the error is telling you (it says that to make it a valid token -- in this case a floating point literal -- you have to provide a valid exponent).
When you write 1e-X all together like that, the X isn't a separate symbol for the preprocessor to replace; there needs to be whitespace (or certain other symbols) on either side. Think about it a little and you'll realize why... :)
Edit: "12-X" is valid because it gets parsed as "12", "-", "X" which are three separate tokens. "1e-X" can't be split like that because "1e-" doesn't form a valid token by itself, as Jonathan mentioned in his answer.
As for the solution to your problem, you can use token-concatenation:
#define E(X) 1e-##X
int main()
{
    double a = E(10); // expands to 1e-10
    return 0;
}
Several people have said that 1e-X is lexed as a single token, which is partially correct. To explain:
There are two classes of tokens during translation: preprocessing tokens and tokens. A source file is initially decomposed into preprocessing tokens; these tokens are then used in all of the preprocessing tasks, including macro replacement. After preprocessing, each preprocessing token is converted into a token; these resulting tokens are used during actual compilation.
There are fewer types of preprocessing tokens than there are types of tokens. For example, keywords (e.g. for, while, if) are not significant during the preprocessing phases, so there is no keyword preprocessing token. Keywords are simply lexed as identifiers. When the conversion from preprocessing tokens to tokens takes place, each identifier preprocessing token is inspected; if it matches a keyword, it is converted into a keyword token; otherwise it is converted into an identifier token.
There is only one type of numeric token during preprocessing: preprocessing number. This type of preprocessing token corresponds to two different types of tokens: integer literal and floating literal.
The preprocessing number preprocessing token is defined very broadly. Effectively it matches any sequence of characters that begins with a digit or a decimal point followed by any number of digits, nondigits (e.g. letters), and e+ and e-. So, all of the following are valid preprocessing number preprocessing tokens:
1.0e-10
.78
42
1e-X
1helloworld
The first two can be converted into floating literals; the third can be converted into an integer literal. The last two are not valid integer literals or floating literals; those preprocessing tokens cannot be converted into tokens. This is why you can preprocess the source without error but cannot compile it: the error occurs in the conversion from preprocessing tokens to tokens.
GCC 4.5.0 doesn't change the X either.
The answer is going to lie in how the preprocessor interprets preprocessing tokens - and in the 'maximal munch' rule. The 'maximal munch' rule is what dictates that 'x+++++y' is treated as 'x ++ ++ + y' and hence is erroneous, rather than as 'x ++ + ++ y' which is legitimate.
The issue is why does the preprocessor interpret '1e-X' as a single preprocessing token. Clearly, it will treat '1e-10' as a single token. There is no valid interpretation for '1e-' unless it is followed by a digit once it passes the preprocessor. So, I have to guess that the preprocessor sees '1e-X' as a single token (actually erroneous). But I have not dissected the correct clauses in the standard to see where it is required. But the definition of a 'preprocessing number' or 'pp-number' in the standard (see below) is somewhat different from the definition of a valid integer or floating point constant and allows many 'pp-numbers' that are not valid as an integer or floating point constant.
If it helps, the output of the Sun C Compiler for 'cc -E -v soq.c' is:
# 1 "soq.c"
# 2
int main()
{
"soq.c", line 4: invalid input token: 1e-X
double a = 1e-X ;
return 0;
}
#ident "acomp: Sun C 5.9 SunOS_sparc Patch 124867-09 2008/11/25"
cc: acomp failed for soq.c
So, at least one C compiler rejects the code in the preprocessor - it might be that the GCC preprocessor is a little slack (I tried to provoke it into complaining with gcc -Wall -pedantic -std=c89 -Wextra -E soq.c but it did not utter a squeak). And using 3 X's in both the macro and the '1e-XXX' notation showed that all three X's were consumed by both GCC and Sun C Compiler.
C Standard Definition of Preprocessing Number
From the C Standard - ISO/IEC 9899:1999 §6.4.8 Preprocessing Numbers:
pp-number:
digit
. digit
pp-number digit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
Given this, '1e-X' is a valid 'pp-number', and therefore the X is not a separate token (nor is the 'XXX' in '1e-XXX' a separate token). Therefore, the preprocessor cannot expand the X; it isn't a separate token subject to expansion.
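To make the derivation explicit, '1e-X' is consumed as one pp-number step by step: 1 is a pp-number (digit); 1e- is then a pp-number (pp-number e sign); and 1e-X is a pp-number (pp-number identifier-nondigit). The entire spelling is one preprocessing token, leaving no X for the preprocessor to replace.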