Concatenation and the standard - c++

According to this page "A ## operator between any two successive identifiers in the replacement-list runs parameter replacement on the two identifiers". That is, the preprocessor operator ## acts on identifiers. Microsoft's page says ", each occurrence of the token-pasting operator in token-string is removed, and the tokens preceding and following it are concatenated". That is, the preprocessor operator ## acts on tokens.
I have looked for a definition of an identifier and/or token and the most I have found is this link: "An identifier is an arbitrary long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters. A valid identifier must begin with a non-digit character".
According to that definition, the following macro should not work (on two accounts):
#define PROB1(x) x##0000
#define PROB2(x,y) x##y
int PROB1(z) = PROB2( 1, 2 * 3 );
Does the standard have some rigorous definitions regarding ## and the objects it acts on? Or, is it mostly 'try and see if it works' (a.k.a. implementation defined)?

The standard is extremely precise, both about what can be concatenated, and about what a valid token is.
The en.cppreference.com page is imprecise; what are concatenated are preprocessing tokens, not identifiers. The Microsoft page is much closer to the standard, although it omits some details and fails to distinguish "preprocessing token" from "token", which are slightly different concepts.
What the standard actually says (§16.3.3/3):
For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an
argument) is deleted and the preceding preprocessing token is concatenated with the following preprocessing token.…
For reference, "preprocessing token" is defined in §2.4 to be one of the following:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
Most of the time, the tokens to be combined are identifiers (and numbers), but it is quite possible to generate a multicharacter token by concatenating individual characters. (Given the last item in the list of possible preprocessor tokens, any single non-whitespace character is a preprocessor token, even if it is not a letter, digit or standard punctuation symbol.)
The result of a concatenation must be a preprocessing token:
If the result is not a valid preprocessing token, the behavior is undefined. The resulting token is available for further macro replacement.
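As a minimal sketch of concatenations whose results are valid preprocessing tokens, the following pastes two identifiers, two pp-numbers and two punctuators. The macro name CAT and the variable names are hypothetical, chosen only for illustration.

```cpp
// Hypothetical helper macro for illustration; not taken from the standard.
#define CAT(a, b) a##b

constexpr int CAT(coun, ter) = 3;        // identifier ## identifier -> counter
constexpr int number = CAT(12, 34);      // pp-number ## pp-number   -> 1234
constexpr int shifted = 1 CAT(<, <) 4;   // punctuator ## punctuator -> <<

static_assert(counter == 3, "the pasted identifier names the variable");
static_assert(number == 1234, "the pasted digits form one literal");
static_assert(shifted == 16, "the pasted punctuators form the shift operator");
```

Each result is a single preprocessing token, so none of these lines relies on undefined behaviour.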
Note that the replacement of a function-like macro's argument names with the actual arguments may result in the argument name being replaced by 0 tokens or more than one token. If that argument is used on either side of a concatenation operator:
In the case that the actual argument had zero tokens, nothing is concatenated. (The Microsoft page implies that the concatenation operator will concatenate whatever tokens end up preceding and following it.)
In the case that the actual argument has more than one token, the one which is concatenated is the one which precedes or follows the concatenation operator.
As an example of the last case, remember that -42 is two preprocessing tokens (and two tokens, after preprocessing): - and 42. Consequently, although you can concatenate the pp-number 42E with the pp-number 3, resulting in the pp-number (and valid token) 42E3, you cannot create the token 42E-3 from 42E and -3, because only the - would be concatenated, resulting in two pp-number tokens: 42E- and 3. (The first of these is a valid preprocessing token but it cannot be converted into a valid token, so a tokenization error will be reported.)
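The macros from the question can be checked against these rules. This is a sketch only: constexpr and the extra variables kept and big are additions made here so the results can be verified at compile time.

```cpp
#define PROB1(x) x##0000
#define PROB2(x, y) x##y

// z ## 0000 gives the identifier z0000. In PROB2 only the first token of the
// second argument (2) is pasted onto 1, giving 12; "* 3" follows unchanged.
constexpr int PROB1(z) = PROB2(1, 2 * 3);   // constexpr int z0000 = 12 * 3;
static_assert(z0000 == 36, "12 * 3");

// An argument with zero tokens contributes nothing to the concatenation.
constexpr int kept = PROB2(z0000, );        // z0000
static_assert(kept == 36, "");

// 42E and 3 are both pp-numbers; the result 42E3 is a valid floating literal.
constexpr double big = PROB2(42E, 3);
static_assert(big == 42000.0, "");
```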
In a sequence of concatenations:
#define concat3(a,b,c) a ## b ## c
the order of concatenations is not defined. So it is unspecified whether concat3(42E,-,3) is valid; if the first two tokens are concatenated first, all is well, but if the second two are concatenated first, the result is not a valid preprocessing token. On the other hand, concat3(.,.,.) must be an error, because .. is not a valid token, and so neither a##b nor b##c can be processed. So it is impossible to produce the token ... with concatenation.
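When the order matters, one common workaround is to force it with nested two-argument macros. This is a sketch: PASTE2, PASTE and CONCAT3_LTR are hypothetical names, and it relies on the fact that macro arguments not adjacent to # or ## are fully expanded before substitution.

```cpp
#define PASTE2(a, b) a##b
#define PASTE(a, b) PASTE2(a, b)                     // arguments are expanded first
#define CONCAT3_LTR(a, b, c) PASTE(PASTE(a, b), c)   // always (a ## b) ## c

// 42E ## - gives the pp-number 42E-, then 42E- ## 3 gives 42E-3.
constexpr double milli = CONCAT3_LTR(42E, -, 3);
static_assert(milli == 42E-3, "same value as the literal written directly");
```

Each ## here has exactly one possible order, so the result no longer depends on the implementation's choice.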

Related

Why paired comment can't be placed inside a string in c++?

Normally anything inside /* and */ is considered as a comment.
But the statement,
std::cout << "not-a-comment /* comment */";
prints not-a-comment /* comment */ instead of not-a-comment.
Why does this happen? Are there any other places in c++ where I can't use comments?
This is a consequence of the maximum munch principle, a lexing rule that the C++ language follows. When processing a source file, translation is divided into (logical) phases. During phase 3, we get preprocessing tokens:
[lex.phases]
1.3 The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or in a partial
comment. Each comment is replaced by one space character. New-line
characters are retained.
Turning comments into white space pp-tokens is done at the same phase. Now a string literal is a pp-token:
[lex.pptoken]
preprocessing-token:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
As are other literals. And the maximum munch principle tells us that:
3 If the input stream has been parsed into preprocessing tokens up to a
given character:
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that
would cause further lexical analysis to fail, except that a
header-name is only formed within a #include directive.
So because preprocessing found the opening ", it must keep looking for the longest sequence of characters that will make a valid pp-token (in this case, the token is a string literal). This sequence ends at the closing ". That's why it can't stop and handle the comment, because it is obligated to consume up to the closing quotation mark.
Following these rules you can pin-point the places where comments will not be handled by the pre-processor as comments.
Why does this happen?
Because the comment becomes part of the string literal (everything between the "" double quotes).
Are there any other places in c++ where I can't use comments?
Yes, the same applies for character literals (using '' single quotes).
You can think of single and double quotes as taking precedence over the comment delimiters /* and */.
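A small compile-time check of this behaviour, as a sketch with hypothetical variable names: comment markers inside a string literal are ordinary characters, while the same markers outside a literal are replaced by a space.

```cpp
// Comment markers inside a string literal are kept as ordinary characters.
const char text[] = "not-a-comment /* comment */";
const char slashes[] = "a//b";   // "//" does not start a comment here

// Outside a literal the comment is replaced by one space.
const int sum = 1 /* a real comment */ + 2;

static_assert(sizeof(text) == 28, "27 characters plus the terminating null");
static_assert(sizeof(slashes) == 5, "4 characters plus the terminating null");
```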

What are the definitions for valid and invalid pp-tokens?

I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.
If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.
The order of preprocessor operation in general is not defined (*) in C90 (see The token pasting operator). Now in some cases it happens (said so in different sources, including the MISRA Committee, and the referenced page) that the order of multiple ##/#-Operators influences the occurrence of undefined behavior. But I have a really hard time understanding the examples in these sources and pinning down the common rule.
So my questions are:
What are the rules for valid pp-tokens?
Are there differences between the different C and C++ Standards?
My current problem: Is the following legal with both operator orders?(**)
#define test(A) test_## A ## _THING
int test(0001) = 2;
Comments:
(*) I don't use "is undefined" because this has nothing to do with undefined behavior yet IMHO, but rather unspecified behavior. More than one ## or # operator being applied do not in general render the program to be erroneous. There is obviously an order — we just can't predict which — so the order is unspecified.
(**) This is no actual application for the numbering. But the pattern is equivalent.
What are the rules for valid pp-tokens?
These are spelled out in the respective standards; C11 §6.4 and C++11 §2.4. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.
With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)
In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprise multiple preprocessing-tokens.
Are there differences between C and C++?
Yes, there are slight differences:
C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.
C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)
C allows hexadecimal floating point literals (eg. 0x1.1p-3), which are not single preprocessing-tokens in C++11 (they were added to C++ in C++17).
C++ allows apostrophes to be used in integer literals as separators (1'000'000'000). In C, this would probably result in unmatched apostrophes.
There are minor differences in the handling of universal character names (eg. \u0234).
In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there are no exceptions to the longest-match rule; <:: is always tokenised using the longest-match rule, so the first token will always be <:.
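The <:: exception exists so that a template argument starting with :: can follow < directly. A sketch, assuming C++11 or later; the names demo, Item, Box and GlobalBox are hypothetical.

```cpp
#include <type_traits>

namespace demo { struct Item {}; }   // hypothetical names for illustration
template <class T> struct Box {};

// "<::" is followed by 'd', so it is lexed as "<" "::" rather than starting
// with the digraph "<:" (an alternative spelling of "[").
typedef Box<::demo::Item> GlobalBox;

static_assert(std::is_same<GlobalBox, Box<demo::Item> >::value,
              "both spellings name the same type");
```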
Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?
Yes, that is legal in both languages.
test_ ## 0001 => test_0001 (identifier)
test_0001 ## _THING => test_0001_THING (identifier)
0001 ## _THING => 0001_THING (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
What are examples of invalid token concatenation?
Suppose we have
#define concat3(a, b, c) a ## b ## c
Now, the following are invalid regardless of concatenation order:
concat3(., ., .)
.. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.
concat3(27,e,-7)
Here, -7 is two tokens, so it cannot be concatenated.
And here is a case in which concatenation order matters:
concat3(27e, -, 7)
If this is concatenated left-to-right, it will result in 27e- ## 7, which is the concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token.
What exactly is a pp-number?
In general terms, a pp-number is a superset of tokens which might be converted into (single) numeric literals or might be invalid. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from the periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:
It starts with a decimal digit or a period (.) followed by a decimal digit.
The rest of the token is letters, digits and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages; and also P and p in C since C99.
In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
Once preprocessing is terminated, all pp-numbers will be converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.
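The breadth of pp-number has a well-known practical consequence, sketched here with hypothetical names: a sign that directly follows e or E is absorbed into the pp-number, even when the number is a hexadecimal integer.

```cpp
#define STR2(x) #x
#define STR(x) STR2(x)

// With spaces there are three tokens: 0xe, + and 1.
constexpr int spaced = 0xe + 1;
static_assert(spaced == 15, "14 + 1");

// Without spaces, "0xe+1" is one pp-number that matches no numeric literal,
// so the line below would be rejected if it were uncommented:
// int joined = 0xe+1;

// Stringizing shows that such a pp-number is a single preprocessing token.
const char one_token[] = STR(1helloworld);   // "1helloworld"
static_assert(sizeof(one_token) == 12, "11 characters plus the terminating null");
```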
#define test(A) test_## A ## _THING
int test(0001) = 2;
This is legal with both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessing-tokens. The former is an identifier, while the latter is a pp-number; pp-numbers are not checked for suffix correctness until a later stage of compilation (consider, for example, 0001u, an unsigned octal literal).
A few examples to show that the order of evaluation does matter:
#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c) a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3); // OK LTR, invalid RTL
#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:); // invalid LTR, OK RTL
If your compiler uses a consistent order of evaluation (either LTR or RTL) and presents an error on generation of an invalid pp-token, then precisely one of these lines will generate an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.
The second example is rather contrived; because of the way the grammar is constructed, it's very difficult to find a pp-token that is valid when built RTL but not when built LTR.
There are no significant differences between C and C++ in this regard; the two standards have identical language (up to section headers) describing the process of macro replacement. The only way the language could influence the process would be in the valid preprocessing-tokens: C++ (especially recently) has more forms of valid preprocessing-tokens, such as user-defined string literals.

Unclear #define syntax in cpp using `\` sign

#define is_module_error(_module_,_error_) \
((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
#define is_general_error(_error_) is_module_error(general,_error_)
#define is_network_error(_error_) is_module_error(network,_error_)
Can someone please explain to me what the first define means?
How is it evaluated?
I don't understand what the \ sign means here.
The backslash is the line continuation symbol used in preprocessor directives. It tells the preprocessor to merge the following line with the current one. In other words it escapes the hard newline at the end of the line.
In the specific example, it tells the preprocessor that
#define is_module_error(_module_,_error_) \
((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
should be interpreted as:
#define is_module_error(_module_,_error_) ((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
The relevant quote from the C99 draft standard (N1256) is the following:
6.10 Preprocessing directives
[...]
Description
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character. The last token in the sequence is the first new-line character that follows the first token in the sequence. A new-line character ends
the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
Emphasis on the relevant sentence is mine.
If you are also unsure of what the ## symbol means, it is the token-pasting operator. From the already cited C99 document (emphasis mine):
6.10.3.3 The ## operator
[...]
Semantics
If, in the replacement list of a function-like macro, a parameter is immediately preceded
or followed by a ## preprocessing token, the parameter is replaced by the corresponding
argument’s preprocessing token sequence; however, if an argument consists of no preprocessing tokens, the parameter is replaced by a placemarker preprocessing token instead.
In the case at hand this means that, for example, wherever the preprocessor finds the following macro "call":
is_module_error(dangerous_module,blow_up_error)
it will replace it with this code fragment:
((dangerous_module_errors<blow_up_error)&&(blow_up_error<dangerous_module_errors_end))
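To see the macros work end to end, here is a sketch with a made-up set of error codes. The enum and its values are assumptions; only the *_errors and *_errors_end naming pattern is implied by the macros themselves.

```cpp
// Hypothetical error codes; each module's codes sit strictly between
// <module>_errors and <module>_errors_end.
enum error_code {
    general_errors = 0,
    general_bad_argument,
    general_errors_end,
    network_errors = 100,
    network_timeout,
    network_errors_end
};

#define is_module_error(_module_,_error_) \
    ((_module_##_errors<_error_)&&(_error_<_module_##_errors_end))
#define is_general_error(_error_) is_module_error(general,_error_)
#define is_network_error(_error_) is_module_error(network,_error_)

static_assert(is_general_error(general_bad_argument), "inside the general range");
static_assert(!is_general_error(network_timeout), "outside the general range");
static_assert(is_network_error(network_timeout), "inside the network range");
```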

In the C++ standard, where does it indicate the spacing protocol for the replacement of category descriptives by the source code it represents?

At the risk of asking a question deemed too nit-picky, I have spent a long time trying to justify (as a single example of something that occurs throughout the standard in different contexts) the following definition of an integer literal in §2.14.2 of the C++11 standard, specifically in regards to one detail, the presence of whitespace in the syntax notation itself.
(Note that this example - the definition of an integer literal - is not the point of my question. The point of my question is to ask about the syntax description notation used by the C++ standard itself, specifically in regards to whitespace between grammatical category names. The example I give here - the definition of an integer literal - is specifically chosen only because it acts as an example that is simple and clear-cut.)
(Abbreviated for concision, from §2.14.2):
integer-literal:
decimal-literal integer-suffix_opt
decimal-literal:
nonzero-digit
decimal-literal digit
(with nonzero-digit and digit as expected, [0] 1 ... 9). (Note: The above text is all in italics in the standard.)
This all makes sense to me, assuming that the SPACE between the syntax category descriptives decimal-literal and digit is understood to NOT be present in the actual source code, but is only present in the syntax description itself as it appears here in section §2.14.2.
This convention - placing a space between category descriptives within the notation, where it is understood that the space is not to be present in the source code - is used in other places in the specification. The example here is just a clear-cut case where the space is clearly not supposed to be present in the source code. (See addendum to this question for counterexamples from the standard where whitespace or other separator/s must be present, or is optional, between category descriptives when those category descriptives are replaced by actual tokens in the source code.)
Again, at the risk of being nit-picky, I cannot find anywhere in the standard a statement of convention that spaces are NOT to be present in the source code when interpreting notation such as in this example.
The standard does discuss notational convention in §1.6.1 (and thereafter). The only relevant text that I can find regarding this is:
In the syntax notation used in this International Standard, syntactic
categories are indicated by italic type, and literal words and
characters in constant width type. Alternatives are listed on separate
lines except in a few cases where a long set of alternatives is marked
by the phrase “one of.”
I would not be so nit-picky; however, I find the notation used within the standard to be somewhat tricky, so I would like to be clear on all of the details. I appreciate anyone willing to take the time to fill me in on this.
ADDENDUM In response to comments in which a claim is made similar to "it's obvious that whitespace should not be included in the final source code, so there's no need for the standard to explicitly state this": I have chosen a trivial example in this question, where it is obvious. There are many cases in the standard where it isn't obvious without a priori knowledge of the language (in my opinion), such as §8.0.4 discussing "const" and "volatile":
cv-qualifier-seq:
cv-qualifier cv-qualifier-seq_opt
... Note the opposite assumption here (whitespace, or another separator or separators, is required in the final source code), but that's not possible to deduce from the syntax notation itself.
There are also cases where a space is optional, such as:
noptr-abstract-declarator:
noptr-abstract-declarator_opt parameters-and-qualifiers
(In this example, to make a point, I won't give the section number or paraphrase what is being discussed; I'll just ask if it's obvious from the grammar notation itself that, in this context, whitespace in the final source code is optional between the tokens.)
I suspect that the comments along these lines - "it's obvious, so that's what it must be" - are the result of the fact that the example I've chosen is so obvious. That's exactly why I chose the example.
§2.7.1
There are five kinds of tokens: identifiers, keywords, literals,
operators, and other separators. Blanks, horizontal and vertical tabs,
newlines, formfeeds, and comments (collectively, “white space”), as
described below, are ignored except as they serve to separate tokens.
So, if a literal is a token, and whitespace serves to separate tokens, then digits separated by a space would be interpreted as two separate tokens, and therefore cannot be part of the same literal.
I'm reasonably certain there is no more direct explanation of this fact in the standard.
The notation used is similar enough to typical BNF that they take many of the same general conventions for granted, including the fact that whitespace in the notation has no significance beyond separating the tokens of the BNF itself. If and when whitespace has significance in the source code beyond separating tokens, they include notation to specify it directly. For example, for most preprocessing directives, the new-line is specified directly:
# ifdef identifier new-line group_opt
or:
# include <h-char-sequence> new-line
The blame for that probably goes back to the Algol 68 standard, which went so far overboard in its attempts at precisely specifying syntax that it was essentially impossible for anybody to read without weeks of full-time study [1]. Since then, any more than the most cursory explanation of the syntax description language leads to rejection on the basis that it's too much like Algol 68 and will undoubtedly fail because it's too formal and nobody will ever read or understand it.
[1] How could it be that bad, you ask? It basically went like this: they started with a formal English description of a syntax description language. That wasn't used to define Algol 68 though -- it was used to specify (even more precisely) another syntax description language. That second syntax description language was then used to specify the syntax of Algol 68 itself. So, you had to learn two separate syntax description languages before you could start to read the Algol 68 syntax itself at all. As you can undoubtedly guess, almost nobody ever did.
As you say, the standard says:
literal words and characters in constant width type
So, if a literal space were to be included in a rule, it would have to be rendered in a constant width type. Close examination of the standard will reveal that the space in the production you refer to is narrower than the constant width type. (Also your attempt to quote the standard is a misrepresentation because it renders in constant-width type that which should be rendered in italics, with a consequent semantic change.)
Ok, that was the "aspiring language lawyer" answer; furthermore, it doesn't really work because it fails on all the productions which are of the form:
One of:
0 1 2 3 4 5 6 7 8 9
I think, in reality, the answer is that whitespace is not part of the formal grammar, because it serves only to separate tokens; furthermore, that statement is mostly true of the grammar itself, whose tokens are separated by whitespace without that whitespace being a token, except that indentation in the grammar matters, unlike indentation in a program.
Addendum to answer the addendum
It's not actually true that const and volatile need to be separated by whitespace. They simply need to be separate tokens. Example:
#define A(x)x
A(const)A(volatile)A(int)A(x)A(;)
Again, more seriously, Chapter 2 (with particular reference to 2.2 and 2.5, but you have to read the entire text) describes how the program text is processed in order to produce a stream of tokens. All of the rules in which you claim whitespace must be ignored are in this part of the grammar, and all of the rules in which you claim whitespace might be required are not.
These are really two separate grammars, but the lexical grammar is necessarily incomplete because you need to consider the operation of the preprocessor in order to apply it.
I believe that everything I said can be gleaned from the standard. Here are some excerpts:
2.2(3) The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments)… The process of dividing a source file’s characters into preprocessing tokens is context-dependent.
…
2.2(7) White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
I think that all this makes it clear that there are two grammars, one lexical -- that is, it produces a lexeme (token) from a sequence of graphemes (characters) -- and the other syntactic -- that is, it produces an abstract syntax tree from a sequence of lexemes (tokens). In neither case (with a small exception, which I'll get to in a minute) is whitespace considered anything other than something which stops two lexemes from running into each other if the lexical grammar would otherwise allow that. (See the algorithm in 2.5(3).)
C++ is not syntactically pretty, so there are almost always exceptions. One of these, inherited from C, is the difference between:
#define A(X)(X)
and
#define A (X)(X)
Preprocessing directives have their own parsing rules, and this one is typified by the definition:
lparen:
  a ( character not immediately preceded by white-space
This, I would say, is the exception that proves the rule [Note 1]. The fact that it is necessary to say that this ( is not preceded by white-space shows that the normal use of the token ( in a syntactic rule does not say anything about its blancospatial context.
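The difference is easy to observe. In this sketch FN, OBJ and the variable X are hypothetical; the only thing that distinguishes the two definitions is the space before the opening parenthesis.

```cpp
#define FN(X) ((X) + 1)   // "(" touches the name: function-like macro, parameter X
#define OBJ (X) + 1       // space before "(": object-like macro, body is "(X) + 1"

constexpr int X = 10;            // hypothetical variable picked up by OBJ's body
constexpr int from_fn = FN(2);   // ((2) + 1)
constexpr int from_obj = OBJ;    // (X) + 1

static_assert(from_fn == 3, "the parameter was substituted");
static_assert(from_obj == 11, "the body refers to the variable X");
```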
So, to paraphrase Ray Cummings (not Albert Einstein, as is sometimes claimed), "time and white-space are all that separate one token from another." [Note 2]
[Note 1] I use the phrase here in its original legal sense, as per Cicero.
[Note 2]:
"Time," said George, "why I can give you a definition of time. It's what keeps everything from happening at once."
A ripple of laughter went about the little group of men.
"Quite so," agreed the Chemist. "And, gentlemen, that's not so funny as it sounds. As a matter of fact, it is really not a bad scientific definition. Time and space are all that separate one event from another…
-- From The man who mastered time, by Ray Cummings, 1929, Ace Books. See first page, in Google books
The Standard actually has two separate grammars.
The preprocessor grammar, described in sections 2 and 16, defines how a sequence of source characters is converted to a sequence of preprocessing tokens and whitespace characters, in translation phases 1-6. In some of these phases and parts of this grammar, whitespace is significant.
Whitespace characters which are not part of preprocessing tokens stop being significant after translation phase 4. The Standard explicitly says at the start of translation phase 7 to discard whitespace characters between preprocessing tokens.
The language grammar defines how a sequence of tokens (converted from preprocessing tokens) is syntactically and semantically interpreted in translation phase 7. There is no such thing as whitespace in this grammar. (By this point, ' ' is a character-literal just like 'c' is.)
In both grammars, the space between grammar components visible in the Standard has nothing to do with source or execution whitespace characters, it's just there to make the Standard legible. When the preprocessor grammar depends on whitespace, it spells it out with words, for example:
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
and
control-line:
...
# define identifier lparen identifier-list[opt] ) replacement-list newline
...
lparen:
a ( character not immediately preceded by white-space
So there may not be whitespace between digits of an integer-literal because the preprocessor grammar does not allow it.
One other important rule here is from C++11 2.5p3:
If the input stream has been parsed into preprocessing tokens up to a given character:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. ...
Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
So there must be whitespace between const and volatile tokens because otherwise, the longest-token-possible rule would convert that to a single identifier token constvolatile.

Why is this C or C++ macro not expanded by the preprocessor?

Can someone point out the problem in this code when it is compiled with gcc 4.1.0?
#define X 10
int main()
{
double a = 1e-X;
return 0;
}
I am getting the error: Exponent has no digits.
When I replace X with 10, it works fine. I also checked with the g++ -E command to see the file after preprocessing, and it has not replaced X with 10.
I was under the impression that the preprocessor replaces every macro defined in the file with its replacement text without applying any intelligence. Am I wrong?
I know this is a really silly question but I am confused and I would rather be silly than confused :).
Any comments/suggestions?
The preprocessor is not a text processor; it works on the level of tokens. In your code, after the define, every occurrence of the token X would be replaced by the token 10. However, there is no token X in the rest of your code.
1e-X is syntactically invalid and cannot be turned into a token, which is basically what the error is telling you (it says that to make it a valid token -- in this case a floating point literal -- you have to provide a valid exponent).
When you write 1e-X all together like that, the X isn't a separate symbol for the preprocessor to replace - there needs to be whitespace (or certain other symbols) on either side. Think about it a little and you'll realize why.. :)
Edit: "12-X" is valid because it gets parsed as "12", "-", "X" which are three separate tokens. "1e-X" can't be split like that because "1e-" doesn't form a valid token by itself, as Jonathan mentioned in his answer.
As for the solution to your problem, you can use token-concatenation:
#define E(X) 1e-##X
int main()
{
double a = E(10); // expands to 1e-10
return 0;
}
Several people have said that 1e-X is lexed as a single token, which is partially correct. To explain:
There are two classes of tokens during translation: preprocessing tokens and tokens. A source file is initially decomposed into preprocessing tokens; these tokens are then used in all of the preprocessing tasks, including macro replacement. After preprocessing, each preprocessing token is converted into a token; these resulting tokens are used during actual compilation.
There are fewer types of preprocessing tokens than there are types of tokens. For example, keywords (e.g. for, while, if) are not significant during the preprocessing phases, so there is no keyword preprocessing token. Keywords are simply lexed as identifiers. When the conversion from preprocessing tokens to tokens takes place, each identifier preprocessing token is inspected; if it matches a keyword, it is converted into a keyword token; otherwise it is converted into an identifier token.
There is only one type of numeric token during preprocessing: preprocessing number. This type of preprocessing token corresponds to two different types of tokens: integer literal and floating literal.
The preprocessing number preprocessing token is defined very broadly. Effectively it matches any sequence of characters that begins with a digit, or with a decimal point followed by a digit, and continues with any number of digits, nondigits (e.g. letters), decimal points, and e+ and e-. So, all of the following are valid preprocessing number preprocessing tokens:
1.0e-10
.78
42
1e-X
1helloworld
The first two can be converted into floating literals; the third can be converted into an integer literal. The last two are not valid integer literals or floating literals; those preprocessing tokens cannot be converted into tokens. This is why you can preprocess the source without error but cannot compile it: the error occurs in the conversion from preprocessing tokens to tokens.
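A small way to convince yourself that the failure belongs to the conversion step and not to preprocessing: an odd pp-number is harmless as long as it never has to become a real token. The sketch below assumes nothing beyond the standard preprocessor; STR, DISCARD and spelled are illustrative names only.

```cpp
#include <cstring>

#define STR(x) #x      // turns the argument into a string literal
#define DISCARD(x)     // expands to nothing, so x never becomes a token

DISCARD(1helloworld)   // accepted: the pp-number is thrown away during preprocessing

// Accepted: the pp-number ends up inside a string literal.
inline const char* spelled() { return STR(1helloworld); }

// double bad = 1e-X;  // rejected: 1e-X cannot be converted to a floating literal
```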
GCC 4.5.0 doesn't change the X either.
The answer is going to lie in how the preprocessor interprets preprocessing tokens - and in the 'maximal munch' rule. The 'maximal munch' rule is what dictates that 'x+++++y' is treated as 'x ++ ++ + y' and hence is erroneous, rather than as 'x ++ + ++ y' which is legitimate.
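For what it's worth, the x+++++y case can be tried directly. A minimal sketch (munch_demo is just a name for the illustration), with the erroneous spelling left commented out:

```cpp
inline int munch_demo() {
    int x = 1, y = 2;
    // int bad = x+++++y;  // lexed as x ++ ++ + y, which does not compile
    int ok = x++ + ++y;    // the whitespace forces the tokens x ++ + ++ y
    return ok;             // 1 + 3
}
```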
The question is why the preprocessor interprets '1e-X' as a single preprocessing token. Clearly, it will treat '1e-10' as a single token. There is no valid interpretation for '1e-' once it passes the preprocessor unless it is followed by a digit. So, I have to guess that the preprocessor sees '1e-X' as a single token (albeit an erroneous one). I have not dissected the relevant clauses in the standard to see where this is required, but the definition of a 'preprocessing number' or 'pp-number' in the standard (see below) is somewhat different from the definition of a valid integer or floating point constant, and it allows many 'pp-numbers' that are not valid as an integer or floating point constant.
If it helps, the output of the Sun C Compiler for 'cc -E -v soq.c' is:
# 1 "soq.c"
# 2
int main()
{
"soq.c", line 4: invalid input token: 1e-X
double a = 1e-X ;
return 0;
}
#ident "acomp: Sun C 5.9 SunOS_sparc Patch 124867-09 2008/11/25"
cc: acomp failed for soq.c
So, at least one C compiler rejects the code in the preprocessor - it might be that the GCC preprocessor is a little slack (I tried to provoke it into complaining with gcc -Wall -pedantic -std=c89 -Wextra -E soq.c but it did not utter a squeak). And using 3 X's in both the macro and the '1e-XXX' notation showed that all three X's were consumed by both GCC and Sun C Compiler.
C Standard Definition of Preprocessing Number
From the C Standard - ISO/IEC 9899:1999 §6.4.8 Preprocessing Numbers:
pp-number:
digit
. digit
pp-number digit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
Given this, '1e-X' is a valid 'pp-number', and therefore the X is not a separate token (nor is the 'XXX' in '1e-XXX' a separate token). Therefore, the preprocessor cannot expand the X; it isn't a separate token subject to expansion.
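To make the grammar a bit more concrete, here is a rough sketch of a matcher that returns how many leading characters of a string the pp-number rule would swallow. It is my own illustration, not code from any compiler; pp_number_prefix is a made-up name, and it assumes plain ASCII input (universal character names are ignored).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Length of the longest prefix of s that forms a pp-number (0 if there is none).
// Assumption: ASCII only; universal character names are not handled.
inline std::size_t pp_number_prefix(const std::string& s) {
    auto is_digit = [&](std::size_t k) {
        return k < s.size() && std::isdigit(static_cast<unsigned char>(s[k])) != 0;
    };

    std::size_t i;
    if (is_digit(0)) {
        i = 1;                                   // digit
    } else if (!s.empty() && s[0] == '.' && is_digit(1)) {
        i = 2;                                   // . digit
    } else {
        return 0;                                // not a pp-number at all
    }

    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        const char prev = s[i - 1];
        const bool after_exp = prev == 'e' || prev == 'E' || prev == 'p' || prev == 'P';
        if (std::isalnum(c) || c == '_' || c == '.') {
            ++i;                                 // digit, identifier-nondigit, or .
        } else if ((c == '+' || c == '-') && after_exp) {
            ++i;                                 // sign directly after e, E, p or P
        } else {
            break;                               // anything else ends the pp-number
        }
    }
    return i;
}
```

Under these assumptions pp_number_prefix("1e-X") consumes all 4 characters, X included, while pp_number_prefix("12-X") stops after 12 and leaves - and X as separate tokens.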