Why is the C/C++ preprocessor adding a space here? - c++

I have a tiny problem with the preprocessor that puzzles me, and I cannot find any explanation for it in the documentation/preprocessor/language spec.
#define booboo() aaa
booboo()bbb
booboo().bbb
is preprocessed into:
aaa bbb <--- why is a space added here?
aaa.bbb
After handling trigraphs, line continuations and comments, the preprocessor works on preprocessing directives and divides the input into preprocessing tokens and whitespace. booboo's replacement list comprises one pp-token, the identifier 'aaa'. booboo()bbb is divided into the pp-tokens 'booboo', '(', ')', 'bbb'. The sequence 'booboo', '(', ')' is recognised as a function-like macro invocation and should be expanded to 'aaa', and IMHO the output should look like 'aaabbb'. I say look like since, to a human, it would look like one token, whereas the compiler would get 2 tokens 'aaa' and 'bbb', since no '##' operator (which allows pp-token concatenation) was used. Why/what rule makes cpp (the C preprocessor) place an additional space between 'aaa' and 'bbb', when 'booboo().bbb' results in 'aaa.bbb' without a space?
Is this because cpp tries to make its output (which is mostly for humans) unambiguous? A human is not able to tell that 'aaabbb' is composed of 2 tokens, as they see only the tokens' spellings. Am I right? I've read the C99 documentation about the preprocessor and gcc's documentation for cpp, and I see nothing about it.
If I am right, we have a similar situation here:
#define baba() +
baba()+
baba()-
results in:
+ +
+-
Otherwise (if '++' were the output) it would look to a human like a '++' token, but there would be 2 tokens '+' and '+'. Is it like with the '##' operator, where cpp checks whether concatenation produces a valid token, except that in the shown cases it wants to prevent the human from thinking that concatenation was performed? '+-' is not ambiguous, hence no space is added.
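For contrast, here is a minimal sketch of what actual concatenation looks like (CAT and the variable names are made up purely for illustration); only '##' glues two pp-tokens into one:
#define CAT(a, b) a##b
#define booboo() aaa
int aaabbb = 0;           // a single identifier token
int x = CAT(aaa, bbb);    // '##' pastes the two tokens: expands to aaabbb
// int y = booboo()bbb;   // without '##' this stays two tokens, 'aaa' and 'bbb',
                          // and does not refer to the variable aaabbb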

The result of preprocessing is to transform the source file into a list of tokens. In your case, after tokenization, the list of tokens would look like this:
....
booboo()
bbb
....
and then after macro replacement:
....
aaa
bbb
....
Then the compiler translates the list of tokens into an executable.
The whitespace you are seeing is just an implementation detail of how your compiler (or preprocessor) has chosen to lay out the preprocessing tokens when displaying an intermediate result to you. The standards say nothing about any intermediate preprocessing files. It is not required that there be a separate program to do preprocessing, either.
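One way to convince yourself that the compiler really receives two separate tokens, however the textual output is laid out, is a rough sketch along these lines (it reuses the names from the question; the typedef is only there to give 'aaa' a meaning):
#define booboo() aaa
typedef int aaa;
booboo()bbb;   // expands to the two tokens 'aaa' 'bbb', i.e. 'aaa bbb;',
               // which declares a variable bbb of type aaa (int)
// If the expansion produced the single identifier 'aaabbb', the line above
// would not compile, because nothing named 'aaabbb' exists.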

I wrote an ANSI C compiler myself in the early 90's. As far as I remember, a comment token /*...*/ should be replaced by a single white-space. Macros do text replacement, in place. It is not necessary that the tokens which result from the text replacement of such macro expansion(s) be legal C language tokens. When a macro is defined as the text 'aaa', it is just that text 'aaa' that makes its way into the input stream. C's parser may or may not see valid tokens as a result of that!
Hence, given:
#define booboo() aaa
Expanding booboo()bbb should result in text aaabbb
What that aaabbb means is up to the user. But that aaabbb will not be preprocessed again, even if it happens to be the name of a macro. That is for sure. But aaabbb could be a user identifier - no issues there.

Related

What do tokens do, and why do they need to be created in C++ programming?

I am reading a book (Programming Principles and Practice by Bjarne Stroustrup).
In it he introduces tokens:
“A token is a sequence of characters that represents something we consider a unit, such as a number or an operator. That’s the way a C++ compiler deals with its source. Actually, “tokenizing” in some form or another is the way most analysis of text starts.”
class Token {
public:
    char kind;
    double value;
};
I do get what they are, but he never explains this in detail and it's quite confusing to me.
Tokenizing is important to the process of figuring out what a program does. What Bjarne is referring to in relation to C++ source deals with how a program's meaning is affected by the tokenization rules. In particular, we must know what the tokens are, and how they are determined. Specifically, how can we identify a single token when it appears next to other characters, and how should we delimit tokens if there is ambiguity?
For instance, consider the prefix operators ++ and +. Let's assume we only had one token + to work with. What is the meaning of the following snippet?
int i = 1;
++i;
With + only, is the above going to just apply unary + on i twice? Or is it going to increment it once? It's ambiguous, naturally. We need an additional token, and therefore introduce ++ as its own "word" in the language.
But now there is another (albeit smaller) problem. What if the programmer wants to just apply unary + twice, and not increment? Token processing rules are needed. So if we determine that a white space is always a separator for tokens, our programmer may write:
int i = 1;
+ +i;
Roughly speaking, a C++ implementation starts with a file full of characters, transforms them initially to a sequence of tokens ("words" with meaning in the C++ language), and then checks if the tokens appear in a "sentence" that has some valid meaning.
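A minimal sketch of the two readings side by side (assuming an ordinary hosted C++ compiler):
#include <iostream>

int main() {
    int i = 1;
    ++i;      // one '++' token: pre-increment, i becomes 2
    + +i;     // two '+' tokens: unary plus applied twice, i is unchanged
    std::cout << i << '\n';   // prints 2
}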
He's referring to lexical analysis - a necessary piece of every compiler. It is a tool for the compiler to treat a text (as in: a sequence of bytes) in a meaningful way. For example, consider the following line in C++:
double x = (15*3.0); // my variable
When the compiler looks at the text, it first splits the line into a sequence of tokens, which may look like this:
Token {"identifier", "double"}
Token {"space", " "}
Token {"identifier", "x"}
Token {"space", " "}
Token {"operator", "="}
Token {"space", " "}
Token {"separator", "("}
Token {"literal_integer", "15"}
Token {"operator", "*"}
Token {"literal_float", "3.0"}
Token {"separator", ")"}
Token {"separator", ";"}
Token {"space", " "}
Token {"comment", "// my variable"}
Token {"end_of_line"}
It doesn't have to be interpreted like the above (note that in my case both kind and value are strings); it's just an example of how it can be done. You usually do this via some regular expressions.
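A rough sketch of that regex idea in C++ (the token categories and the regular expression here are simplified and made up for illustration; a real lexer handles many more cases):
#include <iostream>
#include <regex>
#include <string>
#include <vector>

struct Tok { std::string kind; std::string text; };

// One capture group per token category; alternatives are tried left to right.
static const std::regex token_re(
    R"((//[^\n]*)|(\d+\.\d+)|(\d+)|([A-Za-z_]\w*)|(\*|=|\+|-|/)|([();])|(\s+))");

std::vector<Tok> tokenize(const std::string& src) {
    static const char* kinds[] = {"comment", "literal_float", "literal_integer",
                                  "identifier", "operator", "separator", "space"};
    std::vector<Tok> out;
    for (auto it = std::sregex_iterator(src.begin(), src.end(), token_re);
         it != std::sregex_iterator(); ++it)
        for (int g = 1; g <= 7; ++g)
            if ((*it)[g].matched) { out.push_back({kinds[g - 1], (*it)[g].str()}); break; }
    return out;
}

int main() {
    for (const auto& t : tokenize("double x = (15*3.0); // my variable"))
        if (t.kind != "space")   // skip whitespace tokens for brevity
            std::cout << "Token {\"" << t.kind << "\", \"" << t.text << "\"}\n";
}
Note that, as in the listing above, keywords such as double simply come out as identifiers at this stage.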
Anyway, tokens are easier for the machine to understand than raw text. The next step for the compiler is to create a so-called abstract syntax tree based on the tokenization, and finally to add meaning to everything.
Also note that unless you are writing a parser, it is unlikely you will ever use the concept.
As mentioned by others, Bjarne is referring to lexical analysis.
In general terms, tokenizing, or creating tokens, is the process of working through input streams and dividing them into blocks, without worrying about whitespace etc., as best described earlier by @StoryTeller.
Or, as Bjarne said, a token "is a sequence of characters that represents something we consider a unit".
The Token type itself is an example of a C++ user-defined type (UDT); like int or char, it can be used to define variables and hold values.
A UDT can have member functions as well as data members. In your code you define two data members, which is very basic:
1) kind, 2) value
class Token {
public:
    char kind;
    double value;
};
Based on it we can initialize or construct its objects.
Token token_kind_one{'+'};
This initializes token_kind_one with its kind (an operator), '+'.
Token token_kind_two{'8',3.14};
and token_kind_two with its kind (an integer/number), '8', and with a value of 3.14.
Let's assume we have an expression of ten characters, 1+2*3(5/4), which translates to ten tokens.
Tokens:
      |-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
Kind  | '8' | '+' | '8' | '*' | '8' | '(' | '8' | '/' | '8' | ')' |
      |-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
Value |  1  |     |  2  |     |  3  |     |  5  |     |  4  |     |
      |-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
The C++ compiler transfers the file's data into a token sequence, skipping all whitespace, to make the input understandable to itself.
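As a rough sketch, the table above could be written out directly with the question's Token class (using this answer's convention that kind '8' stands for any number):
#include <vector>

class Token {
public:
    char kind;
    double value;
};

// 1+2*3(5/4) as ten Token objects; operators and separators carry no value.
std::vector<Token> tokens = {
    {'8', 1}, {'+', 0}, {'8', 2}, {'*', 0}, {'8', 3},
    {'(', 0}, {'8', 5}, {'/', 0}, {'8', 4}, {')', 0},
};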
Broadly speaking, a compiler will run multiple operations on a given source code before converting it into a binary format. One of the first stages is running a tokenizer, where the contents of a source file are converted to tokens, which are units understood by the compiler. For example, if you write a statement int a, the tokenizer might create a structure to store this information.
Type: integer
Identifier: A
Reserved Word: No
Line number: 10
This would then be referred to as a token, and most of the code in a source file will be broken down into similar structures.
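As a sketch only (the field names and types here are invented for illustration), such a structure might look like:
#include <string>

// Hypothetical record a tokenizer might fill in for the declaration `int a`.
struct TokenInfo {
    std::string type;        // "integer"
    std::string identifier;  // "a"
    bool        reserved;    // false: 'a' is not a reserved word
    int         line;        // 10
};

TokenInfo t{"integer", "a", false, 10};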

Why do preprocessor commands have to start as first nonwhite space

I am trying to do an #ifndef partway through a setter line, and I received this error:
"Error 20 error C2014: preprocessor command must start as first nonwhite space"
I am aware of what the error means; I am just curious why it is like that. Is it a compiler choice? What is the reasoning behind this? That it is easier for the user to notice?
Here is the code if someone is wondering:
inline void SetSomething(int value) { #ifndef XBOX ASSERT(value <= 1); #endif test = value; };
At first, C did not have any standard preprocessor. Then people started using preprocessing as an external tool. You might note that the # is the same as the comment character generally used in Unix-land shell scripts.
As the language evolved, the preprocessor became integrated with the compiler and more a part of the language proper, but it kept its totally different structure: in particular, it is line-oriented, while the C and C++ core languages are free-form.
Since then the lines have blurred a bit more. Preprocessing now typically adds #line directives or the equivalent for use by the core language compiler; #pragma directives are likewise for the core language compiler, and in the other direction we now have _Pragma (IIRC). Still, the structure is mostly as it was originally. C and C++ are evolved languages, not designed languages.
Taking a look into the standard (section 16, "Preprocessing directives"), starting with # as the first non-whitespace character is what makes a preprocessing directive, by definition.
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either
the first character in the source file (optionally after white space containing no new-line characters) or that
follows white space containing at least one new-line character.
If you want the most important reason, it's because the standard says so.
If you want to know why the standard says so, it's the easiest way to get the necessary functionality.
Remember that preprocessing and compiling are two potentially completely separate tasks, and the preprocessor has no idea at all about the language of its output.
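As a practical aside, here is a sketch of how the setter from the question can be written legally; each directive simply has to start on its own line (XBOX, ASSERT and test are the question's own names, assumed to be defined elsewhere):
inline void SetSomething(int value)
{
#ifndef XBOX
    ASSERT(value <= 1);   // only compiled when XBOX is not defined
#endif
    test = value;
}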

C++ language symbol separator

I need to parse some C++ files to get some information out of them. One use case is: I have an enum value "ID_XYZ", and I want to find out how many times it appears in a source file. So my question is: what are the separators dividing symbols in C++?
You can't really tokenize C or C++ source code based purely on separator characters -- you pretty much need to read in a character at a time, and figure out whether that character can be part of the current token or not.
Just for a couple of examples: when you see a C-style begin-comment token, you need to look at characters until you encounter a close-comment token. Likewise for strings and pre-processor directives (e.g., #if 0 .... #endif sequences). To do it truly correctly, you also need to deal correctly with trigraphs. For example, consider something like this:
// Why doesn't this work??/
ID_XYZ = 1;
If the lexer doesn't handle trigraphs correctly, it will probably identify this as an instance of your ID_XYZ -- but in reality, it's not -- the ??/ at the end of the previous line is really a trigraph that resolves to \, which means the "single-line" comment actually extends to the end of the next line, and the apparent instance of ID_XYZ is really part of the comment.
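Here is a rough sketch of that character-at-a-time approach (ignoring trigraphs, raw strings, line continuations and many other details) that counts occurrences of an identifier such as ID_XYZ while skipping comments and string literals:
#include <cctype>
#include <iostream>
#include <string>

// Count whole-word occurrences of `name` outside comments and string/char literals.
// Deliberately simplified: no trigraphs, no raw strings, no line continuations.
int count_identifier(const std::string& src, const std::string& name) {
    int count = 0;
    for (std::size_t i = 0; i < src.size(); ) {
        if (src.compare(i, 2, "//") == 0) {                  // line comment
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (src.compare(i, 2, "/*") == 0) {           // block comment
            i += 2;
            while (i < src.size() && src.compare(i, 2, "*/") != 0) ++i;
            i += 2;
        } else if (src[i] == '"' || src[i] == '\'') {        // string / char literal
            char quote = src[i++];
            while (i < src.size() && src[i] != quote) {
                if (src[i] == '\\') ++i;                     // skip escaped character
                ++i;
            }
            ++i;
        } else if (std::isalpha((unsigned char)src[i]) || src[i] == '_') {
            std::size_t start = i;                           // scan a whole identifier
            while (i < src.size() &&
                   (std::isalnum((unsigned char)src[i]) || src[i] == '_')) ++i;
            if (src.compare(start, i - start, name) == 0) ++count;
        } else {
            ++i;
        }
    }
    return count;
}

int main() {
    std::string code = "enum E { ID_XYZ }; /* ID_XYZ */ int x = ID_XYZ; // ID_XYZ\n";
    std::cout << count_identifier(code, "ID_XYZ") << '\n';   // prints 2
}
As the answer above stresses, doing this truly correctly needs much more care than this sketch shows.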

In the C++ standard, where does it indicate the spacing protocol for the replacement of category descriptives by the source code it represents?

At the risk of asking a question deemed too nit-picky, I have spent a long time trying to justify (as a single example of something that occurs throughout the standard in different contexts) the following definition of an integer literal in §2.14.2 of the C++11 standard, specifically in regards to one detail, the presence of whitespace in the syntax notation itself.
(Note that this example - the definition of an integer literal - is not the point of my question. The point of my question is to ask about the syntax description notation used by the C++ standard itself, specifically in regards to whitespace between grammatical category names. The example I give here - the definition of an integer literal - is specifically chosen only because it acts as an example that is simple and clear-cut.)
(Abbreviated for concision, from §2.14.2):
integer-literal:
decimal-literal integer-suffix_opt
decimal-literal:
nonzero-digit
decimal-literal digit
(with nonzero-digit and digit as expected, [0] 1 ... 9). (Note: The above text is all in italics in the standard.)
This all makes sense to me, assuming that the SPACE between the syntax category descriptives decimal-literal and digit is understood to NOT be present in the actual source code, but is only present in the syntax description itself as it appears here in section §2.14.2.
This convention - placing a space between category descriptives within the notation, where it is understood that the space is not to be present in the source code - is used in other places in the specification. The example here is just a clear-cut case where the space is clearly not supposed to be present in the source code. (See addendum to this question for counterexamples from the standard where whitespace or other separator/s must be present, or is optional, between category descriptives when those category descriptives are replaced by actual tokens in the source code.)
Again, at the risk of being nit-picky, I cannot find anywhere in the standard a statement of convention that spaces are NOT to be present in the source code when interpreting notation such as in this example.
The standard does discuss notational convention in §1.6.1 (and thereafter). The only relevant text that I can find regarding this is:
In the syntax notation used in this International Standard, syntactic
categories are indicated by italic type, and literal words and
characters in constant width type. Alternatives are listed on separate
lines except in a few cases where a long set of alternatives is marked
by the phrase “one of.”
I would not be so nit-picky; however, I find the notation used within the standard to be somewhat tricky, so I would like to be clear on all of the details. I appreciate anyone willing to take the time to fill me in on this.
ADDENDUM In response to comments in which a claim is made similar to "it's obvious that whitespace should not be included in the final source code, so there's no need for the standard to explicitly state this": I have chosen a trivial example in this question, where it is obvious. There are many cases in the standard where it isn't obvious without a priori knowledge of the language (in my opinion), such as §8.0.4 discussing "const" and "volatile":
cv-qualifier-seq:
cv-qualifier cv-qualifier-seq_opt
... Note the opposite assumption here (whitespace, or another separator or separators, is required in the final source code), but that's not possible to deduce from the syntax notation itself.
There are also cases where a space is optional, such as:
noptr-abstract-declarator:
noptr-abstract-declarator_opt parameters-and-qualifiers
(In this example, to make a point, I won't give the section number or paraphrase what is being discussed; I'll just ask if it's obvious from the grammar notation itself that, in this context, whitespace in the final source code is optional between the tokens.)
I suspect that the comments along these lines - "it's obvious, so that's what it must be" - are the result of the fact that the example I've chosen is so obvious. That's exactly why I chose the example.
§2.7.1
There are five kinds of tokens: identifiers, keywords, literals,
operators, and other separators. Blanks, horizontal and vertical tabs,
newlines, formfeeds, and comments (collectively, “white space”), as
described below, are ignored except as they serve to separate tokens.
So, if a literal is a token, and whitespace serves to separate tokens, a space between the digits of a literal would produce two separate tokens, and therefore cannot be part of the same literal.
I'm reasonably certain there is no more direct explanation of this fact in the standard.
The notation used is similar enough to typical BNF that they take many of the same general conventions for granted, including the fact that whitespace in the notation has no significance beyond separating the tokens of the BNF itself -- and that if/when whitespace has significance in the source code beyond separating tokens, they'll include notation to specify it directly. For example, for most preprocessing directives, the new-line is specified directly:
# ifdef identifier new-line group_opt
or:
# include <h-char-sequence> new-line
The blame for that probably goes back to the Algol 68 standard, which went so far overboard in its attempts at precisely specifying syntax that it was essentially impossible for anybody to read without weeks of full-time study1. Since then, any more than the most cursory explanation of the syntax description language leads to rejection on the basis that it's too much like Algol 68 and will undoubtedly fail because it's too formal and nobody will ever read or understand it.
1 How could it be that bad you ask? It basically went like this: they started with a formal English description of a syntax description language. That wasn't used to define Algol 68 though -- it was used to specify (even more precisely) another syntax description language. That second syntax description language was then used to specify the syntax of Algol 68 itself. So, you had to learn two separate syntax description languages before you could start to read the Algol 68 syntax itself at all. As you can undoubtedly guess, almost nobody ever did.
As you say, the standard says:
literal words and characters in constant width type
So, if a literal space were to be included in a rule, it would have to be rendered in a constant width type. Close examination of the standard will reveal that the space in the production you refer to is narrower than the constant width type. (Also your attempt to quote the standard is a misrepresentation because it renders in constant-width type that which should be rendered in italics, with a consequent semantic change.)
Ok, that was the "aspiring language lawyer" answer; furthermore, it doesn't really work because it fails on all the productions which are of the form:
One of:
0 1 2 3 4 5 6 7 8 9
I think, in reality, the answer is that whitespace is not part of the formal grammar, because it serves only to separate tokens; furthermore, that statement is mostly true of the grammar itself, whose tokens are separated by whitespace without that whitespace being a token, except that indentation in the grammar matters, unlike indentation in a program.
Addendum to answer the addendum
It's not actually true that const and volatile need to be separated by whitespace. They simply need to be separate tokens. Example:
#define A(x)x
A(const)A(volatile)A(int)A(x)A(;)
Again, more seriously, Chapter 2 (with particular reference to 2.2 and 2.5, but you have to read the entire text) describes how the program text is processed in order to produce a stream of tokens. All of the rules in which you claim whitespace must be ignored are in this part of the grammar, and all of the rules in which you claim whitespace might be required are not.
These are really two separate grammars, but the lexical grammar is necessarily incomplete because you need to consider the operation of the preprocessor in order to apply it.
I believe that everything I said can be gleaned from the standard. Here are some excerpts:
2.2(3) The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments)… The process of dividing a source file’s characters into preprocessing tokens is context-dependent.
…
2.2(7) White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
I think that all this makes it clear that there are two grammars, one lexical -- that is, it produces a lexeme (token) from a sequence of graphemes (characters) -- and the other syntactic -- that is, it produces an abstract syntax tree from a sequence of lexemes (tokens). In neither case (with a small exception, which I'll get to in a minute) is whitespace considered anything other than something which stops two lexemes from running into each other if the lexical grammar would otherwise allow that. (See the algorithm in 2.5(3).)
C++ is not syntactically pretty, so there are almost always exceptions. One of these, inherited from C, is the difference between:
#define A(X)(X)
and
#define A (X)(X)
Preprocessing directives have their own parsing rules, and this one is typified by the definition:
lparen:
  a ( character not immediately preceded by white-space
This, I would say, is the exception that proves the rule [Note 1]. The fact that it is necessary to say that this ( is not preceded by white-space shows that the normal use of the token ( in a syntactic rule does not say anything about its blancospatial context.
So, to paraphrase Ray Cummings (not Albert Einstein, as is sometimes claimed), "time and white-space are all that separate one token from another." [Note 2]
[Note 1] I use the phrase here in its original legal sense, as per Cicero.
[Note 2]:
"Time," said George, "why I can give you a definition of time. It's what keeps everything from happening at once."
A ripple of laughter went about the little group of men.
"Quite so," agreed the Chemist. "And, gentlemen, that's not so funny as it sounds. As a matter of fact, it is really not a bad scientific definition. Time and space are all that separate one event from another…
-- From The man who mastered time, by Ray Cummings, 1929, Ace Books. See first page, in Google books
The Standard actually has two separate grammars.
The preprocessor grammar, described in sections 2 and 16, defines how a sequence of source characters is converted to a sequence of preprocessing tokens and whitespace characters, in translation phases 1-6. In some of these phases and parts of this grammar, whitespace is significant.
Whitespace characters which are not part of preprocessing tokens stop being significant after translation phase 4. The Standard explicitly says at the start of translation phase 7 to discard whitespace characters between preprocessing tokens.
The language grammar defines how a sequence of tokens (converted from preprocessing tokens) is syntactically and semantically interpreted in translation phase 7. There is no such thing as whitespace in this grammar. (By this point, ' ' is a character-literal just like 'c' is.)
In both grammars, the space between grammar components visible in the Standard has nothing to do with source or execution whitespace characters, it's just there to make the Standard legible. When the preprocessor grammar depends on whitespace, it spells it out with words, for example:
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
and
control-line:
...
# define identifier lparen identifier-list_opt ) replacement-list new-line
...
lparen:
a ( character not immediately preceded by white-space
So there may not be whitespace between digits of an integer-literal because the preprocessor grammar does not allow it.
One other important rule here is from C++11 2.5p3:
If the input stream has been parsed into preprocessing tokens up to a given character:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. ...
Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
So there must be whitespace between const and volatile tokens because otherwise, the longest-token-possible rule would convert that to a single identifier token constvolatile.
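A small illustration of that longest-token rule (maximal munch) in ordinary code:
int main() {
    const volatile int a = 0;    // two keyword tokens: 'const' and 'volatile'
    // constvolatile int b = 0;  // would be ONE identifier token 'constvolatile',
                                 // an unknown name, so it does not compile
    int x = 0, y = 0;
    // int z = x+++++y;          // lexes as x ++ ++ + y, which is ill-formed
    int z = x++ + ++y;           // whitespace forces the intended tokenization
    (void)a; (void)z;
}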

Why is this C or C++ macro not expanded by the preprocessor?

Can someone point out the problem in the code, compiled with gcc 4.1.0?
#define X 10
int main()
{
    double a = 1e-X;
    return 0;
}
I am getting the error: "Exponent has no digits".
When I replace X with 10, it works fine. Also, I checked with the g++ -E command to see the file with the preprocessor applied; it has not replaced X with 10.
I was under the impression that the preprocessor replaces every macro defined in the file with the replacement text, without applying any intelligence. Am I wrong?
I know this is a really silly question but I am confused and I would rather be silly than confused :).
Any comments/suggestions?
The preprocessor is not a text processor; it works on the level of tokens. In your code, after the define, every occurrence of the token X would be replaced by the token 10. However, there is no token X in the rest of your code.
1e-X is syntactically invalid and cannot be turned into a token, which is basically what the error is telling you (it says that to make it a valid token -- in this case a floating point literal -- you have to provide a valid exponent).
When you write 1e-X all together like that, the X isn't a separate symbol for the preprocessor to replace - there needs to be whitespace (or certain other symbols) on either side. Think about it a little and you'll realize why.. :)
Edit: "12-X" is valid because it gets parsed as "12", "-", "X" which are three separate tokens. "1e-X" can't be split like that because "1e-" doesn't form a valid token by itself, as Jonathan mentioned in his answer.
As for the solution to your problem, you can use token-concatenation:
#define E(X) 1e-##X
int main()
{
    double a = E(10); // expands to 1e-10
    return 0;
}
Several people have said that 1e-X is lexed as a single token, which is partially correct. To explain:
There are two classes of tokens during translation: preprocessing tokens and tokens. A source file is initially decomposed into preprocessing tokens; these tokens are then used in all of the preprocessing tasks, including macro replacement. After preprocessing, each preprocessing token is converted into a token; these resulting tokens are used during actual compilation.
There are fewer types of preprocessing tokens than there are types of tokens. For example, keywords (e.g. for, while, if) are not significant during the preprocessing phases, so there is no keyword preprocessing token. Keywords are simply lexed as identifiers. When the conversion from preprocessing tokens to tokens takes place, each identifier preprocessing token is inspected; if it matches a keyword, it is converted into a keyword token; otherwise it is converted into an identifier token.
There is only one type of numeric token during preprocessing: preprocessing number. This type of preprocessing token corresponds to two different types of tokens: integer literal and floating literal.
The preprocessing number preprocessing token is defined very broadly. Effectively it matches any sequence of characters that begins with a digit or a decimal point followed by any number of digits, nondigits (e.g. letters), and e+ and e-. So, all of the following are valid preprocessing number preprocessing tokens:
1.0e-10
.78
42
1e-X
1helloworld
The first two can be converted into floating literals; the third can be converted into an integer literal. The last two are not valid integer literals or floating literals; those preprocessing tokens cannot be converted into tokens. This is why you can preprocess the source without error but cannot compile it: the error occurs in the conversion from preprocessing tokens to tokens.
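A short sketch of that distinction (STR and XSTR are made-up helper macros): a pp-number that is only ever stringized is harmless, while one that must become a real token fails:
#define X 10
#define STR(s) #s
#define XSTR(s) STR(s)

const char *spelled = XSTR(1e-X);  // "1e-X": X is not expanded, because 1e-X is
                                   // one pp-number, not the identifier token X
const char *alone = XSTR(X);       // "10": here X really is a token of its own
// double bad = 1e-X;              // error: this pp-number cannot become a token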
GCC 4.5.0 doesn't change the X either.
The answer is going to lie in how the preprocessor interprets preprocessing tokens - and in the 'maximal munch' rule. The 'maximal munch' rule is what dictates that 'x+++++y' is treated as 'x ++ ++ + y' and hence is erroneous, rather than as 'x ++ + ++ y' which is legitimate.
The issue is why the preprocessor interprets '1e-X' as a single preprocessing token. Clearly, it will treat '1e-10' as a single token. There is no valid interpretation for '1e-' unless it is followed by a digit once it passes the preprocessor. So, I have to guess that the preprocessor sees '1e-X' as a single (actually erroneous) token. But I have not dissected the relevant clauses in the standard to see where it is required. However, the definition of a 'preprocessing number' or 'pp-number' in the standard (see below) is somewhat different from the definition of a valid integer or floating-point constant, and allows many 'pp-numbers' that are not valid as an integer or floating-point constant.
If it helps, the output of the Sun C Compiler for 'cc -E -v soq.c' is:
# 1 "soq.c"
# 2
int main()
{
"soq.c", line 4: invalid input token: 1e-X
double a = 1e-X ;
return 0;
}
#ident "acomp: Sun C 5.9 SunOS_sparc Patch 124867-09 2008/11/25"
cc: acomp failed for soq.c
So, at least one C compiler rejects the code in the preprocessor - it might be that the GCC preprocessor is a little slack (I tried to provoke it into complaining with gcc -Wall -pedantic -std=c89 -Wextra -E soq.c but it did not utter a squeak). And using 3 X's in both the macro and the '1e-XXX' notation showed that all three X's were consumed by both GCC and Sun C Compiler.
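For reference, a sketch of that three-X experiment:
#define XXX 10

double a = 1e-XXX;   // XXX is not expanded: all of 1e-XXX is one pp-number,
                     // so this still fails with the same "exponent" error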
C Standard Definition of Preprocessing Number
From the C Standard - ISO/IEC 9899:1999 §6.4.8 Preprocessing Numbers:
pp-number:
digit
. digit
pp-number digit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
Given this, '1e-X' is a valid 'pp-number', and therefore the X is not a separate token (nor is the 'XXX' in '1e-XXX' a separate token). Therefore, the preprocessor cannot expand the X; it isn't a separate token subject to expansion.