C11 Compilation. Phases of translation #1 and #5. Universal character names - c++

I'm trying to understand Universal Character Names in the C11 standard and found that the N1570 draft of the C11 standard has much less detail than the C++11 standard with respect to Translation Phases 1 and 5 and the formation and handling of UCNs within them. This is what each has to say:
Translation Phase 1
N1570 Draft C11 5.1.1.2p1.1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
C++11 2.2p1.1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
Translation Phase 5
N1570 Draft C11 5.1.1.2p1.5:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; [...]
C++ 2.2p1.5:
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [...]
(emphasis was added on differences)
The Questions
In the C++11 standard, it is very clear that source file characters not in the basic source character set are converted to UCNs, and that they are treated exactly as a UCN written in that same place would have been, with the sole exception of raw strings. Is the same true of C11? When a C11 compiler sees a multi-byte UTF-8 character such as °, does it too translate this to \u00b0 in phase 1, and treat it just as if \u00b0 had appeared there instead?
To put it in a different way, at the end of which translation phase, if any, are the following snippets of code transformed into textually equivalent forms for the first time in C11?
const char* hell° = "hell°";
and
const char* hell\u00b0 = "hell\u00b0";
If in 2., the answer is "in none", then during which translation phase are those two identifiers first understood to refer to the same thing, despite being textually different?
In C11, are UCNs in character/string literals also converted in phase 5? If so, why omit this from the draft standard?
How are UCNs in identifiers (as opposed to in character/string literals as already mentioned) handled in both C11 and C++11? Are they also converted in phase 5? Or is this something implementation-defined? Does GCC for instance print out such identifiers in UCN-coded form, or in actual UTF-8?

Comments turned into an answer
Interesting question!
The C standard can leave more of the conversions unstated because they are implementation-defined (and C has no raw strings to confuse the issue).
What it says in the C standard is sufficient — except that it leaves your question 1 unanswerable.
Q2 has to be 'Phase 5', I think, with caveats about it being 'the token stream is equivalent'.
Q3 is strictly N/A, but Phase 7 is probably the answer.
Q4 is 'yes', and it says so because it mentions 'escape sequences' and UCNs are escape sequences.
Q5 is 'Phase 5' too.
Can the C++11-mandated processes in Phase 1 and 5 be taken as compliant within the wording of C11 (putting aside raw strings)?
I think they are effectively the same; the difference arises primarily from the raw literal issue, which is specific to C++. Generally, the C and C++ standards try not to make things gratuitously different, and in particular try to keep the workings of the preprocessor and the low-level character parsing the same in both (which has been easier since C99 added support for C++ // comments, but which evidently got harder with the addition of raw literals to C++11).
One day, I'll have to study the raw literal notations and their implications more thoroughly.

First, please note that this distinction has existed since 1998; UCNs were first introduced in C++98, a new standard at the time (ISO/IEC 14882, 1st edition, 1998), and then made their way into the C99 revision of the C standard; but the C committee (and existing implementers, and their pre-existing implementations) did not feel the C++ way was the only way to achieve the trick, particularly with corner cases and with the use of character sets smaller than Unicode, or just different; for example, the requirement to ship mapping tables from whatever encodings are supported to Unicode was a preoccupation for C vendors in 1998.
The C standard (consciously) avoids deciding this and lets the compiler choose how to proceed. While your reasoning obviously takes place in the context of UTF-8 character sets used for both source and execution, there is a large (and pre-existing) range of C99/C11 compilers available which use different sets, and the committee felt it should not restrict implementers too much on this issue. In my experience, most compilers keep the two forms distinct in practice (for performance reasons).
Because of this freedom, some compilers can make them identical after phase 1 (as a C++ compiler shall), while others can leave them distinct as late as phase 7 for the first degree sign (the one in the identifier); the second degree sign (the one in the string literal) ought to be the same after phase 5, assuming the degree character is part of the extended execution character set supported by the implementation.
For the other answers, I won't add anything to Jonathan's.
About your additional question, whether the more deterministic C++ process can be taken as Standard-C-compliant: it is clearly a goal for it to be so, and if you find a corner case which shows otherwise (a C++11-compliant preprocessor which would not conform to the C99 and C11 standards), then you should consider asking the WG14 committee about a potential defect.
Obviously, the reverse is not true: it is possible to write a preprocessor whose handling of UCNs complies with C99/C11 but not with the C++ standards; the most obvious difference is with
#define str(t) #t
#define str_is(x, y) const char * x = y " is " str(y)
str_is(hell°, "hell°");
str_is(hell\u00B0, "hell\u00B0");
which a C-compliant preprocessor can render in the same way as your examples (and most do so), and which will therefore have distinct renderings; but I am under the impression that a C++-compliant preprocessor is required to transform them into the (strictly equivalent)
const char* hell° = "hell°" " is " "\"hell\\u00b0\"";
const char* hell\u00b0 = "hell\\u00b0" " is " "\"hell\\u00b0\"";
Last, but not least, I believe not many compilers are fully compliant at this level of detail!
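For what it's worth, here is a quick check of my own (written as C++, and assuming a UTF-8 source file and a UTF-8 execution character set; with a C11 compiler, whether it must hold is precisely the implementation-defined question discussed above):
#include <cassert>
#include <cstring>

int main() {
    // The UCN spelling and the directly typed character are expected to
    // produce byte-identical string literals under the stated assumptions.
    const char* a = "hell\u00b0";
    const char* b = "hell°";    // degree sign typed directly in the source
    assert(std::strcmp(a, b) == 0);
}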

Related

C++: Is there a standard definition for end-of-line in a multi-line string constant?

If I have a multi-line C++11 string constant such as
R"""line 1
line 2
line3"""
Is it defined what character(s) the line terminator/separator consists of?
The intent is that a newline in a raw string literal maps to a single
'\n' character. This intent is not expressed as clearly as it
should be, which has led to some confusion.
Citations are to the 2011 ISO C++ standard.
First, here's the evidence that it maps to a single '\n' character.
A note in section 2.14.5 [lex.string] paragraph 4 says:
[ Note: A source-file new-line in a raw string literal results in a
new-line in the resulting execution string-literal. Assuming no
whitespace at the beginning of lines in the following example, the
assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note ]
This clearly states that a newline is mapped to a single '\n'
character. It also matches the observed behavior of g++ 6.2.0 and
clang++ 3.8.1 (tests done on a Linux system using source files with
Unix-style and Windows-style line endings).
Given the clearly stated intent in the note and the behavior of two
popular compilers, I'd say it's safe to rely on this -- though it
would be interesting to see how other compilers actually handle this.
However, a literal reading of the normative wording of the
standard could easily lead to a different conclusion, or at least
to some uncertainty.
Section 2.5 [lex.pptoken] paragraph 3 says (emphasis added):
Between the initial and final double quote characters of the
raw string, any transformations performed in phases 1 and 2
(trigraphs, universal-character-names, and line splicing)
are reverted; this reversion shall apply before any d-char,
r-char, or delimiting parenthesis is identified.
The phases of translation are specified in 2.2 [lex.phases]. In phase 1:
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary.
If we assume that the mapping of physical source file characters to the
basic character set and the introduction of new-line characters are
"tranformations", we might reasonably conclude that, for example,
a newline in the middle of a raw string literal in a Windows-format
source file should be equivalent to a \r\n sequence. (I can imagine
that being useful for Windows-specific code.)
(This interpretation does lead to problems with systems where the
end-of-line indicator is not a sequence of characters, for example
where each line is a fixed-width record. Such systems are rare
these days.)
As "Cheers and hth. - Alf"'s answer
points out, there is an open
Defect Report
for this issue. It was submitted in 2013 and has not yet been
resolved.
Personally, I think the root of the confusion is the word "any"
(emphasis added as before):
Between the initial and final double quote characters of the raw
string, any transformations performed in phases 1 and 2 (trigraphs,
universal-character-names, and line splicing) are reverted; this
reversion shall apply before any d-char, r-char, or delimiting
parenthesis is identified.
Surely the mapping of physical source file characters to
the basic source character set can reasonably be thought of
as a transformation. The parenthesized clause "(trigraphs,
universal-character-names, and line splicing)" seems to be intended
to specify which transformations are to be reverted, but that
either attempts to change the meaning of the word "transformations"
(which the standard does not formally define) or contradicts the use
of the word "any".
I suggest that changing the word "any" to "certain" would express
the apparent intent much more clearly:
Between the initial and final double quote characters of the raw
string, certain transformations performed in phases 1 and 2 (trigraphs,
universal-character-names, and line splicing) are reverted; this
reversion shall apply before any d-char, r-char, or delimiting
parenthesis is identified.
This wording would make it much clearer that "trigraphs,
universal-character-names, and line splicing" are the only
transformations that are to be reverted. (Not everything done
in translation phases 1 and 2 is reverted, just those specific
listed transformations.)
The standard seems to indicate that:
R"""line 1
line 2
line3"""
is equivalent to:
"line 1\nline 2\nline3"
From 2.14.5 String literals of the C++11 standard:
4 [ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
5 [ Example: The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n".
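For what it's worth, here is a small self-contained check of that claimed equivalence, written with the parenthesised delimiter form that raw string literals use (my own sketch; it is expected to pass wherever source newlines map to a single '\n', as the note above intends):
#include <cassert>
#include <cstring>

int main() {
    // Source-file newlines inside the raw string should come through
    // as single '\n' characters in the resulting string literal.
    const char* raw = R"(line 1
line 2
line3)";
    assert(std::strcmp(raw, "line 1\nline 2\nline3") == 0);
}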
Note: the question has changed substantially since the answers were posted. Only half of it remains, namely the pure C++ aspect. The network focus in this answer addresses the original question's “sending a multi-line string to a server with well-defined end-of-line requirements”. I do not chase question evolution in general.
Internally in the program, the C++ standard for newline is \n. This is used also for newline in a raw literal. There is no special convention for raw literals.
Usually \n maps to ASCII linefeed, which is the value 10.
I'm not sure what it maps to in EBCDIC, but you can check that if needed.
On the wire, however, it's my impression that most protocols use ASCII carriage return plus linefeed, i.e. 13 followed by 10. This is sometimes called CRLF, after the ASCII abbreviations CR for carriage return and LF for linefeed. When the C++ escapes are mapped to ASCII this is simply \r\n in C++.
You need to abide by the requirements of the protocol you're using.
For ordinary file/stream i/o the C++ standard library takes care of mapping the internal \n to whatever convention the host environment uses. This is called text mode, as opposed to binary mode where no mapping is performed.
For network i/o, which is not covered by the standard library, the application code must do this itself, either directly or via some library functions.
There is an active issue about this, core language defect report #1655 “Line endings in raw string literals”, submitted by Mike Miller 2013-04-26, where he asks,
“is it intended that, for example, a CRLF in the source of a raw string literal is to be represented as a newline character or as the original characters?”
Since line ending values differ depending on the encoding of the original file, and considering that in some file systems there is not an encoding of line endings, but instead lines as records, it's clear that the intention is not to represent the file contents as-is – since that's impossible to do in all cases. But as far as I can see this DR is not yet resolved.
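Coming back to the network point above: as a rough illustration (my own sketch, not anything the standard or any particular networking library prescribes; the helper name to_crlf is made up), translating the internal '\n' convention to CRLF before handing text to a wire protocol might look like this:
#include <string>

// Hypothetical helper: expand each bare '\n' into "\r\n". The input is
// assumed to use '\n' alone as its line separator (the C++ internal convention).
std::string to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        if (c == '\n')
            out += "\r\n";
        else
            out += c;
    }
    return out;
}
Fed the multi-line raw literal from the question, this would put "line 1\r\nline 2\r\nline3" on the wire.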

C++ Compilation. Phase of translation #1. Universal character name

I can't understand what this means in the C++ standard:
Any source file character not in the basic source character set (2.3)
is replaced by the universal-character-name that designates that
character. (An implementation may use any internal encoding, so long
as an actual extended character encountered in the source file, and
the same extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are
handled equivalently except where this replacement is reverted in a
raw string literal.)
As I understand it, if the compiler sees a character that is not in the basic character set, it just replaces it with a sequence of characters in the format '\uNNNN' or '\UNNNNNNNN'. But I don't get how to obtain this NNNN or NNNNNNNN.
So this is my question: how is this conversion done?
Note the preceding sentence which states:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
That is, it's entirely up to the compiler how it actually interprets the characters or bytes that make up your file. In doing this interpretation, it must decide which of the physical characters belong to the basic source character set and which don't. If a character does not belong, then it is replaced with the universal character name (or at least, the effect is as if it had done).
The point of this is to reduce the source file down to a very small set of characters - there are only 96 characters in the basic source character set. Any character not in the basic source character set has been replaced by \, u or U, and some hexadecimal digits (0-F).
A universal character name is one of:
\uNNNN
\UNNNNNNNN
Where each N is a hexadecimal digit. The meaning of these digits is given in §2.3:
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed.
The ISO/IEC 10646 standard originated before Unicode and defined the Universal Character Set (UCS). It assigned code points to characters and specified how those code points should be encoded. The Unicode Consortium and the ISO group then joined forces to work on Unicode. The Unicode standard specifies much more than ISO/IEC 10646 does (algorithms, functional character specifications, etc.) but both standards are now kept in sync.
So you can think of the NNNN or NNNNNNNN as the Unicode code point for that character.
As an example, consider a line in your source file containing this:
const char* str = "Hellô";
Since ô is not in the basic source character set, that line is internally translated to:
const char* str = "Hell\u00F4";
This will give the same result.
There are only certain parts of your code in which a universal-character-name is permitted (a small sketch follows this list):
In string literals
In character literals
In identifiers (however, this is not very well supported)
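A minimal sketch covering those three contexts (my own example; it assumes a UTF-8 source encoding, a C++11 compiler, and, for the identifier case, the spotty support just mentioned):
const char* s1 = "Hell\u00F4";   // UCN in a string literal
const char* s2 = "Hellô";        // same character typed directly; phase 1 maps it to the UCN
wchar_t     w  = L'\u00F4';      // UCN in a (wide) character literal
int caf\u00E9 = 0;               // UCN in an identifier ("café"); support varies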
But I don't get how to obtain this NNNN or NNNNNNNN. So this is my question: how to do conversion?
The mapping is implementation-defined (e.g. §2.3 footnote 14). For instance, if I save the following file as Latin-1:
#include <iostream>
int main() {
    std::cout << "Hellö\n";
}
And compile it with g++ on OS X, I get the following output after running it:
Hell�
… but if I had saved it as UTF-8 I would have gotten this:
Hellö
Because GCC assumes UTF-8 as the input encoding on my system.
Other compilers may perform different mappings.
So, if your file is called Hello°¶.c, then when the compiler uses that name internally, e.g. if we do:
cout << __FILE__ << endl;
it would translate Hello°¶.c to Hello\u00b0\u00b6.c.
However, when I just tried this with g++, it doesn't do that...
But the assembler output contains:
.string "Hello\302\260\302\266.c"
i.e. the raw UTF-8 bytes (octal 302 260 for ° and 302 266 for ¶) rather than UCNs.

C++11 character literal '\xC4' standard type with UTF-8 execution character set?

Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI which requires the char type be a signed 8-bit byte).
The letter Ä (A umlaut) has the Unicode code point U+00C4, and has a two-code-unit UTF-8 representation of {0xC3, 0x84}.
The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.
Is the compiler standard-compliant and ABI-compliant? What is your reasoning?
Relevant quotes from C++11 standard:
2.14.3.1
An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than
one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined
value.
2.14.3.4
The escape \xhhh consists of the backslash followed by x followed by
one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character
literal is implementation-defined if it falls outside of the implementation-defined range defined for char
§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:
A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.
From 2.14.3p4:
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
x86 compilers historically (and as you point out, that practice is now an official standard of some sort) have signed chars. \xC4 is out of range for that, so the implementation is required to document the literal value it will produce.
It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
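If you want to see which reading your own implementation takes, a small probe like this (my own sketch) prints the type and value it actually gives '\xC4':
#include <cstdio>
#include <type_traits>

int main() {
    // Per the published C++11 text this is a char with an implementation-defined
    // value; under the defect resolution quoted above it may instead be an int.
    std::printf("is char: %d, is int: %d, value: %d\n",
                (int)std::is_same<decltype('\xC4'), char>::value,
                (int)std::is_same<decltype('\xC4'), int>::value,
                (int)'\xC4');
}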
You're mixing apples, oranges, pears and kumquats :)
Yes, "\xc4" is a legal character literal. Specifically, what the standard calls a "narrow character literal".
From the C++ standard:
The glyphs for the members of the basic source character set are
intended to identify characters from the subset of ISO/IEC 10646 which
corresponds to the ASCII character set. However, because the mapping
from source file characters to the source character set (described in
translation phase 1) is specified as implementation-defined, an
implementation is required to document how the basic source characters
are represented in source files.
This might help clarify:
Rules for C++ string literals escape character
This might also help, if you're not familiar with it:
The absolute minimum every software developer should know about Unicode
Here is another good, concise - and illuminating - reference:
IBM Developerworks: Character literals

In the C++ standard, where does it indicate the spacing protocol for the replacement of category descriptives by the source code it represents?

At the risk of asking a question deemed too nit-picky, I have spent a long time trying to justify (as a single example of something that occurs throughout the standard in different contexts) the following definition of an integer literal in §2.14.2 of the C++11 standard, specifically in regards to one detail, the presence of whitespace in the syntax notation itself.
(Note that this example - the definition of an integer literal - is not the point of my question. The point of my question is to ask about the syntax description notation used by the C++ standard itself, specifically in regards to whitespace between grammatical category names. The example I give here - the definition of an integer literal - is specifically chosen only because it acts as an example that is simple and clear-cut.)
(Abbreviated for concision, from §2.14.2):
integer-literal:
decimal-literal integer-suffix_opt
decimal-literal:
nonzero-digit
decimal-literal digit
(with nonzero-digit and digit as expected, [0] 1 ... 9). (Note: The above text is all in italics in the standard.)
This all makes sense to me, assuming that the SPACE between the syntax category descriptives decimal-literal and digit is understood to NOT be present in the actual source code, but is only present in the syntax description itself as it appears here in section §2.14.2.
This convention - placing a space between category descriptives within the notation, where it is understood that the space is not to be present in the source code - is used in other places in the specification. The example here is just a clear-cut case where the space is clearly not supposed to be present in the source code. (See addendum to this question for counterexamples from the standard where whitespace or other separator/s must be present, or is optional, between category descriptives when those category descriptives are replaced by actual tokens in the source code.)
Again, at the risk of being nit-picky, I cannot find anywhere in the standard a statement of convention that spaces are NOT to be present in the source code when interpreting notation such as in this example.
The standard does discuss notational convention in §1.6.1 (and thereafter). The only relevant text that I can find regarding this is:
In the syntax notation used in this International Standard, syntactic
categories are indicated by italic type, and literal words and
characters in constant width type. Alternatives are listed on separate
lines except in a few cases where a long set of alternatives is marked
by the phrase “one of.”
I would not be so nit-picky; however, I find the notation used within the standard to be somewhat tricky, so I would like to be clear on all of the details. I appreciate anyone willing to take the time to fill me in on this.
ADDENDUM In response to comments in which a claim is made similar to "it's obvious that whitespace should not be included in the final source code, so there's no need for the standard to explicitly state this": I have chosen a trivial example in this question, where it is obvious. There are many cases in the standard where it isn't obvious without a priori knowledge of the language (in my opinion), such as §8.0.4 discussing "const" and "volatile":
cv-qualifier-seq:
cv-qualifier cv-qualifier-seq_opt
... Note the opposite assumption here (whitespace, or another separator or separators, is required in the final source code), but that's not possible to deduce from the syntax notation itself.
There are also cases where a space is optional, such as:
noptr-abstract-declarator:
noptr-abstract-declarator_opt parameters-and-qualifiers
(In this example, to make a point, I won't give the section number or paraphrase what is being discussed; I'll just ask if it's obvious from the grammar notation itself that, in this context, whitespace in the final source code is optional between the tokens.)
I suspect that the comments along these lines - "it's obvious, so that's what it must be" - are the result of the fact that the example I've chosen is so obvious. That's exactly why I chose the example.
§2.7.1
There are five kinds of tokens: identifiers, keywords, literals,
operators, and other separators. Blanks, horizontal and vertical tabs,
newlines, formfeeds, and comments (collectively, “white space”), as
described below, are ignored except as they serve to separate tokens.
So, if a literal is a token, and whitespace serves to separate tokens, a space in between the digits of a literal would split them into two separate tokens; the space therefore cannot be part of the same literal.
I'm reasonably certain there is no more direct explanation of this fact in the standard.
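For instance (a trivial illustration of my own, not an example taken from the standard):
int a = 123;    // one pp-number token: a single integer literal
int b = 12 3;   // ill-formed: the whitespace yields two separate tokens, 12 and 3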
The notation used is similar enough to typical BNF that they take many of the same general conventions for granted, including the fact that whitespace in the notation has no significance beyond separating the tokens of the BNF itself -- that if/when whitespace has significance in the source code beyond separating tokens, they'll include notation to specify it directly (e.g., for most preprocessing directives, the new-line is specified directly:
# ifdef identifier new-line group_opt
or:
# include < h-char-sequence > new-line
The blame for that probably goes back to the Algol 68 standard, which went so far overboard in its attempts at precisely specifying syntax that it was essentially impossible for anybody to read without weeks of full-time study.[1] Since then, any more than the most cursory explanation of the syntax description language leads to rejection on the basis that it's too much like Algol 68 and will undoubtedly fail because it's too formal and nobody will ever read or understand it.
[1] How could it be that bad, you ask? It basically went like this: they started with a formal English description of a syntax description language. That wasn't used to define Algol 68 though -- it was used to specify (even more precisely) another syntax description language. That second syntax description language was then used to specify the syntax of Algol 68 itself. So, you had to learn two separate syntax description languages before you could start to read the Algol 68 syntax itself at all. As you can undoubtedly guess, almost nobody ever did.
As you say, the standard says:
literal words and characters in constant width type
So, if a literal space were to be included in a rule, it would have to be rendered in a constant width type. Close examination of the standard will reveal that the space in the production you refer to is narrower than the constant width type. (Also your attempt to quote the standard is a misrepresentation because it renders in constant-width type that which should be rendered in italics, with a consequent semantic change.)
Ok, that was the "aspiring language lawyer" answer; furthermore, it doesn't really work because it fails on all the productions which are of the form:
One of:
0 1 2 3 4 5 6 7 8 9
I think, in reality, the answer is that whitespace is not part of the formal grammar, because it serves only to separate tokens; furthermore, that statement is mostly true of the grammar itself, whose tokens are separated by whitespace without that whitespace being a token, except that indentation in the grammar matters, unlike indentation in a program.
Addendum to answer the addendum
It's not actually true that const and volatile need to be separated by whitespace. They simply need to be separate tokens. Example:
#define A(x)x
A(const)A(volatile)A(int)A(x)A(;)
(After macro replacement this is the token sequence const volatile int x ;, even though no whitespace ever separated the keywords in the source.)
Again, more seriously, Chapter 2 (with particular reference to 2.2 and 2.5, but you have to read the entire text) describes how the program text is processed in order to produce a stream of tokens. All of the rules in which you claim whitespace must be ignored are in this part of the grammar, and all of the rules in which you claim whitespace might be required are not.
These are really two separate grammars, but the lexical grammar is necessarily incomplete because you need to consider the operation of the preprocessor in order to apply it.
I believe that everything I said can be gleaned from the standard. Here are some excerpts:
2.2(3) The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments)… The process of dividing a source file’s characters into preprocessing tokens is context-dependent.
…
2.2(7) White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
I think that all this makes it clear that there are two grammars, one lexical -- that is, it produces a lexeme (token) from a sequence of graphemes (characters) -- and the other syntactic -- that is, it produces an abstract syntax tree from a sequence of lexemes (tokens). In neither case (with a small exception, which I'll get to in a minute) is whitespace considered anything other than something which stops two lexemes from running into each other if the lexical grammar would otherwise allow that. (See the algorithm in 2.5(3).)
C++ is not syntactically pretty, so there are almost always exceptions. One of these, inherited from C, is the difference between:
#define A(X)(X)
and
#define A (X)(X)
Preprocessing directives have their own parsing rules, and this one is typified by the definition:
lparen:
  a ( character not immediately preceded by white-space
This, I would say, is the exception that proves the rule [Note 1]. The fact that it is necessary to say that this ( is not preceded by white-space shows that the normal use of the token ( in a syntactic rule does not say anything about its blancospatial context.
So, to paraphrase Ray Cummings (not Albert Einstein, as is sometimes claimed), "time and white-space are all that separate one token from another." [Note 2]
[Note 1] I use the phrase here in its original legal sense, as per Cicero.
[Note 2]:
"Time," said George, "why I can give you a definition of time. It's what keeps everything from happening at once."
A ripple of laughter went about the little group of men.
"Quite so," agreed the Chemist. "And, gentlemen, that's not so funny as it sounds. As a matter of fact, it is really not a bad scientific definition. Time and space are all that separate one event from another…
-- From The man who mastered time, by Ray Cummings, 1929, Ace Books. See first page, in Google books
The Standard actually has two separate grammars.
The preprocessor grammar, described in sections 2 and 16, defines how a sequence of source characters is converted to a sequence of preprocessing tokens and whitespace characters, in translation phases 1-6. In some of these phases and parts of this grammar, whitespace is significant.
Whitespace characters which are not part of preprocessing tokens stop being significant after translation phase 4. The Standard explicitly says at the start of translation phase 7 to discard whitespace characters between preprocessing tokens.
The language grammar defines how a sequence of tokens (converted from preprocessing tokens) are syntactically and semantically interpreted in translation phase 7. There is no such thing as whitespace in this grammar. (By this point, ' ' is a character-literal just like 'c' is.)
In both grammars, the space between grammar components visible in the Standard has nothing to do with source or execution whitespace characters, it's just there to make the Standard legible. When the preprocessor grammar depends on whitespace, it spells it out with words, for example:
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
and
control-line:
...
# define identifier lparen identifier-list[opt] ) replacement-list newline
...
lparen:
a ( character not immediately preceded by white-space
So there may not be whitespace between digits of an integer-literal because the preprocessor grammar does not allow it.
One other important rule here is from C++11 2.5p3:
If the input stream has been parsed into preprocessing tokens up to a given character:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. ...
Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
So there must be whitespace between const and volatile tokens because otherwise, the longest-token-possible rule would convert that to a single identifier token constvolatile.
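A minimal illustration of both points (my own example):
// constvolatile int x1 = 0;   // ill-formed: lexed as the single unknown identifier 'constvolatile'
const volatile int x2 = 0;     // fine: whitespace separates the two keyword tokens
const/**/volatile int x3 = 0;  // also fine: a comment separates tokens just as whitespace does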

C++ Preprocessor string literal concatenation

I found this regarding how the C preprocessor should handle string literal concatenation (phase 6). However, I can not find anything regarding how this is handled in C++ (does C++ use the C preprocessor?).
The reason I ask is that I have the following:
const char * Foo::encoding = "\0" "1234567890\0abcdefg";
where encoding is a static member of class Foo. Without the availability of concatenation I wouldn't be able to write that sequence of characters like that.
const char * Foo::encoding = "\01234567890\0abcdefg";
is something entirely different, due to the way \012 is interpreted (as a single octal escape).
I don't have access to multiple platforms and I'm curious how confident I should be that the above is always handled correctly - i.e. that I will always get { 0, '1', '2', '3', ... }
The language (C as well as C++) has no "preprocessor". A "preprocessor", as a separate functional unit, is an implementation detail. The way the source file(s) are handled is defined by the so-called phases of translation. One of the phases, in C as well as in C++, involves concatenating string literals.
In the C++ language standard this is described in §2.1. For C++ (C++03) it is phase 6:
6 Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.
Yes, it will be handled as you describe, because it is in stage 5 that,
Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set (C99 §5.1.1.2/1)
The language in C++03 is effectively the same:
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (C++03 §2.1/5)
So, escape sequences (like \0) are converted into members of the execution character set in stage five, before string literals are concatenated in stage six.
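To make the ordering concrete, here is a small check of my own; it relies only on the phase ordering quoted above:
#include <cassert>

int main() {
    // Escapes are processed per literal in phase 5, then the pieces are
    // concatenated in phase 6, so the leading "\0" stays a lone NUL and the
    // '1' of the second literal is never pulled into an octal escape.
    const char* enc = "\0" "1234567890\0abcdefg";
    assert(enc[0] == '\0' && enc[1] == '1' && enc[2] == '2');

    // Contrast: written as one literal, \012 is a single octal escape (value 10).
    const char* other = "\01234567890\0abcdefg";
    assert(other[0] == '\012' && other[1] == '3');
}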
Because of the agreement between the C++ and C standards, this is handled the same way in both. Most, if not all, C++ implementations use a C preprocessor, so yes, C++ uses the C preprocessor.