C++ Preprocessor string literal concatenation

I found this regarding how the C preprocessor should handle string literal concatenation (phase 6). However, I cannot find anything regarding how this is handled in C++ (does C++ use the C preprocessor?).
The reason I ask is that I have the following:
const char * Foo::encoding = "\0" "1234567890\0abcdefg";
where encoding is a static member of class Foo. Without string literal concatenation I wouldn't be able to write that sequence of characters like that, since
const char * Foo::encoding = "\01234567890\0abcdefg";
is something entirely different, because \012 is interpreted as a single octal escape sequence.
I don't have access to multiple platforms, and I'm curious how confident I should be that the above is always handled correctly, i.e. that I will always get { 0, '1', '2', '3', ... }.

The language (C as well as C++) has no "preprocessor". A "preprocessor", as a separate functional unit, is an implementation detail. The way the source file(s) is handled is defined by the so-called phases of translation. One of the phases, in C as well as in C++, involves concatenating string literals.
In the C++ language standard this is described in section 2.1. For C++ (C++03) it is phase 6:
Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.

Yes, it will be handled as you describe, because it is in phase 5 that
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set (C99 §5.1.1.2/1)
The language in C++03 is effectively the same:
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (C++03 §2.1/5)
So, escape sequences (like \0) are converted into members of the execution character set in phase 5, before string literals are concatenated in phase 6.
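As a sanity check, here is a minimal sketch (the array names are made up) that encodes the questioner's expectation and the contrast with the single-literal spelling:
#include <cassert>

int main()
{
    // Concatenated form: "\0" is converted in phase 5, then the pieces are
    // merged in phase 6, so the array starts with a NUL byte followed by '1'.
    const char enc[] = "\0" "1234567890\0abcdefg";
    assert(enc[0] == '\0' && enc[1] == '1' && enc[2] == '2');
    assert(sizeof enc == 20);   // 1 + 10 + 1 + 7 characters plus the final NUL

    // Single-literal form: "\012" is one octal escape (value 10), so the
    // first element is '\n', not a NUL followed by '1'.
    const char alt[] = "\01234567890\0abcdefg";
    assert(alt[0] == '\012' && alt[1] == '3');
    (void)alt;
    return 0;
}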

This is because of the agreement between the C++ and C standards: most, if not all, C++ implementations use a C preprocessor, so yes, in practice C++ uses the C preprocessor.

Related

Compilation of string literals

Why can two string literals separated by a space, tab or "\n" be compiled without an error?
int main()
{
char * a = "aaaa" "bbbb";
}
"aaaa" is a char*
"bbbb" is a char*
There is no specific concatenation rule to process two string literals. And obviously the following code gives an error during compilation:
#include <iostream>
int main()
{
char * a = "aaaa";
char * b = "bbbb";
std::cout << a b;
}
Is this concatenation common to all compilers? Where is the null termination of "aaaa"? Is "aaaabbbb" a continuous block of RAM?
If you look at e.g. this translation phase reference, in phase 6 it does:
Adjacent string literals are concatenated.
And that's exactly what happens here. You have two adjacent string literals, and they are concatenated into a single string literal.
It is standard behavior.
It only works for string literals, not two pointer variables, as you noticed.
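As a quick sanity check (my own sketch, not from the linked reference), sizeof applied to the merged literal shows there is only one terminating NUL and one contiguous array, which also answers the "where is the null termination of "aaaa"" question:
#include <iostream>

int main()
{
    const char* a = "aaaa" "bbbb";
    // The merged literal contains 8 characters plus one terminating NUL;
    // there is no embedded NUL after "aaaa".
    std::cout << sizeof("aaaa" "bbbb") << '\n';   // prints 9
    std::cout << a << '\n';                        // prints aaaabbbb
    return 0;
}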
In this statement
char * a = "aaaa" "bbbb";
the compiler, in a step of compilation that precedes syntax analysis, treats adjacent string literals as one literal.
So for the compiler the above statement is equivalent to
char * a = "aaaabbbb";
that is, the compiler stores only one string literal, "aaaabbbb"
Adjacent string literals are concatenated as per the rules of the C (and C++) standards. But no such rule exists for adjacent identifiers (i.e. the variables a and b).
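To make the contrast concrete (my own sketch, not part of the original answer): adjacent literals are merged during translation, while two pointer variables have to be combined at run time, e.g. with std::string. The C++14 wording for the literal case is quoted right after.
#include <iostream>
#include <string>

int main()
{
    // Merged by the compiler in translation phase 6.
    const char* lit = "aaaa" "bbbb";

    // No such rule for variables: concatenation happens at run time.
    std::string a = "aaaa", b = "bbbb";

    std::cout << lit << '\n' << a + b << '\n';   // both lines print aaaabbbb
    return 0;
}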
To quote the C++14 draft (N3797), §2.14.5:
In translation phase 6 (2.2), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior.
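Here is a short sketch of what that paragraph allows and forbids; the commented-out line is the ill-formed UTF-8/wide mix:
int main()
{
    // An unprefixed literal adjacent to a wide literal is treated as wide,
    // so the result is a single wide string literal.
    const wchar_t* w = L"wide " "and narrow";

    // Ill-formed per the quoted paragraph: a UTF-8 literal adjacent to a
    // wide string literal.
    // const auto* bad = u8"utf-8 " L"wide";

    (void)w;
    return 0;
}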
In C and C++, adjacent string literals are compiled as a single string literal. For example, this:
"Some text..." "and more text"
is equivalent to:
"Some text...and more text"
That is for historical reasons:
The original C language was designed in 1969-1972 when computing was still dominated by the 80 column punched card. Its designers used 80 column devices such as the ASR-33 Teletype. These devices did not automatically wrap text, so there was a real incentive to keep source code within 80 columns. Fortran and Cobol had explicit continuation mechanisms to do so, before they finally moved to free format.
It was a stroke of brilliance for Dennis Ritchie (I assume) to realise that there was no ambiguity in the grammar and that long ASCII strings could be made to fit into 80 columns by the simple expedient of getting the compiler to concatenate adjacent literal strings. Countless C programmers were grateful for that small feature.
Once the feature is in, why would it ever be removed? It causes no grief and is frequently handy. I for one wish more languages had it. The modern trend is to have extended strings with triple quotes or other symbols, but the simplicity of this feature in C has never been outdone.
Similar question here.
String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).
(source)

C11 Compilation. Phase of translation #1 and #5. Universal character names

I'm trying to understand Universal Character Names in the C11 standard and found that the N1570 draft of the C11 standard has much less detail than the C++11 standard with respect to Translation Phases 1 and 5 and the formation and handling of UCNs within them. This is what each has to say:
Translation Phase 1
N1570 Draft C11 5.1.1.2p1.1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
C++11 2.2p1.1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
Translation Phase 5
N1570 Draft C11 5.1.1.2p1.5:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; [...]
C++ 2.2p1.5:
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [...]
(emphasis was added on differences)
The Questions
1. In the C++11 standard, it is very clear that source file characters not in the basic source character set are converted to UCNs, and that they are treated exactly as would have been a UCN in that same place, with the sole exception of raw-strings. Is the same true of C11? When a C11 compiler sees a multi-byte UTF-8 character such as °, does it too translate this to \u00b0 in phase 1, and treat it just as if \u00b0 had appeared there instead?
2. To put it in a different way, at the end of which translation phase, if any, are the following snippets of code transformed into textually equivalent forms for the first time in C11?
const char* hell° = "hell°";
and
const char* hell\u00b0 = "hell\u00b0";
3. If in 2. the answer is "in none", then during which translation phase are those two identifiers first understood to refer to the same thing, despite being textually different?
4. In C11, are UCNs in character/string literals also converted in phase 5? If so, why omit this from the draft standard?
5. How are UCNs in identifiers (as opposed to in character/string literals as already mentioned) handled in both C11 and C++11? Are they also converted in phase 5? Or is this something implementation-defined? Does GCC for instance print out such identifiers in UCN-coded form, or in actual UTF-8?
Comments turned into an answer
Interesting question!
The C standard can leave more of the conversions unstated because they are implementation-defined (and C has no raw strings to confuse the issue).
What it says in the C standard is sufficient — except that it leaves your question 1 unanswerable.
Q2 has to be 'Phase 5', I think, with caveats about it being 'the token stream is equivalent'.
Q3 is strictly N/A, but Phase 7 is probably the answer.
Q4 is 'yes', and it says so because it mentions 'escape sequences' and UCNs are escape sequences.
Q5 is 'Phase 5' too.
Can the C++11-mandated processes in Phase 1 and 5 be taken as compliant within the wording of C11 (putting aside raw strings)?
I think they are effectively the same; the difference arises primarily from the raw literal issue, which is specific to C++. Generally, the C and C++ standards try not to make things gratuitously different, and in particular try to keep the workings of the preprocessor and the low-level character parsing the same in both (which has been easier since C99 added support for C++ // comments, but which evidently got harder with the addition of raw literals to C++11).
One day, I'll have to study the raw literal notations and their implications more thoroughly.
First, please note that these distinctions have existed since 1998; UCNs were first introduced in C++98, then a new standard (ISO/IEC 14882, 1st edition, 1998), and then made their way into the C99 revision of the C standard; but the C committee (and existing implementers, with their pre-existing implementations) did not feel the C++ way was the only way to achieve the trick, particularly with corner cases and character sets smaller than, or simply different from, Unicode; for example, the requirement to ship mapping tables from whatever encodings were supported to Unicode was a preoccupation for C vendors in 1998.
The C standard (consciously) avoids deciding this and lets the compiler choose how to proceed. While your reasoning obviously takes place in the context of UTF-8 character sets used for both source and execution, there is a large (and pre-existing) range of different C99/C11 compilers available that use different sets, and the committee felt it should not restrict implementers too much on this issue. In my experience, most compilers keep the two forms distinct in practice (for performance reasons).
Because of this freedom, some compilers can have them identical after phase 1 (as a C++ compiler shall), while others can leave them distinct as late as phase 7 for the first ° character (the one in the identifier); the second ° character (the one in the string) ought to be the same after phase 5, assuming ° is part of the extended execution character set supported by the implementation.
For the other answers, I won't add anything to Jonathan's.
As for your additional question about whether the more deterministic C++ process is C-standard-compliant: it is clearly a goal to be so; and if you find a corner case which shows otherwise (a C++11-compliant preprocessor which would not conform to the C99 and C11 standards), then you should consider asking the WG14 committee about a potential defect.
Obviously, the reverse is not true: it is possible to write a preprocessor whose handling of UCNs complies with C99/C11 but not with the C++ standards; the most obvious difference is with
#define str(t) #t
#define str_is(x, y) const char * x = y " is " str(y)
str_is(hell°, "hell°");
str_is(hell\u00B0, "hell\u00B0");
which a C-compliant preprocessor can render in the same way as your examples (and most do so) and, as such, will produce distinct renderings; but I am under the impression that a C++-compliant preprocessor is required to transform it into (the strictly equivalent)
const char* hell° = "hell°" " is " "\"hell\\u00b0\"";
const char* hell\u00b0 = "hell\\u00b0" " is " "\"hell\\u00b0\"";
Last, but not least, I believe not many compilers are fully compliant at this level of detail!

C++11 character literal '\xC4' standard type with UTF-8 execution character set?

Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI which requires the char type be a signed 8-bit byte).
The letter Ä (umlaut) has the Unicode code point U+00C4 and a two-code-unit UTF-8 representation of {0xC3, 0x84}.
The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.
Is the compiler standard-compliant and ABI-compliant? What is your reasoning?
Relevant quotes from C++11 standard:
§2.14.3/1
An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.
§2.14.3/4
The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char.
§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:
A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.
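As an aside, the multicharacter-literal half of that wording is easy to observe. A minimal sketch (the printed value is implementation-defined; the one in the comment is what GCC and Clang typically produce, usually with a -Wmultichar warning):
#include <iostream>

int main()
{
    int m = 'ab';   // multicharacter literal: type int, implementation-defined value
    std::cout << std::hex << m << '\n';   // commonly prints 6162, i.e. ('a' << 8) | 'b'
    return 0;
}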
From §2.14.3/4:
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
x86 compilers historically (and as you point out, that practice is now an official standard of some sort) have signed chars. \xC4 is out of range for that, so the implementation is required to document the literal value it will produce.
It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
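If you want to see what your own implementation does, here is a hedged check (the results in the comments are what GCC and Clang on x86-64 typically produce; both the type and the value may legitimately differ elsewhere, since they are implementation-defined):
#include <iostream>
#include <type_traits>

int main()
{
    auto c = '\xC4';
    // On GCC/Clang for x86-64 (signed 8-bit char), this prints "char 1 -60":
    // the literal keeps type char, and the out-of-range value 0xC4 wraps to -60.
    std::cout << (std::is_same<decltype(c), char>::value ? "char" : "other")
              << ' ' << sizeof c << ' ' << static_cast<int>(c) << '\n';
    return 0;
}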
You're mixing apples, oranges, pears and kumquats :)
Yes, "\xc4" is a legal character literal. Specifically, what the standard calls a "narrow character literal".
From the C++ standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
This might help clarify:
Rules for C++ string literals escape character
This might also help, if you're not familiar with it:
The absolute minimum every software developer should know about Unicode
Here is another good, concise - and illuminating - reference:
IBM Developerworks: Character literals

C++11: Example of difference between ordinary string literal and UTF-8 string literal?

A string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
I don't understand the difference between an ordinary string literal and a UTF-8 string literal.
Can someone provide an example of a situation where they are different? (Cause different compiler output)
(I mean from the POV of the standard, not any particular implementation)
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set.
The C and C++ languages allow a huge amount of latitude in their implementations. C was written long before UTF-8 was "the way to encode text in single bytes": different systems had different text encodings.
So the byte values for a string in C and C++ are really up to the compiler. 'A' is whatever the compiler's chosen encoding for the character A is, which may not agree with UTF-8.
C++ has added the requirement that real UTF-8 string literals must be supported by compilers. The bit value of u8"A"[0] is fixed by the C++ standard through the UTF-8 standard, regardless of the preferred encoding of the platform the compiler is targeting.
Now, much as most platforms C++ targets use 2's complement integers, most compilers have character encodings that are mostly compatible with UTF-8. So for strings like "hello world", u8"hello world" will almost certainly be identical.
For a concrete example, from man gcc
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
is an example of being able to change the execution and input character sets of C/C++.
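Putting those options together gives a concrete divergence. A sketch, assuming it is compiled as C++11/14/17 (in C++20 u8 literals switch to char8_t) and that your iconv accepts the example charset name latin1:
#include <cstdio>

int main()
{
    const char ordinary[] = "\u00c4";   // encoded in the execution character set
    const char utf8[]     = u8"\u00c4"; // always UTF-8: 0xC3 0x84

    // With g++ -fexec-charset=latin1 this prints "2 3" (one byte for Ä plus the
    // NUL versus two UTF-8 code units plus the NUL); with the default UTF-8
    // execution character set both arrays are identical and it prints "3 3".
    std::printf("%zu %zu\n", sizeof ordinary, sizeof utf8);
    return 0;
}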

Implementation of string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++
Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.
printf("helloworld.c" ": %d: Hello "
"world\n", 10);
Is equivalent (syntactically) to:
printf("helloworld.c: %d: Hello world\n", 10);
However, the standard doesn't seem to specify which part of the compiler has to handle this - should it be the preprocessor (cpp) or the compiler itself. Some online research tells me that this function is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.
However, running cpp in Linux shows that cpp doesn't do it:
eliben@eliben-desktop:~/test$ cat cpptest.c
int a = 5;
"string 1" "string 2"
"string 3"
eliben@eliben-desktop:~/test$ cpp cpptest.c
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;
"string 1" "string 2"
"string 3"
So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?
Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.
P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python based C parser should handle string literal concatenation (which it doesn't do, at the moment), or leave it to cpp which it assumes runs before it.
The standard doesn't specify a preprocessor vs. a compiler, it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 the linker -- but none of that is required by the standard.
Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.
Edit:
Your "I.e." link at the beginning of the post answers the question:
Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...
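For instance, a macro that expands to a string literal ends up adjacent to the surrounding literals after phase 4, and phase 6 then merges them. A minimal sketch (the macro name is made up for the illustration):
#include <cstdio>

// GREETEE is a made-up macro name used only for this illustration.
#define GREETEE "world"

int main()
{
    // After phase 4 (macro expansion) the argument is "Hello, " "world" "!\n",
    // which phase 6 concatenates into the single literal "Hello, world!\n".
    std::printf("Hello, " GREETEE "!\n");
    return 0;
}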
In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):
5.1.1.2 Translation phases
...
4. Preprocessing directives are executed and macro invocations are expanded. ...
5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.
6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
The standard does not define that the implementation must use a pre-processor and compiler, per se.
Step 4 is clearly a preprocessor responsibility.
Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.
I would handle it in the token-scanning part of the parser, so in the compiler. It seems more logical. The preprocessor does not have to know the "structure" of the language, and in fact it usually ignores it, so that macros can generate uncompilable code. It handles nothing more than what it is entitled to handle by directives that are specifically addressed to it (# ...), and the "consequences" of them (like those of a #define x h, which makes the preprocessor change every x into an h).
There are tricky rules for how string literal concatenation interacts with escape sequences.
Suppose you have
const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";
then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
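For completeness, a runnable version of that check (just a sketch; the assertions encode exactly the equalities described above):
#include <cassert>
#include <cstring>

int main()
{
    const char x1[] = "a\15" "4";   // \15 is octal 13 (CR), converted before concatenation
    const char y1[] = "a\154";      // \154 is octal 108, i.e. 'l'
    const char x2[] = "a\r4";
    const char y2[] = "al";

    assert(std::strcmp(x1, x2) == 0);
    assert(std::strcmp(y1, y2) == 0);
    return 0;
}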