Implementation of string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++
Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.
printf("helloworld.c" ": %d: Hello "
"world\n", 10);
Is equivalent (syntactically) to:
printf("helloworld.c: %d: Hello world\n", 10);
However, the standard doesn't seem to specify which part of the compilation process has to handle this - should it be the preprocessor (cpp) or the compiler itself. Some online research tells me that this task is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.
However, running cpp in Linux shows that cpp doesn't do it:
eliben#eliben-desktop:~/test$ cat cpptest.c
int a = 5;
"string 1" "string 2"
"string 3"
eliben#eliben-desktop:~/test$ cpp cpptest.c
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;
"string 1" "string 2"
"string 3"
So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?
Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.
P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python-based C parser should handle string literal concatenation (which it doesn't do at the moment), or leave it to cpp, which it assumes runs before it.

The standard doesn't specify a preprocessor vs. a compiler, it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker -- but none of that is required by the standard.

Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.
Edit:
Your "I.e." link at the beginning of the post answers the question:
Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...
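As a small illustration of that interplay between macros and concatenation (my sketch, not from the linked text; the STR/XSTR helper names are my own):
#include <stdio.h>

#define XSTR(x) #x
#define STR(x)  XSTR(x)

int main(void)
{
    /* __FILE__, ":", the stringised __LINE__ and the message are separate
       string literals until translation phase 6 glues them into one. */
    puts(__FILE__ ":" STR(__LINE__) ": Hello world");
    return 0;
}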

In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):
5.1.1.2 Translation phases
...
4. Preprocessing directives are executed and macro invocations are expanded. ...
5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.
6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
The standard does not define that the implementation must use a pre-processor and compiler, per se.
Step 4 is clearly a preprocessor responsibility.
Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.

I would handle it in the scanning/tokenizing part of the parser, so in the compiler. It seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, so macros can generate uncompilable code. It handles nothing more than what it is entitled to handle: directives that are specifically addressed to it (# ...) and their "consequences" (e.g. a #define x h makes the preprocessor replace occurrences of x with h).
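To make that concrete, here is a minimal sketch (mine, not part of the original answer) of how a scanner might merge adjacent string-literal tokens once escape sequences have already been decoded; Token, TK_STRING and merge_adjacent_strings are hypothetical names:
#include <stdlib.h>
#include <string.h>

enum { TK_STRING = 1 /* other token kinds omitted */ };

typedef struct Token {
    int kind;               /* TK_STRING, identifiers, punctuators, ... */
    char *text;             /* decoded contents for TK_STRING tokens    */
    struct Token *next;
} Token;

/* Collapse each run of adjacent TK_STRING tokens into a single token.
   Assumes escape sequences were decoded earlier (phase 5 precedes phase 6). */
void merge_adjacent_strings(Token *tok)
{
    for (; tok != NULL; tok = tok->next) {
        while (tok->kind == TK_STRING && tok->next && tok->next->kind == TK_STRING) {
            Token *nxt = tok->next;
            size_t a = strlen(tok->text), b = strlen(nxt->text);
            char *joined = malloc(a + b + 1);
            if (joined == NULL)
                abort();
            memcpy(joined, tok->text, a);
            memcpy(joined + a, nxt->text, b + 1);  /* copies the trailing NUL too */
            free(tok->text);
            tok->text = joined;
            tok->next = nxt->next;                 /* splice out the merged token */
            free(nxt->text);
            free(nxt);
        }
    }
}
A real lexer would track an explicit length instead of relying on strlen, since decoded string contents may contain embedded NUL bytes (as the "\0" examples later on this page show).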

There are tricky rules for how string literal concatenation interacts with escape sequences.
Suppose you have
const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";
then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
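Here is a small self-check of that claim (my addition; it reuses the x1/y1/x2/y2 names from the snippet above):
#include <assert.h>
#include <string.h>

int main(void)
{
    const char x1[] = "a\15" "4";  /* "\15" decodes to '\r' before concatenation */
    const char y1[] = "a\154";     /* "\154" is octal for 'l' */
    const char x2[] = "a\r4";
    const char y2[] = "al";

    assert(strcmp(x1, x2) == 0);
    assert(strcmp(y1, y2) == 0);
    return 0;
}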

Related

Ensure every string literal is wrapped inside macro

I want to make sure every string literal in my project is wrapped with a macro, and I'd like some external tool to report the location of any string literal that is not wrapped in a macro.
Is there any way I could use Clang plugins to ensure that every string literal is wrapped inside a macro?
Cases I want to handle:
#define MY_ASSERT(Y) {if(!(Y)) throw Exception(#Y); }
The #Y should be flagged as an unwrapped string literal.
"a" "b" "c"
It should require that the whole thing be inside a macro, like this:
MY_STR("a" "b" "c")
How could I do that with a Clang plugin, or is there another way in general to do it?
You could do that with the DMS Software Reengineering Toolkit and its C++ front end.
DMS can read source code according to an explicit grammar definition of C++ (it handles C++17 in GCC and MS dialects), build an AST, apply supplied rewrite rules to modify the tree, and then prettyprint the AST back to source text, preserving comments, text alignments, number radixes, etc.
To do this, you need just one DMS rule (see DMS Rewrite Rules for details):
rule wrap_string_in_macro(s:string_literal):primary_expression->primary_expression
= "\s" -> " my_macro_name(\s) ";
The nonterminal string_literal covers the wide variety of C++ strings (8-bit, ISO, wide, raw, sequences of strings, ...) so you don't have to worry about them; this rule will pick them up. But your macro might need to worry about those, so you could arguably write a larger set of rules to specialize the macro call:
rule wrap_ISO_string_in_macro(s:ISO_STRING_LITERAL):primary_expression->primary_expression
= "\s" -> " my_macro_name_for_ISO_string(\s) ";
rule wrap_wide_string_in_macro(s:WIDE_STRING_LITERAL):primary_expression->primary_expression
= "\s" -> " my_macro_name_for_wide_string(\s) ";
...
These rules will pick up individual strings, but that leaves the problem of handling sequences of strings:
rule wrap_ISO_string_list_in_macro(seq: string_literal_list,s:ISO_STRING_LITERAL):primary_expression->primary_expression
= " \seq \s" -> " my_macro_name_for_ISO_string_list(\seq \s) ";
...

Compilation of string literals

Why can two string literals separated by a space, tab or "\n" be compiled without an error?
int main()
{
char * a = "aaaa" "bbbb";
}
"aaaa" is a char*
"bbbb" is a char*
There is no specific concatenation rule to process two string literals. And obviously the following code gives an error during compilation:
#include <iostream>
int main()
{
char * a = "aaaa";
char * b = "bbbb";
std::cout << a b;
}
Is this concatenation common to all compilers? Where is the null termination of "aaaa"? Is "aaaabbbb" a continuous block of RAM?
If you look at e.g. this translation phase reference, phase 6 says:
Adjacent string literals are concatenated.
And that's exactly what happens here. You have two adjacent string literals, and they are concatenated into a single string literal.
It is standard behavior.
It only works for string literals, not two pointer variables, as you noticed.
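For two pointer variables you would have to concatenate at run time instead; a minimal C sketch of the idea (my addition, not part of the original answer):
#include <stdio.h>

int main(void)
{
    const char *a = "aaaa";
    const char *b = "bbbb";
    char joined[16];

    /* Run-time concatenation of the two strings the pointers refer to. */
    snprintf(joined, sizeof joined, "%s%s", a, b);
    puts(joined);   /* prints: aaaabbbb */
    return 0;
}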
In this statement
char * a = "aaaa" "bbbb";
the compiler, in a step of compilation that comes before syntax analysis, treats adjacent string literals as one literal.
So for the compiler the above statement is equivalent to
char * a = "aaaabbbb";
that is, the compiler stores only one string literal, "aaaabbbb".
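A small check (my addition; the printed values assume a typical hosted implementation) that the result is one contiguous 9-byte array with a single terminating NUL:
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char a[] = "aaaa" "bbbb";
    printf("%zu %zu\n", sizeof a, strlen(a));  /* prints: 9 8 */
    printf("%c\n", a[4]);                      /* 'b': no NUL in the middle */
    return 0;
}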
Adjacent string literals are concatenated as per the rules of C (and C++) standard. But no such rule exists for adjacent identifiers (i.e. variables a and b).
To quote, C++14 (N3797 draft), § 2.14.5:
In translation phase 6 (2.2), adjacent string literals are
concatenated. If both string literals have the same encoding-prefix,
the resulting concatenated string literal has that encoding-prefix. If
one string literal has no encoding-prefix, it is treated as a string
literal of the same encoding-prefix as the other operand. If a UTF-8
string literal token is adjacent to a wide string literal token, the
program is ill-formed. Any other concatenations are
conditionally-supported with implementation-defined behavior.
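A short illustration of that encoding-prefix rule (my example, not from the quoted passage):
#include <wchar.h>

const wchar_t *w = L"wide " "and narrow";   /* unprefixed piece adopts the L prefix */
/* const void *bad = u8"utf-8 " L"wide";       ill-formed: UTF-8 adjacent to wide   */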
C and C++ compile adjacent string literals into a single string literal. For example this:
"Some text..." "and more text"
is equivalent to:
"Some text...and more text"
That is for historical reasons:
The original C language was designed in 1969-1972 when computing was still dominated by the 80 column punched card. Its designers used 80 column devices such as the ASR-33 Teletype. These devices did not automatically wrap text, so there was a real incentive to keep source code within 80 columns. Fortran and Cobol had explicit continuation mechanisms to do so, before they finally moved to free format.
It was a stroke of brilliance for Dennis Ritchie (I assume) to realise that there was no ambiguity in the grammar and that long ASCII strings could be made to fit into 80 columns by the simple expedient of getting the compiler to concatenate adjacent literal strings. Countless C programmers were grateful for that small feature.
Once the feature is in, why would it ever be removed? It causes no grief and is frequently handy. I for one wish more languages had it. The modern trend is to have extended strings with triple quotes or other symbols, but the simplicity of this feature in C has never been outdone.
Similar question here.
String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).
(source)

What are the definitions for valid and invalid pp-tokens?

I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.
If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.
The order of evaluation of preprocessor operators is in general not defined (*) in C90 (see The token pasting operator). Now, in some cases (according to several sources, including the MISRA Committee and the referenced page), the order in which multiple ##/# operators are applied influences whether undefined behavior occurs. But I have a really hard time understanding the examples from these sources and pinning down the common rule.
So my questions are:
What are the rules for valid pp-tokens?
Are there differences between the different C and C++ standards?
My current problem: Is the following legal under both operator orders? (**)
#define test(A) test_## A ## _THING
int test(0001) = 2;
Comments:
(*) I don't say "is undefined" because, IMHO, this has nothing to do with undefined behavior yet, but rather with unspecified behavior. Applying more than one ## or # operator does not in general render the program erroneous. There is obviously an order — we just can't predict which — so the order is unspecified.
(**) This is no actual application for the numbering. But the pattern is equivalent.
What are the rules for valid pp-tokens?
These are spelled out in the respective standards; C11 §6.4 and C++11 §2.4. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.
With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)
In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprise multiple preprocessing-tokens.
Are there differences between C and C++?
Yes, there are slight differences:
C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.
C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)
C allows hexadecimal floating point literals (eg. 1.1p-3), which are not valid preprocessing-tokens in C++.
C++ allows apostrophes to be used in integer literals as separators (1'000'000'000). In C, this would probably result in unmatched apostrophes.
There are minor differences in the handling of universal character names (eg. \u0234).
In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there are no exceptions to the longest-match rule; <:: is always tokenised using the longest-match rule, so the first token will always be <:.
Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?
Yes, that is legal in both languages.
test_ ## 0001 => test_0001 (identifier)
test_0001 ## _THING => test_0001_THING (identifier)
0001 ## _THING => 0001_THING (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
What are examples of invalid token concatenation?
Suppose we have
#define concat3(a, b, c) a ## b ## c
Now, the following are invalid regardless of concatenation order:
concat3(., ., .)
.. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.
concat3(27,e,-7)
Here, -7 is two tokens, so it cannot be concatenated.
And here is a case in which concatenation order matters:
concat3(27e, -, 7)
If this is concatenated left-to-right, it will result in 27e- ## 7, which is the concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token.
What exactly is a pp-number?
In general terms, the pp-number category covers a superset of the tokens which might be converted into (single) numeric literals, together with sequences which might turn out to be invalid. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:
It starts with a decimal digit or a period (.) followed by a decimal digit.
The rest of the token is letters, numbers and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages; and also P and p in C since C99.
In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
Once preprocessing is terminated, all pp-numbers will be converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.
#define test(A) test_## A ## _THING
int test(0001) = 2;
This is legal with both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessing-tokens. The former is an identifier, while the latter is a pp-number; pp-numbers are not checked for suffix correctness until a later stage of compilation; think of e.g. 0001u, an unsigned octal literal.
A few examples to show that the order of evaluation does matter:
#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c) a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3); // OK LTR, invalid RTL
#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:); // invalid LTR, OK RTL
If your compiler uses a consistent order of evaluation (either LTR or RTL) and presents an error on generation of an invalid pp-token, then precisely one of these lines will generate an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.
The second example is rather contrived; because of the way the grammar is constructed it's very difficult to find a pp-token that is valid when built RTL but not when built LTR.
There are no significant differences between C and C++ in this regard; the two standards have identical language (up to section headers) describing the process of macro replacement. The only way the language could influence the process would be in the valid preprocessing-tokens: C++ (especially recently) has more forms of valid preprocessing-tokens, such as user-defined string literals.

C11 Compilation. Phase of translation #1 and #5. Universal character names

I'm trying to understand Universal Character Names in the C11 standard and found that the N1570 draft of the C11 standard has much less detail than the C++11 standard with respect to Translation Phases 1 and 5 and the formation and handling of UCNs within them. This is what each has to say:
Translation Phase 1
N1570 Draft C11 5.1.1.2p1.1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
C++11 2.2p1.1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
Translation Phase 5
N1570 Draft C11 5.1.1.2p1.5:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; [...]
C++ 2.2p1.5:
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [...]
(emphasis was added on differences)
The Questions
1. In the C++11 standard, it is very clear that source file characters not in the basic source character set are converted to UCNs, and that they are treated exactly as would have been a UCN in that same place, with the sole exception of raw-strings. Is the same true of C11? When a C11 compiler sees a multi-byte UTF-8 character such as °, does it too translate this to \u00b0 in phase 1, and treat it just as if \u00b0 had appeared there instead?
2. To put it in a different way: at the end of which translation phase, if any, are the following snippets of code transformed into textually equivalent forms for the first time in C11?
const char* hell° = "hell°";
and
const char* hell\u00b0 = "hell\u00b0";
3. If in 2. the answer is "in none", then during which translation phase are those two identifiers first understood to refer to the same thing, despite being textually different?
4. In C11, are UCNs in character/string literals also converted in phase 5? If so, why is this omitted from the draft standard?
5. How are UCNs in identifiers (as opposed to in character/string literals as already mentioned) handled in both C11 and C++11? Are they also converted in phase 5? Or is this something implementation-defined? Does GCC, for instance, print out such identifiers in UCN-coded form, or in actual UTF-8?
Comments turned into an answer
Interesting question!
The C standard can leave more of the conversions unstated because they are implementation-defined (and C has no raw strings to confuse the issue).
What it says in the C standard is sufficient — except that it leaves your question 1 unanswerable.
Q2 has to be 'Phase 5', I think, with caveats about it being 'the token stream is equivalent'.
Q3 is strictly N/A, but Phase 7 is probably the answer.
Q4 is 'yes', and it says so because it mentions 'escape sequences' and UCNs are escape sequences.
Q5 is 'Phase 5' too.
Can the C++11-mandated processes in Phase 1 and 5 be taken as compliant within the wording of C11 (putting aside raw strings)?
I think they are effectively the same; the difference arises primarily from the raw literal issue, which is specific to C++. Generally, the C and C++ standards try not to make things gratuitously different, and in particular try to keep the workings of the preprocessor and the low-level character parsing the same in both (which has been easier since C99 added support for C++ // comments, but which evidently got harder with the addition of raw literals to C++11).
One day, I'll have to study the raw literal notations and their implications more thoroughly.
First, please note that this distinction has existed since 1998; UCNs were first introduced in C++98, a new standard at the time (ISO/IEC 14882, 1st edition, 1998), and then made their way into the C99 revision of the C standard; but the C committee (and existing implementers, and their pre-existing implementations) did not feel the C++ way was the only way to achieve the trick, particularly with corner cases and the use of character sets smaller than, or simply different from, Unicode; for example, the requirement to ship mapping tables from every supported encoding to Unicode was a preoccupation for C vendors in 1998.
The C standard (consciously) avoids deciding this, and lets the compiler choose how to proceed. While your reasoning obviously takes place in the context of UTF-8 character sets used for both source and execution, there is a large (and pre-existing) range of different C99/C11 compilers available which use different sets; and the committee felt it should not restrict implementers too much on this issue. In my experience, most compilers keep the two forms distinct in practice (for performance reasons).
Because of this freedom, some compilers can have the two forms identical after phase 1 (as a C++ compiler must), while others can leave them distinct as late as phase 7 for the first ° character (the one in the identifier); the second ° character (the one in the string) ought to be the same after phase 5, assuming the ° character is part of the extended execution character set supported by the implementation.
For the other answers, I won't add anything to Jonathan's.
As for your additional question about whether the more deterministic C++-mandated process is Standard-C-compliant: it is clearly a goal to be so; and if you find a corner case which shows otherwise (a C++11-compliant preprocessor which would not conform to the C99 and C11 standards), then you should consider asking the WG14 committee about a potential defect.
Obviously, the reverse is not true: it is possible to write a pre-processor with handling of UCN which complies to C99/C11 but not to the C++ standards; the most obvious difference is with
#define str(t) #t
#define str_is(x, y) const char * x = y " is " str(y)
str_is(hell°, "hell°");
str_is(hell\u00B0, "hell\u00B0");
which a C-compliant preprocessor can render in the same way as your examples (and most do so) and which, as such, will have distinct renderings; but I am under the impression that a C++-compliant preprocessor is required to transform them into the (strictly equivalent)
const char* hell° = "hell°" " is " "\"hell\\u00b0\"";
const char* hell\u00b0 = "hell\\u00b0" " is " "\"hell\\u00b0\"";
Last, but not least, I believe not many compilers are fully compliant at this level of detail!

C++ Preprocessor string literal concatenation

I found this regarding how the C preprocessor should handle string literal concatenation (phase 6). However, I can not find anything regarding how this is handled in C++ (does C++ use the C preprocessor?).
The reason I ask is that I have the following:
const char * Foo::encoding = "\0" "1234567890\0abcdefg";
where encoding is a static member of class Foo. Without the availability of concatenation I wouldn't be able to write that sequence of characters like that.
const char * Foo::encoding = "\01234567890\0abcdefg";
is something entirely different, due to the way \012 is interpreted.
I don't have access to multiple platforms, and I'm curious how confident I should be that the above is always handled correctly - i.e. that I will always get { 0, '1', '2', '3', ... }.
The language (C as well as C++) has no "preprocessor". The "preprocessor", as a separate functional unit, is an implementation detail. The way the source file(s) are handled is defined by the so-called phases of translation. One of the phases, in C as well as in C++, involves concatenating string literals.
In the C++ language standard this is described in 2.1. For C++ (C++03) it is phase 6:
6. Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.
Yes, it will be handled as you describe, because it is in stage 5 that,
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set (C99 §5.1.1.2/1)
The language in C++03 is effectively the same:
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (C++03 §2.1/5)
So, escape sequences (like \0) are converted into members of the execution character set in stage five, before string literals are concatenated in stage six.
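A quick way to convince yourself of the difference between the two initialisations discussed above (my addition; the commented output assumes an ASCII-based execution character set):
#include <stdio.h>

static const char concat[] = "\0" "1234567890\0abcdefg";
static const char single[] = "\01234567890\0abcdefg";

int main(void)
{
    /* "\0" is decoded in stage 5, then the literals are glued in stage 6,
       so concat really begins { 0, '1', '2', '3', ... }. */
    printf("%d %c\n", concat[0], concat[1]);  /* prints: 0 1 */

    /* In single, the escape \012 (octal 12, i.e. 10) is one character,
       so the array begins { 10, '3', '4', ... }. */
    printf("%d %c\n", single[0], single[1]);  /* prints: 10 3 */
    return 0;
}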
Yes, this concatenation is common to all conforming compilers, because the C++ and C standards agree on it. Most, if not all, C++ implementations use a C preprocessor, so yes, C++ uses the C preprocessor.