Compilation of string literals - C++

Why can two string literals separated by a space, a tab, or a newline be compiled without an error?
int main()
{
char * a = "aaaa" "bbbb";
}
"aaaa" is a char*
"bbbb" is a char*
There is no specific concatenation rule to process two string literals. And obviously the following code gives an error during compilation:
#include <iostream>
int main()
{
char * a = "aaaa";
char * b = "bbbb";
std::cout << a b;
}
Is this concatenation common to all compilers? Where is the null termination of "aaaa"? Is "aaaabbbb" a continuous block of RAM?

If you look at, e.g., this translation phases reference, phase 6 says:
Adjacent string literals are concatenated.
And that's exactly what happens here. You have two adjacent string literals, and they are concatenated into a single string literal.
It is standard behavior.
It only works for string literals, not two pointer variables, as you noticed.
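As a small illustration of why the rule is handy (a sketch; GREETING is a made-up macro, not from the question), adjacent literals let you split a long message across tokens and glue macro-produced text onto a literal:
#include <cstdio>

#define GREETING "Hello, "

int main()
{
    // Three adjacent literals (one produced by macro expansion) are merged
    // into a single literal before the expression is parsed.
    const char *msg = GREETING "wor" "ld";
    std::puts(msg); // prints: Hello, world
}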

In this statement
char * a = "aaaa" "bbbb";
the compiler, in a step of compilation that runs before syntax analysis, treats adjacent string literals as a single literal.
So for the compiler the above statement is equivalent to
char * a = "aaaabbbb";
That is, the compiler stores only one string literal, "aaaabbbb".
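This also answers where the null terminator of "aaaa" goes: there isn't one inside the merged literal. A quick check (a sketch; static_assert needs C++11) shows the result is a single contiguous 9-byte array, eight characters plus one terminating '\0':
int main()
{
    // "aaaa" "bbbb" is one array object: {'a','a','a','a','b','b','b','b','\0'}
    static_assert(sizeof("aaaa" "bbbb") == 9, "a single contiguous literal");
}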

Adjacent string literals are concatenated as per the rules of the C (and C++) standards. But no such rule exists for adjacent identifiers (i.e. the variables a and b).
To quote the C++14 draft (N3797), § 2.14.5:
In translation phase 6 (2.2), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior.
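A short sketch of the prefix rule just quoted (an assumed example, not from the answer): an unprefixed literal adjacent to a wide one takes on the wide prefix, while mixing u8 and L is ill-formed:
int main()
{
    const wchar_t *w = L"abc" "def";     // "def" is treated as if it were L"def"
    // const void *bad = u8"abc" L"def"; // ill-formed: UTF-8 adjacent to wide
    (void)w;
}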

C and C++ compile adjacent string literals as a single string literal. For example, this:
"Some text..." "and more text"
is equivalent to:
"Some text...and more text"
That is for historical reasons:
The original C language was designed in 1969-1972 when computing was still dominated by the 80 column punched card. Its designers used 80 column devices such as the ASR-33 Teletype. These devices did not automatically wrap text, so there was a real incentive to keep source code within 80 columns. Fortran and Cobol had explicit continuation mechanisms to do so, before they finally moved to free format.
It was a stroke of brilliance for Dennis Ritchie (I assume) to realise that there was no ambiguity in the grammar and that long ASCII strings could be made to fit into 80 columns by the simple expedient of getting the compiler to concatenate adjacent literal strings. Countless C programmers were grateful for that small feature.
Once the feature is in, why would it ever be removed? It causes no grief and is frequently handy. I for one wish more languages had it. The modern trend is to have extended strings with triple quotes or other symbols, but the simplicity of this feature in C has never been outdone.

String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).

Related

Difference between L"" and u8""

Is there any difference between the following?
auto s1 = L"你好";
auto s2 = u8"你好";
Are s1 and s2 referring to the same type?
If not, what's the difference and which one is preferred?
They are not the same type.
s2 is a UTF-8 string literal, which is a kind of narrow string literal. The C++11 draft standard, section 2.14.5 String literals, paragraph 7 says:
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
And paragraph 8 says:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
s1 is a wide string literal which can support UTF-16 and UTF-32. Section 2.14.5 String literals paragraph 11 says:
A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
See UTF8, UTF16, and UTF32 for a good discussion on the differences and advantages of each.
A quick way to determine types is to use typeid:
std::cout << typeid(s1).name() << std::endl;
std::cout << typeid(s2).name() << std::endl;
On my system this is the output:
PKw
PKc
Checking each of these with c++filt -t gives me:
wchar_t const*
char const*
L"" creates a null-terminated string, of type const wchar_t[]. This is valid in C++03. (Note that wchar_t refers to an implementation-dependent "wide-character" type).
u8"" creates a null-terminated UTF-8 string, of type const char[]. This is valid only in C++11.
Which one you choose is strongly dependent on what needs you have. L"" works in C++03, so if you need to work with older code (which may need to be compiled with a C++03 compiler), you'll need to use that. u8"" is easier to work with in many circumstances, particularly when the system in question normally expects char * strings.
The first is a wide character string, which might be encoded as UTF-16 or UTF-32, or something else entirely (though Unicode is now common enough that a completely different encoding is pretty unlikely).
The second is a string of narrow characters using UTF-8 encoding.
As to which is preferred: it'll depend on what you're doing, what platform you're coding for, etc. If you're mostly dealing with something like a web page/URL that's already encoded as UTF-8, and you'll probably just read it in, possibly verify its content, and later echo it back, it may well make sense to store it as UTF-8 as well.
Wide character strings vary by platform. If, for one example, you're coding for Windows, and a lot of the code interacts directly with the OS (which uses UTF-16) then storing your strings as UTF-16 can make a great deal of sense (and that's what Microsoft's compiler uses for wide character strings).
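A minimal sketch (an assumed example, C++11, UTF-8 encoded source file) confirming that the two literals have distinct types; note that since C++20, u8"" yields const char8_t* rather than const char*:
#include <type_traits>

int main()
{
    auto s1 = L"你好";   // deduces const wchar_t*
    auto s2 = u8"你好";  // deduces const char* (const char8_t* from C++20 on)
    static_assert(!std::is_same<decltype(s1), decltype(s2)>::value,
                  "wide and UTF-8 string literals have different types");
}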

Are trigraph substitutions reverted when a raw string is created through concatenation?

It's pretty common to use macros and token concatenation to switch between wide and narrow strings at compile time.
#define _T(x) L##x
const wchar_t *wide1 = _T("hello");
const wchar_t *wide2 = L"hello";
And in C++11 it should be valid to concoct a similar thing with raw strings:
#define RAW(x) R##x
const char *raw1 = RAW("(Hello)");
const char *raw2 = R"(Hello)";
Since macro expansion and token concatenation happens before escape sequence substitution, this should prevent escape sequences being replaced in the quoted string.
But how does this apply to trigraphs? Are raw strings formed through concatenation with normal strings still subject to having their trigraph substitutions reverted?
const char *trigraph = RAW("(??=)"); // Is this "#" or "??="?
No, the trigraph is not reverted in your example.
[lex.phases]p1 identifies three phases of translation relevant to your question:
1. Trigraph sequences are replaced by corresponding single-character internal representations.
3. The source file is decomposed into preprocessing tokens.
4. Macro invocations are expanded.
Phase 1 is defined by [lex.trigraph]p1. At this stage, your code is translated to const char *trigraph = RAW("(#)").
Phase 3 is defined by [lex.pptoken]. This is the stage where trigraphs are reverted in raw string literals. Paragraph 3 says:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
That is not the case in your example, therefore the trigraph is not reverted. Your code is transformed into the preprocessing-token sequence const char * trigraph = RAW ( "(#)" )
Finally, in phase 4, the RAW macro is expanded and the token-paste occurs, resulting in the following sequence of preprocessing-tokens: const char * trigraph = R"(#)". The r-char-sequence of the string literal comprises a #. Phase 3 has already occurred, and there is no other point at which reversion of trigraphs occurs.
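The same walkthrough as a sketch you could run, assuming a pre-C++17 compiler in strict ISO mode so that trigraphs are processed (e.g. g++ -std=c++11; trigraphs were removed in C++17):
#include <cassert>
#include <cstring>

#define RAW(x) R##x

int main()
{
    // Phase 1: the trigraph ??= in the source becomes #, giving RAW("(#)").
    // Phase 3: "(#)" is tokenized as an ordinary string literal (no R prefix yet),
    //          so there is nothing to revert.
    // Phase 4: the token paste forms the raw string literal R"(#)".
    const char *trigraph = RAW("(??=)");
    assert(std::strcmp(trigraph, "#") == 0); // "#", not "??="
}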
Trigraph substitution happens before macro processing.
UPD Please disregard this. I hadn't realized that C++0x reverts trigraphs in raw string literals.
UPD2 2.5.3 describes the process of forming raw-string-literal preprocessing tokens. Trigraph reversal is a part of this process. There are no raw-string-literals which are not preprocessing tokens. So the answer to your question seems to be yes.

Why must C/C++ string literal declarations be single-line?

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";
Why must C/C++ string literal declarations be single-line?
The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
Some\n\
Formatted\n\
String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated during translation (phase 6), so these will end up as a single string literal at compile time.
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline string literals? Not really. It can actually be less work for the compiler. And you were going to come back for it the next day anyhow in most cases. In any case, nobody who was filling out a punch card was going to put large amounts of text that wasn't needed in their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.
In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).
Others have mentioned some excellent workarounds, I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.
The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.
Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.
Strings can span multiple lines, but each line has to be quoted individually:
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (a C-like scripting language for the VEDIT text editor) allows multi-line strings:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.
You can also do:
string useMultiple = "this "
"is "
"a string in C.";
Place one literal after another without any special chars.
Literal declarations don't have to be single-line.
GPUImage inlines multiline shader code. Check out its SHADER_STRING macro.

Implementation of string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++
Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.
printf("helloworld.c" ": %d: Hello "
"world\n", 10);
Is equivalent (syntactically) to:
printf("helloworld.c: %d: Hello world\n", 10);
However, the standard doesn't seem to specify which part of the compiler has to handle this - should it be the preprocessor (cpp) or the compiler itself. Some online research tells me that this function is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.
However, running cpp in Linux shows that cpp doesn't do it:
eliben@eliben-desktop:~/test$ cat cpptest.c
int a = 5;
"string 1" "string 2"
"string 3"
eliben@eliben-desktop:~/test$ cpp cpptest.c
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;
"string 1" "string 2"
"string 3"
So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?
Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.
P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python based C parser should handle string literal concatenation (which it doesn't do, at the moment), or leave it to cpp which it assumes runs before it.
The standard doesn't specify a preprocessor vs. a compiler, it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, Phases 5 though 7 in the compiler, and phase 8 the linker -- but none of that is required by the standard.
Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.
Edit:
Your "I.e." link at the beginning of the post answers the question:
Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...
In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):
5.1.1.2 Translation phases
...
4. Preprocessing directives are executed and macro invocations are expanded. ...
5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.
6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
The standard does not define that the implementation must use a pre-processor and compiler, per se.
Step 4 is clearly a preprocessor responsibility.
Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.
I would handle it in the token-scanning part of the parser, so in the compiler; that seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, so that macros can generate uncompilable code. It handles nothing more than what it is entitled to handle: directives specifically addressed to it (# ...) and their "consequences" (such as a #define x h, which makes the preprocessor replace occurrences of x with h).
There are tricky rules for how string literal concatenation interacts with escape sequences.
Suppose you have
const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";
then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
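A small runnable check of that claim (an assumed example, reusing the declarations above):
#include <cassert>
#include <cstring>

int main()
{
    const char x1[] = "a\15" "4"; // "\15" is already '\r' before concatenation
    const char x2[] = "a\r4";
    const char y1[] = "a\154";    // here "\154" is a single character, 'l'
    const char y2[] = "al";
    assert(std::strcmp(x1, x2) == 0);
    assert(std::strcmp(y1, y2) == 0);
}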

C++ Preprocessor string literal concatenation

I found this regarding how the C preprocessor should handle string literal concatenation (phase 6). However, I can not find anything regarding how this is handled in C++ (does C++ use the C preprocessor?).
The reason I ask is that I have the following:
const char * Foo::encoding = "\0" "1234567890\0abcdefg";
where encoding is a static member of class Foo. Without the availability of concatenation I wouldn't be able to write that sequence of characters like that.
const char * Foo::encoding = "\01234567890\0abcdefg";
Is something entirely different due to the way \012 is interpreted.
I don't have access to multiple platforms and I'm curious how confident I should be that the above is always handled correctly - i.e. I will always get { 0, '1', '2', '3', ... }
The language (C as well as C++) has no "preprocessor". The "preprocessor", as a separate functional unit, is an implementation detail. The way the source file(s) is handled is defined by the so-called phases of translation. One of the phases, in C as well as in C++, involves concatenating string literals.
In the C++ language standard it is described in section 2.1. For C++ (C++03) it is phase 6:
6. Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.
Yes, it will be handled as you describe, because it is in stage 5 that,
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set (C99 §5.1.1.2/1)
The language in C++03 is effectively the same:
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (C++03 §2.1/5)
So, escape sequences (like \0) are converted into members of the execution character set in stage five, before string literals are concatenated in stage six.
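As a sanity check, here is a small sketch built from the declaration in the question (moved into main so it is self-contained):
#include <cassert>

int main()
{
    const char *encoding = "\0" "1234567890\0abcdefg";
    assert(encoding[0] == '\0'); // the lone "\0" stays a single NUL
    assert(encoding[1] == '1');  // '1' is not swallowed into an octal escape
    assert(encoding[2] == '2');
}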
It works the same way because the C++ and C standards agree on this point. Most, if not all, C++ implementations use a C preprocessor, so yes, C++ uses the C preprocessor.