Are trigraph substitutions reverted when a raw string is created through concatenation? - c++

It's pretty common to use macros and token concatenation to switch between wide and narrow strings at compile time.
#define _T(x) L##x
const wchar_t *wide1 = _T("hello");
const wchar_t *wide2 = L"hello";
And in C++11 it should be valid to concoct a similar thing with raw strings:
#define RAW(x) R##x
const char *raw1 = RAW("(Hello)");
const char *raw2 = R"(Hello)";
Since macro expansion and token concatenation happen before escape-sequence substitution, this should prevent escape sequences from being replaced in the quoted string.
But how does this apply to trigraphs? Are raw strings formed through concatenation with normal strings still subject to having their trigraph substitutions reverted?
const char *trigraph = RAW("(??=)"); // Is this "#" or "??="?

No, the trigraph is not reverted in your example.
[lex.phases]p1 identifies three phases of translation relevant to your question:
1. Trigraph sequences are replaced by corresponding single-character internal representations.
3. The source file is decomposed into preprocessing tokens.
4. Macro invocations are expanded.
Phase 1 is defined by [lex.trigraph]p1. At this stage, your code is translated to const char *trigraph = RAW("(#)").
Phase 3 is defined by [lex.pptoken]. This is the stage where trigraphs are reverted in raw string literals. Paragraph 3 says:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
That is not the case in your example, therefore the trigraph is not reverted. Your code is transformed into the preprocessing-token sequence const char * trigraph = RAW ( "(#)" )
Finally, in phase 4, the RAW macro is expanded and the token-paste occurs, resulting in the following sequence of preprocessing-tokens: const char * trigraph = R"(#)". The r-char-sequence of the string literal comprises a #. Phase 3 has already occurred, and there is no other point at which reversion of trigraphs occurs.

Trigraph substitution happens before macro processing.
UPD Please disregard this. I hadn't realized that C++0x reverts trigraphs in raw string literals.
UPD2 2.5.3 describes the process of forming raw-string-literal preprocessing tokens. Trigraph reversal is part of that process, and there are no raw-string-literals that are not preprocessing tokens. So the answer to your question seems to be yes.


Why paired comment can't be placed inside a string in c++?

Normally anything inside /* and */ is considered as a comment.
But the statement,
std::cout << "not-a-comment /* comment */";
prints not-a-comment /* comment */ instead of not-a-comment.
Why does this happen? Are there any other places in c++ where I can't use comments?
This is a consequence of the maximum munch principle, a lexing rule that the C++ language follows. When processing a source file, translation is divided into (logical) phases. During phase 3, we get preprocessing tokens:
[lex.phases]
1.3 The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or in a partial
comment. Each comment is replaced by one space character. New-line
characters are retained.
Turning comments into white space is done in that same phase. Now, a string literal is a pp-token:
[lex.pptoken]
preprocessing-token:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
As are other literals. And the maximum munch principle tells us that:
3 If the input stream has been parsed into preprocessing tokens up to a
given character:
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that
would cause further lexical analysis to fail, except that a
header-name is only formed within a #include directive.
So because preprocessing found the opening ", it must keep looking for the longest sequence of characters that will make a valid pp-token (in this case, the token is a string literal). This sequence ends at the closing ". That's why it can't stop and handle the comment, because it is obligated to consume up to the closing quotation mark.
Following these rules you can pinpoint the places where comments will not be treated as comments by the preprocessor.
Why does this happen?
Because the comment becomes part of the string literal (everything between the "" double quotes).
Are there any other places in c++ where I can't use comments?
Yes, the same applies for character literals (using '' single quotes).
You can think of it as single and double quotes having higher precedence than the comment delimiters /* */.

Compilation of string literals

Why can two string literals separated by a space, tab or "\n" be compiled without an error?
int main()
{
    char * a = "aaaa" "bbbb";
}
"aaaa" is a char*
"bbbb" is a char*
There is no specific concatenation rule to process two string literals. And obviously the following code gives an error during compilation:
#include <iostream>
int main()
{
    char * a = "aaaa";
    char * b = "bbbb";
    std::cout << a b;
}
Is this concatenation common to all compilers? Where is the null termination of "aaaa"? Is "aaaabbbb" a continuous block of RAM?
If you look at e.g. this translation phase reference, in phase 6 it says:
Adjacent string literals are concatenated.
And that's exactly what happens here. You have two adjacent string literals, and they are concatenated into a single string literal.
It is standard behavior.
It only works for string literals, not two pointer variables, as you noticed.
In this statement
char * a = "aaaa" "bbbb";
the compiler, in a step of compilation before syntax analysis, considers adjacent string literals as one literal.
So for the compiler the above statement is equivalent to
char * a = "aaaabbbb";
that is, the compiler stores only one string literal, "aaaabbbb".
Adjacent string literals are concatenated as per the rules of C (and C++) standard. But no such rule exists for adjacent identifiers (i.e. variables a and b).
To quote, C++14 (N3797 draft), § 2.14.5:
In translation phase 6 (2.2), adjacent string literals are
concatenated. If both string literals have the same encoding-prefix,
the resulting concatenated string literal has that encoding-prefix. If
one string literal has no encoding-prefix, it is treated as a string
literal of the same encoding-prefix as the other operand. If a UTF-8
string literal token is adjacent to a wide string literal token, the
program is ill-formed. Any other concatenations are
conditionally-supported with implementation-defined behavior.
Both C and C++ compile adjacent string literals as a single string literal. For example this:
"Some text..." "and more text"
is equivalent to:
"Some text...and more text"
That is for historical reasons:
The original C language was designed in 1969-1972 when computing was still dominated by the 80 column punched card. Its designers used 80 column devices such as the ASR-33 Teletype. These devices did not automatically wrap text, so there was a real incentive to keep source code within 80 columns. Fortran and Cobol had explicit continuation mechanisms to do so, before they finally moved to free format.
It was a stroke of brilliance for Dennis Ritchie (I assume) to realise that there was no ambiguity in the grammar and that long ASCII strings could be made to fit into 80 columns by the simple expedient of getting the compiler to concatenate adjacent literal strings. Countless C programmers were grateful for that small feature.
Once the feature is in, why would it ever be removed? It causes no grief and is frequently handy. I for one wish more languages had it. The modern trend is to have extended strings with triple quotes or other symbols, but the simplicity of this feature in C has never been outdone.
Similar question here.
String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).
(source)

Concatenation and the standard

According to this page "A ## operator between any two successive identifiers in the replacement-list runs parameter replacement on the two identifiers". That is, the preprocessor operator ## acts on identifiers. Microsoft's page says ", each occurrence of the token-pasting operator in token-string is removed, and the tokens preceding and following it are concatenated". That is, the preprocessor operator ## acts on tokens.
I have looked for a definition of an identifier and/or token and the most I have found is this link: "An identifier is an arbitrary long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters. A valid identifier must begin with a non-digit character".
According to that definition, the following macros should not work (on two counts):
#define PROB1(x) x##0000
#define PROB2(x,y) x##y
int PROB1(z) = PROB2( 1, 2 * 3 );
Does the standard have some rigorous definitions regarding ## and the objects it acts on? Or, is it mostly 'try and see if it works' (a.k.a. implementation defined)?
The standard is extremely precise, both about what can be concatenated, and about what a valid token is.
The en.cppreference.com page is imprecise; what are concatenated are preprocessing tokens, not identifiers. The Microsoft page is much closer to the standard, although it omits some details and fails to distinguish "preprocessing token" from "token", which are slightly different concepts.
What the standard actually says (§16.3.3/3):
For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an
argument) is deleted and the preceding preprocessing token is concatenated with the following preprocessing token.…
For reference, "preprocessing token" is defined in §2.4 to be one of the following:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
Most of the time, the tokens to be combined are identifiers (and numbers), but it is quite possible to generate a multicharacter token by concatenating individual characters. (Given the last item in the list of possible preprocessing tokens, any single non-whitespace character is a preprocessing token, even if it is not a letter, digit or standard punctuation symbol.)
The result of a concatenation must be a preprocessing token:
If the result is not a valid preprocessing token, the behavior is undefined. The resulting token is available for further macro replacement.
Note that the replacement of a function-like macro's argument names with the actual arguments may result in the argument name being replaced by 0 tokens or more than one token. If that argument is used on either side of a concatenation operator:
In the case that the actual argument had zero tokens, nothing is concatenated. (The Microsoft page implies that the concatenation operator will concatenate whatever tokens end up preceding and following it.)
In the case that the actual argument has more than one token, the one which is concatenated is the one which precedes or follows the concatenation operator.
As an example of the last case, remember that -42 is two preprocessing tokens (and two tokens, after preprocessing): - and 42. Consequently, although you can concatenate the pp-number 42E with the pp-number 3, resulting in the pp-number (and valid token) 42E3, you cannot create the token 42E-3 from 42E and -3, because only the - would be concatenated, resulting in the two pp-tokens 42E- and 3. (The first of these is a valid preprocessing token, but it cannot be converted into a valid token, so a tokenization error will be reported.)
In a sequence of concatenations:
#define concat3(a,b,c) a ## b ## c
the order of concatenations is not defined. So it is unspecified whether concat3(42E,-,3) is valid; if the first two tokens are concatenated first, all is well, but if the second two are concatenated first, the result is not a valid preprocessing token. On the other hand, concat3(.,.,.) must be an error, because .. is not a valid token, and so neither a##b nor b##c can be processed. So it is impossible to produce the token ... with concatenation.

Why does stringizing a euro sign within a string literal using UTF-8 not produce a UCN?

The spec says that at phase 1 of compilation
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
And at phase 4 it says
Preprocessing directives are executed, macro invocations are expanded
At phase 5, we have
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set
For the # operator, we have
a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).
Hence I conducted the following test
#define GET_UCN(X) #X
GET_UCN("€")
With an input character set of UTF-8 (matching my file's encoding), I expected the following preprocessing result of the #X operation: "\"\\u20AC\"". GCC, Clang and boost.wave don't transform the € into a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you please explain?
It's simply a bug. §2.1/1 says about Phase 1,
(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
This program clearly demonstrates the malfunction:
#include <iostream>
#define GET_UCN(X) L ## #X
int main() {
    std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}
http://ideone.com/lb9jc
Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
"and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set"
used to be
"or universal-character-name in character literals and string literals is converted to a member of the execution character set"
Maybe you need a future version of g++.
I'm not sure where you got that citation for translation phase 1—the C99 standard says this about translation phase 1 in §5.1.1.2/1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
So in this case, the Euro character € (represented as the multibyte sequence E2 82 AC in UTF-8) is mapped into the source character set, which in this implementation also happens to be UTF-8, so its representation remains the same. It doesn't get converted into a universal character name because, well, there's nothing that says that it should.
I suspect you'll find that the euro sign does not satisfy the condition Any source file character not in the basic source character set so the rest of the text you quote doesn't apply.
Open your test file with your favourite binary editor and check what value is used to represent the euro sign in GET_UCN("€").

C++ Preprocessor string literal concatenation

I found this regarding how the C preprocessor should handle string literal concatenation (phase 6). However, I cannot find anything regarding how this is handled in C++ (does C++ use the C preprocessor?).
The reason I ask is that I have the following:
const char * Foo::encoding = "\0" "1234567890\0abcdefg";
where encoding is a static member of class Foo. Without the availability of concatenation I wouldn't be able to write that sequence of characters like that.
const char * Foo::encoding = "\01234567890\0abcdefg";
is something entirely different, due to the way \012 is interpreted.
I don't have access to multiple platforms and I'm curious how confident I should be that the above is always handled correctly, i.e. that I will always get { 0, '1', '2', '3', ... }.
The language (C as well as C++) has no "preprocessor". "Preprocessor", as a separate functional unit, is an implementation detail. The way the source file is handled is defined by the so-called phases of translation. One of the phases, in C as well as in C++, involves concatenating string literals.
In the C++ language standard this is described in 2.1. For C++ (C++03) it is phase 6:
6. Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.
Yes, it will be handled as you describe, because it is in stage 5 that:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set (C99 §5.1.1.2/1)
The language in C++03 is effectively the same:
Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (C++03 §2.1/5)
So, escape sequences (like \0) are converted into members of the execution character set in stage five, before string literals are concatenated in stage six.
It works because the C++ and C standards agree here. Most, if not all, C++ implementations use a C-style preprocessor, so yes, in that sense C++ uses the C preprocessor.