[lex.ccon] contains the following definition for c-char:
c-char:
any member of the source character
set
except the
single-quote ’, backslash \, or new-line character
escape-sequence
universal-character-name
Given that the new-line character in C is the escape-sequence \n, isn't there a contradiction in the definition above?
PS: note that the C++ Standard doesn't really define what is a new-line character.
No, when the definition says "new-line character" it means an actual new-line, not the special two-character sequence (backslash and n) that can be interpreted by the compiler as a new-line in special circumstances (inside constant string or character literals).
The C++ Standard does say that the new-line characters are introduced for end-of-line indicators in translation phase 1, the source file is decomposed into preprocessing tokens (A character-literal is a preprocessing token) and sequences of white-space characters (including comments) in translation phase 3, and each escape sequence (\n is a escape sequence) is converted to the corresponding member of the execution character set in translation phase 5.
So it is clearly defined that when forming a character-literal, the character sequence \n is not turned into a new-line character, and the end-of-line indicator (which, as well as all the details of translation phase 1, is implementation-defined, but it is generally agreed that the end-of-line indicator is LF on Unix and Unix-like systems, CR+LF on Windows) has already been turned into the new-line character.
Related
As stated in Phases of translation
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
src file chars are mapped to basic source character set in an impl-def way, I clearly know it. "Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character" , what should I do when I'm supposed to accept a character which is not UNICODE character? Should I map it to a single basic character set member or to a universal character name(futhurmore, convert it to basic source chars seq "/uxxxxx"), both in myown manner? the latter, as I know , is not refered in standard.
I am referring to: Why should text files end with a newline?
One of the answers quotes the C89 standard. Which in brief dictates that a file must end with a new line, which is not immediately preceded by a backslash.
Does that apply to the most recent C++ standard?
#include <iostream>
using namespace std;
int main()
{
cout << "Hello World!" << endl;
return 0;
}
//\
Is the above valid? (Assuming there is a newline after //\, which I've been unable to display)
The given code is legal in the case of C++, but not for C.
Indeed, the C (N1570) standard says:
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
The C++ standard (N3797) formulates it a bit differently (emphasis mine):
Each instance of a backslash character (\) immediately followed by a new-line character is deleted,
splicing physical source lines to form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. If, as a result, a character sequence that
matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file
that is not empty and that does not end in a new-line character, or that ends in a new-line character
immediately preceded by a backslash character before any such splicing takes place, shall be processed
as if an additional new-line character were appended to the file.
As per [lex.phases] p2 and p3, your particular case is also ill-formed in c++ standard.
[lex.phases] p2 says
Each sequence of a backslash character () immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
Since you said
Assuming there is a newline after //, which I've been unable to display
Hence, the last visible \ is eligible as a splice. So, the sequence consisted of \ and the new-line character is deleted. It means the last character in this source file is / but without being followed by a newline. // starts a comment according to [lex.comment] p1
The characters // start a comment, which terminates immediately before the next new-line character.
As per [lex.phases] p3
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.
In your case, the characters // start a comment but have no new line to terminate it. Hence, it's a partial comment. The program is ill-formed.
I have an array defined as follows:
extern const char config_reg[] = {
0x05, //comment
0x00, //comment
0x00, // \\ <-- double backslash
0x01, //comment
0x03
}
As you can see, there is a double backslash inside a comment (the <-- double backslash and preceding spaces do not appear in the actual source file). When I compile this code (minus the "<-- double backslash") it acts as if the following line is non existent - i.e. equivalent to writing:
extern const char config_reg[] = {
0x05, //comment
0x00, //comment
0x00, //
0x03
}
Is this intended C++ behaviour? If so, what is its intended purpose?
I am compiling using the Parallax Propeller Simple IDE to compile my code - not a particularly good compiler, by all accounts. Is it likely that the compiler implementation is causing this behaviour?
That's correct, assuming that the <-- double backslash and preceding spaces aren't actually in the code.
A single backslash would also produce the same effect.
The newline splicing for backslash-newline occurs before comment analysis, so the 0x01 line is part of the same line as the // \\ comment, so it isn't seen when the comment analysis is done.
The ISO/IEC 14882:2011 (C++11) standard says:
2.2 Phases of translation [lex.phases]
¶1 The precedence among the syntax rules of translation is specified by the following phases.11
Physical source file characters are mapped, in an implementation-defined manner, to the basic source
character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical
source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced
by corresponding single-character internal representations. Any source file character not in the basic
source character set (2.3) is replaced by the universal-character-name that designates that character.
(An implementation may use any internal encoding, so long as an actual extended character
encountered in the source file, and the same extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this
replacement is reverted in a raw string literal.)
Each instance of a backslash character (\) immediately followed by a new-line character is deleted,
splicing physical source lines to form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. If, as a result, a character sequence that
matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file
that is not empty and that does not end in a new-line character, or that ends in a new-line character
immediately preceded by a backslash character before any such splicing takes place, shall be processed
as if an additional new-line character were appended to the file.
The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters
(including comments). A source file shall not end in a partial preprocessing token or in a partial comment.12 Each comment is replaced by one space character. New-line characters are retained. Whether
each nonempty sequence of white-space characters other than new-line is retained or replaced by one
space character is unspecified. The process of dividing a source file’s characters into preprocessing tokens
is context-dependent. [ Example: see the handling of < within a #include preprocessing directive.
—end example ]
11) Implementations must behave as if these separate phases occur, although in practice different phases might be folded
together.
12) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that
requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment
would arise from a source file ending with an unclosed /* comment.
Yes, the second phase of translation involves "splicing physical source lines to form logical source lines"; if a line ends with a backslash, the following line is considered to be a continuation of that line. This is the standard behaviour. This occurs before the removal of comments in the third phase, so the fact that the backslash occurs in a comment doesn't change anything.
Line splicing is used quite frequently in C to split macros over multiple lines, since a preprocessor directive extends to the end of the line. It is much rarer in C++, which relies much less on macros than C.
I believe the original purpose in C was to work around restrictions on line length that existed on some now-archaic systems.
A \ at the end of a line escapes the newline character.
Thus in your example, it will extend the comment to the next line. The writer of the snippet probably used \\ instead of just \ for aesthetic purposes. But it doesn't only work with comments. For example this is allowed (but redundant):
int a; \
int b;
Some compilers allow whitespace between the \ and the newline but may issue a warning.
The spec says that at phase 1 of compilation
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
And at phase 4 it says
Preprocessing directives are executed, macro invocations are expanded
At phase 5, we have
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set
For the # operator, we have
a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).
Hence I conducted the following test
#define GET_UCN(X) #X
GET_UCN("€")
With an input character set of UTF-8 (matching my file's encoding), I expected the following preprocessing result of the #X operation: "\"\\u20AC\"". GCC, Clang and boost.wave don't transform the € into a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you please explain?
It's simply a bug. §2.1/1 says about Phase 1,
(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
This program clearly demonstrates the malfunction:
#include <iostream>
#define GET_UCN(X) L ## #X
int main() {
std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}
http://ideone.com/lb9jc
Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
"and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set"
used to be
"or universal-character-name in character literals and string literals is converted to a member of the execution character set"
Maybe you need a future version of g++.
I'm not sure where you got that citation for translation phase 1—the C99 standard says this about translation phase 1 in §5.1.1.2/1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
So in this case, the Euro character € (represented as the multibyte sequence E2 82 AC in UTF-8) is mapped into the execution character set, which also happens to be UTF-8, so its representation remains the same. It doesn't get converted into a universal character name because, well, there's nothing that says that it should.
I suspect you'll find that the euro sign does not satisfy the condition Any source file character not in the basic source character set so the rest of the text you quote doesn't apply.
Open your test file with your favourite binary editor and check what value is used to represent the euro sign in GET_UCN("€")
The latest draft of C++0x, n3126, says:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
...
Within the r-char-sequence of a raw string literal, any transformations performed in
phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
Technically this means that the C++ preprocessor only recognizes a backslash followed by the newline character, but I know that some C++ implementations also allow Windows- or classic Mac-style line endings as well.
Will conforming implementations of C++0x be required to preserve the newline sequence that immediately followed a backslash character \ within the r-char-sequence of a raw string? Maybe a better question is: would it be expected of a Windows C++0x compiler to undo each line splice with "\\\r\n" instead of "\\\n"?
Translation phase 1 starts with
Physical source file characters are
mapped, in an implementation-defined
manner, to the basic source character
set (introducing newline characters
for end-of-line indicators) if
necessary. Trigraph
sequences (2.3) are replaced [...]
I'd interpret the requirement "any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing)" as explicitly not reverting the transformation from source file characters to the basic source character set. Instead, source characters are later converted to the execution character set, and you get newline characters there.
If you need a specific line ending sequence, you can insert it explicitly, and use string literal concatenation:
char* nitpicky = "I must have a \\r\\n line ending!\r\n"
"Otherwise, some other piece of code will misinterpret this line!";