What do the C and C++ standards say about a whitespace character (or several) after a backslash? Is the line still guaranteed to be joined with the next one?
int main()
{
// Comment \
int foo;
}
MSVC and GCC behave differently in this case.
For reference, the standard quote is (§2.2/1, abridged, emphasis mine):
Phases of Translation
[...]
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
[...]
The implementation-defined part that other answers are mentioning is in the definition of "new-line".
(Note that comments are not replaced until phase 3, so that in this code:
int main()
{
int x = 0;
// assuming the definition of new-line is as expected, this function
// will return 0, not 5 (no whitespace after this backslash: ) \
x = 5;
return x;
}
x = 5; will be appended to the end of the comment, then ultimately removed.)
The C standard leaves it implementation-defined how a text file is broken into lines (as part of translation phase 1, if memory serves). For the purpose of \-newline, GCC defines a line ending as zero or more ASCII horizontal whitespace characters (SPC, TAB, VT, or FF) followed by one of the three common ASCII line termination sequences: CR, LF, or CR LF.
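To make the difference concrete, here is a minimal sketch of phase-2 splicing with a switch for the whitespace-tolerant rule described above. The function name splice_lines and its allow_trailing_ws parameter are hypothetical, invented for illustration; this is not how GCC or any other compiler is actually implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes backslash + line terminator pairs from `src`.
// With allow_trailing_ws set, horizontal whitespace between the backslash
// and the terminator is tolerated (the GCC-like rule described above);
// otherwise the backslash must be immediately followed by the terminator.
std::string splice_lines(const std::string& src, bool allow_trailing_ws)
{
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            std::size_t j = i + 1;
            if (allow_trailing_ws) {
                while (j < src.size() &&
                       (src[j] == ' ' || src[j] == '\t' ||
                        src[j] == '\v' || src[j] == '\f'))
                    ++j;
            }
            // Accept CR LF, LF, or CR as the line terminator.
            std::size_t len = 0;
            if (j < src.size() && src[j] == '\r')
                len = (j + 1 < src.size() && src[j + 1] == '\n') ? 2 : 1;
            else if (j < src.size() && src[j] == '\n')
                len = 1;
            if (len != 0) {
                i = j + len; // drop backslash, whitespace, and terminator
                continue;
            }
        }
        out += src[i++];
    }
    return out;
}
```

Under the strict rule, "// Comment \" followed by a space and a line break is left alone, so int foo; stays a declaration; under the tolerant rule the two lines are joined and int foo; becomes part of the comment.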
I do not know what MSVC does, but I would not be at all surprised if it is different.
[lex.ccon] contains the following definition for c-char:
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
Given that the new-line character in C is the escape-sequence \n, isn't there a contradiction in the definition above?
PS: note that the C++ Standard doesn't really define what is a new-line character.
No, when the definition says "new-line character" it means an actual new-line, not the special two-character sequence (backslash and n) that can be interpreted by the compiler as a new-line in special circumstances (inside constant string or character literals).
The C++ Standard does say that new-line characters are introduced for end-of-line indicators in translation phase 1; that the source file is decomposed into preprocessing tokens (a character-literal is a preprocessing token) and sequences of white-space characters (including comments) in translation phase 3; and that each escape sequence (\n is an escape sequence) is converted to the corresponding member of the execution character set in translation phase 5.
So it is clearly defined that, when a character-literal is formed, the character sequence \n has not yet been turned into a new-line character, whereas the end-of-line indicator has already been turned into one. (The end-of-line indicator, like all the details of translation phase 1, is implementation-defined, but it is generally agreed to be LF on Unix and Unix-like systems and CR+LF on Windows.)
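A small sketch of the distinction, assuming a C++11 compiler (the constant names escaped_len and raw_len are made up for this example). It only shows that the two source characters backslash and n form one escape sequence in an ordinary literal, while a raw literal keeps them as two characters:

```cpp
#include <cstddef>

// Ordinary literal: backslash + n is one escape sequence, converted to a
// single character of the execution character set in phase 5.
constexpr std::size_t escaped_len = sizeof("\n") - 1;

// Raw literal: the same two source characters stay two characters.
constexpr std::size_t raw_len = sizeof(R"(\n)") - 1;

static_assert(escaped_len == 1, "escape sequence yields one character");
static_assert(raw_len == 2, "raw literal keeps backslash and n");
```

An actual line break typed between the quotes of a character literal is a different thing: that is the "new-line character" the c-char grammar excludes.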
I am referring to: Why should text files end with a newline?
One of the answers quotes the C89 standard, which in brief dictates that a file must end with a new-line that is not immediately preceded by a backslash.
Does that apply to the most recent C++ standard?
#include <iostream>
using namespace std;
int main()
{
cout << "Hello World!" << endl;
return 0;
}
//\
Is the above valid? (Assuming there is a newline after //\, which I've been unable to display)
The given code is legal in C++, but not in C.
Indeed, the C (N1570) standard says:
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
The C++ standard (N3797) formulates it a bit differently (emphasis mine):
Each instance of a backslash character (\) immediately followed by a new-line character is deleted,
splicing physical source lines to form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. If, as a result, a character sequence that
matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file
that is not empty and that does not end in a new-line character, or that ends in a new-line character
immediately preceded by a backslash character before any such splicing takes place, shall be processed
as if an additional new-line character were appended to the file.
As per [lex.phases] p2 and p3, your particular case is ill-formed under the C++ standard as well.
[lex.phases] p2 says
Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
Since you said
Assuming there is a newline after //\, which I've been unable to display
Hence the last visible \ is eligible for a splice, so the sequence consisting of \ and the new-line character is deleted. This means the last character in the source file is /, with no new-line following it. // starts a comment according to [lex.comment] p1:
The characters // start a comment, which terminates immediately before the next new-line character.
As per [lex.phases] p3
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.
In your case, the characters // start a comment but have no new line to terminate it. Hence, it's a partial comment. The program is ill-formed.
I have an array defined as follows:
extern const char config_reg[] = {
0x05, //comment
0x00, //comment
0x00, // \\ <-- double backslash
0x01, //comment
0x03
};
As you can see, there is a double backslash inside a comment (the <-- double backslash and preceding spaces do not appear in the actual source file). When I compile this code (minus the "<-- double backslash"), it acts as if the following line did not exist, i.e. it is equivalent to writing:
extern const char config_reg[] = {
0x05, //comment
0x00, //comment
0x00, //
0x03
};
Is this intended C++ behaviour? If so, what is its intended purpose?
I am compiling my code with the Parallax Propeller Simple IDE, which is not a particularly good compiler, by all accounts. Is it likely that the compiler implementation is causing this behaviour?
That's correct, assuming that the <-- double backslash and preceding spaces aren't actually in the code.
A single backslash would also produce the same effect.
The newline splicing for backslash-newline occurs before comment analysis, so the 0x01 line is part of the same line as the // \\ comment, so it isn't seen when the comment analysis is done.
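A compilable sketch of the effect, using a single backslash. The array mirrors the one in the question (without extern); the name config_reg_size is added here for illustration, and compilers such as GCC may warn about a multi-line comment:

```cpp
#include <cstddef>

const char config_reg[] = {
    0x05, // comment
    0x00, // comment
    0x00, // this comment ends with a backslash \
    0x01, // spliced into the comment above, so never compiled
    0x03
};

// Four initializers survive, not five.
constexpr std::size_t config_reg_size = sizeof(config_reg);
static_assert(config_reg_size == 4, "the 0x01 line was swallowed by the comment");
```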
The ISO/IEC 14882:2011 (C++11) standard says:
2.2 Phases of translation [lex.phases]
¶1 The precedence among the syntax rules of translation is specified by the following phases. (11)
1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source
character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical
source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced
by corresponding single-character internal representations. Any source file character not in the basic
source character set (2.3) is replaced by the universal-character-name that designates that character.
(An implementation may use any internal encoding, so long as an actual extended character
encountered in the source file, and the same extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this
replacement is reverted in a raw string literal.)
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted,
splicing physical source lines to form logical source lines. Only the last backslash on any physical
source line shall be eligible for being part of such a splice. If, as a result, a character sequence that
matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file
that is not empty and that does not end in a new-line character, or that ends in a new-line character
immediately preceded by a backslash character before any such splicing takes place, shall be processed
as if an additional new-line character were appended to the file.
3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters
(including comments). A source file shall not end in a partial preprocessing token or in a partial
comment. (12) Each comment is replaced by one space character. New-line characters are retained. Whether
each nonempty sequence of white-space characters other than new-line is retained or replaced by one
space character is unspecified. The process of dividing a source file’s characters into preprocessing tokens
is context-dependent. [ Example: see the handling of < within a #include preprocessing directive.
—end example ]
11) Implementations must behave as if these separate phases occur, although in practice different phases might be folded
together.
12) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that
requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment
would arise from a source file ending with an unclosed /* comment.
Yes, the second phase of translation involves "splicing physical source lines to form logical source lines"; if a line ends with a backslash, the following line is considered to be a continuation of that line. This is the standard behaviour. This occurs before the removal of comments in the third phase, so the fact that the backslash occurs in a comment doesn't change anything.
Line splicing is used quite frequently in C to split macros over multiple lines, since a preprocessor directive extends to the end of the line. It is much rarer in C++, which relies much less on macros than C.
I believe the original purpose in C was to work around restrictions on line length that existed on some now-archaic systems.
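As an illustration of the macro use mentioned above, here is a minimal sketch; SUM_OF_SQUARES and hypotenuse_squared are names invented for this example.

```cpp
// A directive ends at the end of the logical line, so every physical line
// of the definition except the last ends with a backslash.
#define SUM_OF_SQUARES(a, b) \
    ((a) * (a) +             \
     (b) * (b))

// Hypothetical helper used only to exercise the macro.
inline int hypotenuse_squared(int x, int y)
{
    return SUM_OF_SQUARES(x, y);
}
```

After phase 2 the preprocessor sees the whole definition as one line, exactly as if it had been written without the line breaks.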
A \ at the end of a line escapes the newline character.
Thus in your example, it will extend the comment to the next line. The writer of the snippet probably used \\ instead of just \ for aesthetic purposes. But it doesn't only work with comments. For example this is allowed (but redundant):
int a; \
int b;
Some compilers allow whitespace between the \ and the newline but may issue a warning.
Is it possible in C++ to write a macro, which AFTER expansion will output a backslash sign?
Right now I'm using this code:
#define SOME_ENUM(XX) \
XX(FirstValue,) \
XX(SecondValue,) \
XX(SomeOtherValue,=50) \
XX(OneMoreValue,=100) \
but I want to write a macro, which will generate the code above, so I want to be able to write:
ENUM_BEGIN(name) // it should output: #define SOME_ENUM(XX) \
ENUM(ONE) // it should output: XX(ONE,) \
//...
But I was not able to write a macro like ENUM_BEGIN, because it should expand to something with backslash on the end.
Is it possible in C++?
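For context, a list macro like SOME_ENUM is usually consumed by passing it another macro. A minimal sketch, assuming the goal is to generate an enum; the names SomeEnum and MAKE_ENUMERATOR are hypothetical, and the trailing backslash on the last list line is dropped so that the definition ends there:

```cpp
#define SOME_ENUM(XX)        \
    XX(FirstValue, )         \
    XX(SecondValue, )        \
    XX(SomeOtherValue, = 50) \
    XX(OneMoreValue, = 100)

// Turns one list entry into "name [= value],".
#define MAKE_ENUMERATOR(name, init) name init,

enum SomeEnum {
    SOME_ENUM(MAKE_ENUMERATOR)
};
```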
No, it is not possible. The relevant text is translation phase 2, described in §2.2/1 of ISO/IEC 14882:2011(E):
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
Basically, the \\\n (where the \n is a physical new-line in the source, not an escape) will be treated as a \ character followed by a line splice. The remaining \ will most likely result in a syntax error (there may be situations where it is legal, but I don't currently see any), and it will not be treated as a line splice during subsequent translation phases (line splicing occurs only in phase 2).
I haven't found any documentation for it, but I would've thought that you could just do \\ and you'll generate a backslash.
However, in my research, I see that may not be the biggest thing you'll have to deal with. As millsj just commented, you'll have issues outputting the # in your ENUM_BEGIN. See Escaping a # symbol in a #define macro? .
The latest draft of C++0x, n3126, says:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
...
Within the r-char-sequence of a raw string literal, any transformations performed in
phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
Technically this means that the C++ preprocessor only recognizes a backslash followed by the newline character, but I know that some C++ implementations also accept Windows- or classic Mac-style line endings.
Will conforming implementations of C++0x be required to preserve the newline sequence that immediately followed a backslash character \ within the r-char-sequence of a raw string? Maybe a better question is: would it be expected of a Windows C++0x compiler to undo each line splice with "\\\r\n" instead of "\\\n"?
Translation phase 1 starts with
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing newline characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced [...]
I'd interpret the requirement "any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing)" as explicitly not reverting the transformation from source file characters to the basic source character set. Instead, source characters are later converted to the execution character set, and you get newline characters there.
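A minimal sketch of what that interpretation predicts, assuming the compiler maps the file's end-of-line indicator to a single new-line character in phase 1 (which is implementation-defined). The variable names spliced and raw are made up for this example:

```cpp
#include <string>

// Ordinary literal: the backslash and the line break are removed in phase 2.
const std::string spliced = "a\
b";

// Raw literal: the splice is reverted, so the backslash stays and the line
// break shows up as one new-line character, whatever the file used on disk.
const std::string raw = R"(a\
b)";
```

Under that assumption, spliced holds two characters (a, b), while raw holds four (a, backslash, new-line, b).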
If you need a specific line ending sequence, you can insert it explicitly, and use string literal concatenation:
const char* nitpicky = "I must have a \\r\\n line ending!\r\n"
"Otherwise, some other piece of code will misinterpret this line!";