This question already has answers here:
Is the u8 string literal necessary in C++11
(4 answers)
Choose default type and encoding for C++ string literals at compile time
(1 answer)
Closed last month.
What is the encoding of unprefixed string literals in C++? For example, all string literals are parsed and stored as UTF-16 in Java and as UTF-8 in Python 3. I guess the same is true of C++ u8"" literals, but I'm not clear about ordinary literals like "".
What should the output of the following code be?
#include <iostream>
#include <iomanip>

int main() {
    auto c = "Hello, World!";
    while (*c) {
        std::cout << std::hex << static_cast<unsigned int>(*c++) << " ";
    }
}
When I run this on my machine, it gives the following output:
48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21
But is this guaranteed? The cppreference page for string literals says that characters inside ordinary string literals are from the translation character set, and the page on the translation character set states:
The translation character set consists of the following elements:
- each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
- a distinct character for each UCS scalar value where no named character is assigned.
From this definition, it seems the translation character set refers to Unicode (or a superset of it). Is there then no difference between "" and u8"" except for explicitness?
Suppose I want my string to be in EBCDIC encoding (just as an exercise); what is the correct way to achieve that in C++?
EDIT: The linked cppreference page for string literals does say that the encoding is implementation-defined. Does that mean I should avoid using them?
The encoding of string literals is controlled by compiler settings.
The defaults depend on the compiler. AFAIK, MSVC by default uses the encoding defined by the system locale, while GCC and Clang assume UTF-8.
In MSVC you can change this with the /execution-charset: switch.
GCC and Clang have the -fexec-charset= switch.
Note that you also have to tell the standard library what the current encoding of your string literals is; that is one of the jobs of std::locale::global.
Here is my other answer where I did some experiments with MSVC.
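Here is a minimal sketch of how those pieces fit together (the locale name "en_US.UTF-8" is an assumption, since available locale names vary by system; on GCC you could even pass something like -fexec-charset=IBM-1047 to get an EBCDIC code page, provided your iconv supports it):

// Build with an explicit execution character set, e.g.:
//   g++ -fexec-charset=UTF-8 demo.cpp
//   cl /execution-charset:utf-8 demo.cpp
#include <cstdio>
#include <locale>

int main() {
    // Tell the standard library which encoding the program's literals use
    // at runtime. "en_US.UTF-8" is an assumption; locale names vary by
    // platform, and std::locale throws if the name is unknown.
    std::locale::global(std::locale("en_US.UTF-8"));

    const char* s = "Hello, World!";
    // Dump the raw bytes the compiler actually stored for the literal.
    for (const char* p = s; *p; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}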
Related
I didn't know that C and C++ allow multicharacter literals: not just 'c' (of type int in C and char in C++), but 'tralivali' (of type int!)
enum
{
    ActionLeft     = 'left',
    ActionRight    = 'right',
    ActionForward  = 'forward',
    ActionBackward = 'backward'
};
Standard says:
C99 6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
I found that they are widely used in the C4 engine. But I suppose they are not safe when we are talking about platform-independent serialization. They can also be confusing because they look like strings. So what is the scope of usage for multicharacter literals? Are they useful for anything? Are they in C++ just for compatibility with C code? Are they considered a bad feature, like the goto statement, or not?
It makes it easier to pick out values in a memory dump.
Example:
enum state { waiting, running, stopped };
vs.
enum state { waiting = 'wait', running = 'run.', stopped = 'stop' };
a memory dump after the following statement:
s = stopped;
might look like:
00 00 00 02 . . . .
in the first case, vs:
73 74 6F 70 s t o p
using multicharacter literals. (Of course, whether it says 'stop' or 'pots' depends on byte ordering.)
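For a self-contained illustration, here is a sketch that prints the raw bytes of such an enum value, the way a debugger's memory view would show them (the exact value and byte order are implementation-defined, so output may differ between compilers):

#include <cstddef>
#include <cstdio>

enum state { waiting = 'wait', running = 'run.', stopped = 'stop' };

int main() {
    state s = stopped;
    // Inspect the object representation byte by byte, as a memory
    // dump would present it.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&s);
    for (std::size_t i = 0; i < sizeof s; ++i)
        std::printf("%02X ", p[i]);
    std::printf("\n"); // e.g. "70 6F 74 73" (little-endian) or "73 74 6F 70"
}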
I don't know how extensively this is used, but "implementation-defined" is a big red flag to me. As far as I know, it could mean that the implementation is free to ignore your character designations and just assign normal incrementing values if it wanted. It may do something "nicer", but you can't rely on that behavior across compilers (or even compiler versions). At least goto has predictable (if undesirable) behavior...
That's my 2c, anyway.
Edit: on "implementation-defined":
From Bjarne Stroustrup's C++ Glossary:
implementation defined - an aspect of C++'s semantics that is defined for each implementation rather than specified in the standard for every implementation. An example is the size of an int (which must be at least 16 bits but can be longer). Avoid implementation defined behavior whenever possible. See also: undefined. TC++PL C.2.
also...
undefined - an aspect of C++'s semantics for which no reasonable behavior is required. An example is dereferencing a pointer with the value zero. Avoid undefined behavior. See also: implementation defined. TC++PL C.2.
I believe this means the comment is correct: it should at least compile, although anything beyond that is not specified. Note the advice in the definition, also.
Four-character literals I've seen and used. They map to 4 bytes = one 32-bit word. As said above, they are very useful for debugging purposes. And they can be used in a switch/case statement with ints, which is nice.
This (4 chars) is pretty standard (i.e. supported by at least GCC and VC++), although the results (the actual values compiled) may vary from one implementation to another.
But beyond 4 chars? I wouldn't use them.
UPDATE: From the C4 page: "For our simple actions, we'll just provide an enumeration of some values, which is done in C4 by specifying four-character constants". So they are using four-character literals, as in my case.
Multicharacter literals allow one to specify int values via the equivalent representation in characters. Useful for enums, FourCC codes and tags, and non-type template parameters. With a multicharacter literal, a FourCC code can be typed directly into the source, which is handy.
The implementation in gcc is described at https://gcc.gnu.org/onlinedocs/cpp/Implementation-defined-behavior.html . Note that the value is truncated to the size of the type int, so 'efgh' == 'abcdefgh' if your ints are 4 chars wide, although gcc will issue a warning on the literal that overflows.
Unfortunately, gcc will issue a warning on all multi-character literals if -pedantic is passed, as their behavior is implementation-defined. As you can see above, it is perhaps possible for equality of two multi-character literals to change if you switch implementations.
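As a concrete (hypothetical) FourCC example, a sketch like the following compiles on GCC, Clang, and MSVC, though the case values are implementation-defined and -pedantic will warn about each literal:

#include <cstdint>
#include <cstdio>

// Hypothetical chunk dispatcher: the FourCC tags are typed directly
// into the source as multicharacter literals.
void handle_chunk(std::uint32_t tag) {
    switch (tag) {
        case 'RIFF': std::printf("RIFF container\n"); break;
        case 'fmt ': std::printf("format chunk\n");   break;
        case 'data': std::printf("data chunk\n");     break;
        default:     std::printf("unknown chunk\n");  break;
    }
}

int main() { handle_chunk('fmt '); }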
In the C++14 specification draft N4527, section 2.13.3, entry 2:
... An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Previous answers to your question pertained mostly to real machines that did support multicharacter literals. Specifically, on platforms where int is 4 bytes, a four-character literal is fine and can be used for convenience, as in Ferrucio's memory-dump example. But, as there is no guarantee that this will work, or work the same way, on other platforms, multicharacter literals should be avoided in portable programs.
Unbelievable: every compiler I know places the first character of a uint defined as a four-character constant in the least significant byte (little-endian), but Visual C++ does it in the opposite direction 🙄
// file signature
#define SFKFILE_SIGNATURE 'SFPK'   // 'S' == 0x53

// check header
if (out_FileHdr->Signature != SFKFILE_SIGNATURE)

fails on VC:

Borland: 4B504653 4B504653
Watcom:  4B504653 4B504653
VisualC: 4B504653 5346504B
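A common workaround (not from the original post) is to build the tag with explicit shifts instead of a multicharacter literal, which pins the value down regardless of compiler:

#include <cstdint>

// Hypothetical portable replacement for the 'SFPK' literal: the value
// is fixed by the shifts, not by implementation-defined
// multicharacter-literal packing.
constexpr std::uint32_t fourcc(char a, char b, char c, char d) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) <<  8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

constexpr std::uint32_t SFKFILE_SIGNATURE = fourcc('S', 'F', 'P', 'K');
static_assert(SFKFILE_SIGNATURE == 0x5346504B, "value is pinned down");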
Here I have some simple code:
#include <iostream>
#include <cstdint>
#include <cstddef>

int main()
{
    const unsigned char utf8_string[] = u8"\xA0";
    std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
    for (std::size_t i = 0; i < sizeof(utf8_string); i++) {
        std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
    }
}
I see different behavior here with MSVC and GCC.
MSVC treats "\xA0" as an unencoded Unicode code point and encodes it to UTF-8.
So with MSVC the output is:
C2A0
which is the correct UTF-8 encoding of the Unicode code point U+00A0.
But with GCC nothing happens. It treats the string as simple bytes. There is no change even if I remove the u8 prefix.
Both compilers encode to UTF-8 with the output C2A0 if the string is written as u8"\u00A0".
Why do the compilers behave differently, and which one actually does it right?
Software used for test:
GCC 8.3.0
MSVC 19.00.23506
C++ 11
They're both wrong.
As far as I can tell, the C++17 standard says here that:
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.
There are tickets for this which are currently marked as Under Investigation:
https://developercommunity.visualstudio.com/content/problem/225847/hex-escape-codes-in-a-utf8-literal-are-treated-in.html
https://developercommunity.visualstudio.com/content/problem/260684/escape-sequences-in-unicode-string-literals-are-ov.html
However it also says here about UTF-8 literals that:
If the value is not representable with a single UTF-8 code unit, the program is ill-formed.
Since the value 0xA0 is not representable with a single UTF-8 code unit, the program should not compile.
Note that:
- UTF-8 literals, prefixed with u8, are defined as being narrow.
- \xA0 is an escape sequence.
- \u00A0 is considered a universal-character-name and not an escape sequence.
This is CWG issue 1656.
It has been resolved in the current standard draft through P2029R4 so that numeric escape sequences are taken by value as a single code unit, not as a Unicode code point that is then encoded to UTF-8, even if the result is an invalid UTF-8 sequence.
Therefore GCC's behavior is/will be correct.
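A sketch of the difference under the resolved rules (assuming C++20 or later for char8_t; with an older, pre-fix MSVC the first literal would come out as C2 A0 instead):

#include <cstdio>

int main() {
    // \xA0 is a numeric escape: a single code unit with value 0xA0,
    // even though that byte alone is not valid UTF-8.
    const char8_t raw[] = u8"\xA0";   // bytes: A0 00
    // \u00A0 is a universal-character-name: the code point U+00A0,
    // encoded into UTF-8.
    const char8_t ucn[] = u8"\u00A0"; // bytes: C2 A0 00
    std::printf("%zu %zu\n", sizeof raw, sizeof ucn); // prints "2 3"
}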
I can't tell you which way is true to the standard.
The way MSVC does it is at least logically consistent and easily explained: the three escape sequences \x, \u, and \U behave identically except for the number of hex digits they pull from the input (2, 4, or 8); each defines a Unicode code point that must then be encoded to UTF-8. Embedding a byte without encoding it opens up the possibility of creating an invalid UTF-8 sequence.
Why do compilers behave differently and which actually does it right?
The compilers behave differently because of how they chose to implement the C++ standard:
- GCC uses strict rules and implements the standard as is.
- MSVC uses loose rules and implements the standard in a more practical, "real-world" kind of way.
So things that fail in GCC will usually work in MSVC because it's more permissive, and MSVC handles some of these issues automatically.
Here is a similar example:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167.
It follows the standard, but it's not what you would expect.
As to which one does it right: that depends on what your definition of "right" is.
this->textBox1->Name = L"textBox1";
Although it seems to work without the L, what is the purpose of the prefix? The way it is used doesn't even make sense to a hardcore C programmer.
It's a wchar_t literal, for the extended character set. Wikipedia has a little discussion on this topic, along with C++ examples.
'L' means wchar_t, which, as opposed to a normal character, requires 16 bits of storage (on Windows) rather than 8. Here's an example:
"A" = 41
"ABC" = 41 42 43
L"A" = 00 41
L"ABC" = 00 41 00 42 00 43
A wchar_t is twice as big as a simple char on Windows (its exact size is implementation-defined). In daily use you don't need to use wchar_t, but if you are using windows.h you are going to need it.
It means the text is stored as wchar_t characters rather than plain old char characters.
(I originally said it meant unicode. I was wrong about that. But it can be used for unicode.)
It means that it is a wide character, wchar_t.
Similar to 1L being a long value.
It means it's an array of wide characters (wchar_t) instead of narrow characters (char).
It's just a string of a different kind of character, not necessarily a Unicode string.
L is a prefix used for wide strings. Each character uses several bytes (depending on the size of wchar_t). The encoding used is independent of this prefix; it is not necessarily UTF-16, unlike what is stated in other answers here.
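A small sketch to inspect this on your own platform (the output depends on the implementation's wchar_t size and wide encoding):

#include <cstdio>

int main() {
    const wchar_t* w = L"ABC";
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); // 2 on Windows, 4 on most Unix
    // Print each wide code unit as hex.
    for (const wchar_t* p = w; *p; ++p)
        std::printf("%04x ", static_cast<unsigned>(*p));
    std::printf("\n"); // e.g. "0041 0042 0043"
}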
Here is an example of the usage:
By adding L before a character literal you get a wide character, which can then be returned as a char32_t:

char32_t utfRepresentation()
{
    if (m_is_white)
    {
        return L'♔'; // white chess king, U+2654
    }
    return L'♚';     // black chess king, U+265A
}