Is there any difference between the following?
auto s1 = L"你好";
auto s2 = u8"你好";
Are s1 and s2 referring to the same type?
If not, what's the difference, and which one is preferred?
They are not the same type.
s2 is a UTF-8 or narrow string literal. The C++11 draft standard section 2.14.5 String literals paragraph 7 says:
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
And paragraph 8 says:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
s1 is a wide string literal, whose encoding may be UTF-16 or UTF-32 depending on the platform. Section 2.14.5 String literals paragraph 11 says:
A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
See UTF8, UTF16, and UTF32 for a good discussion on the differences and advantages of each.
A quick way to determine types is to use typeid:
std::cout << typeid(s1).name() << std::endl ;
std::cout << typeid(s2).name() << std::endl ;
On my system this is the output:
PKw
PKc
Checking each of these with c++filt -t gives me:
wchar_t const*
char const*
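For a self-contained check, here is a minimal sketch (assuming a pre-C++20 compiler, since C++20 changes the element type of u8 literals to char8_t) that verifies the deduced pointer types at compile time and prints the mangled names at run time:
#include <iostream>
#include <typeinfo>
#include <type_traits>

int main()
{
    auto s1 = L"你好";   // deduces const wchar_t* (the array decays to a pointer)
    auto s2 = u8"你好";  // deduces const char* before C++20

    static_assert(std::is_same<decltype(s1), const wchar_t*>::value, "s1 is const wchar_t*");
    static_assert(std::is_same<decltype(s2), const char*>::value, "s2 is const char*");

    std::cout << typeid(s1).name() << '\n';  // e.g. PKw with gcc
    std::cout << typeid(s2).name() << '\n';  // e.g. PKc with gcc
}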
L"" creates a null-terminated string, of type const wchar_t[]. This is valid in C++03. (Note that wchar_t refers to an implementation-dependent "wide-character" type).
u8"" creates a null-terminated UTF-8 string, of type const char[]. This is valid only in C++11.
Which one you choose is strongly dependent on what needs you have. L"" works in C++03, so if you need to work with older code (which may need to be compiled with a C++03 compiler), you'll need to use that. u8"" is easier to work with in many circumstances, particularly when the system in question normally expects char * strings.
The first is a wide character string, which might be encoded as UTF-16 or UTF-32, or something else entirely (though Unicode is now common enough that a completely different encoding is pretty unlikely).
The second is a string of narrow characters using UTF-8 encoding.
As to which is preferred: it'll depend on what you're doing, what platform you're coding for, etc. If you're mostly dealing with something like a web page/URL that's already encoded as UTF-8, and you'll probably just read it in, possibly verify its content, and later echo it back, it may well make sense to store it as UTF-8 as well.
Wide character strings vary by platform. If, for one example, you're coding for Windows, and a lot of the code interacts directly with the OS (which uses UTF-16) then storing your strings as UTF-16 can make a great deal of sense (and that's what Microsoft's compiler uses for wide character strings).
Related
The Standard says in N3797 §3.9.1 [basic.fundamental]:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).
I can't imagine how we would use that type. Could you give an example where plain char doesn't work? I thought it might be helpful when using two different languages simultaneously, but plain char is fine for Cyrillic and Latin characters:
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << cp; //LATINICA_КИРИЛЛИЦА
}
DEMO
In your example, you are already using Unicode. Indeed, you could type not only Latin or Cyrillic but also Thai, Arabic, or Chinese; in other words, any Unicode symbol. Here is your example with some more symbols added: link.
The difference is in the encoding. In your example you are using char to store Unicode symbols encoded in UTF-8. See this for more details. The main advantage of UTF-8 is backward compatibility with ASCII; the main disadvantage of UTF-8 is its variable symbol length.
There are other encodings for Unicode symbols. Apart from UTF-8, the most common are UTF-16 and UTF-32. Be aware that UTF-16 is still a variable-length encoding, although its code unit is now 16 bits; UTF-32 is a fixed-length encoding.
The type wchar_t is usually used to store symbols in UTF-16 or UTF-32 encoding, depending on the system.
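To make the difference concrete, here is a small sketch (assuming a source file saved as UTF-8 and a pre-C++20 compiler) that stores the same two characters from the question in each encoding and prints the array sizes in bytes:
#include <iostream>

int main()
{
    const char     utf8[]  = u8"你好"; // 3 bytes per character + terminator = 7 bytes
    const char16_t utf16[] = u"你好";  // one 16-bit unit per character + terminator = 6 bytes
    const char32_t utf32[] = U"你好";  // one 32-bit unit per character + terminator = 12 bytes
    const wchar_t  wide[]  = L"你好";  // 6 or 12 bytes, depending on sizeof(wchar_t)

    std::cout << sizeof utf8 << ' ' << sizeof utf16 << ' '
              << sizeof utf32 << ' ' << sizeof wide << '\n';
}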
It depends on what encoding you decide to use. Any single UTF-8 value can be held in an 8-bit char (though one Unicode code point can take several char values to represent). It's impossible to tell from your question, but I'd guess that your editor and compiler are treating your strings as UTF-8, and that's fine if that's what you want.
Other common encodings include UTF-16, UTF-32, UCS-2 and UCS-4, which have 2-byte, 4-byte, 2-byte and 4-byte values respectively. You can't store these values in an 8-bit char.
The decision of what encoding to use for any given purpose is not straightforward. The main considerations are:
What other systems does your code have to interface to and what encoding do they use?
What libraries do you want to use and what encodings do they use? (e.g. xerces-c uses UTF-16 throughout)
The tradeoff between complexity and storage size. UTF-32 and UCS-4 have the useful property that every possible displayed character is represented by one value, so you can tell the length of the string from how much memory it takes up without having to look at the values in it (though this assumes that you consider combining diacritic marks as separate characters). However, if all you're representing is ASCII, they take up four times as much memory as UTF-8.
I'd suggest Joel Spolsky's essay on Unicode as a good read.
wchar_t has its own problems, though. The standard didn't specify how big a wchar_t is, so, of course, different compilers have picked different sizes; VC++ uses two bytes and gcc (and most others) uses four bytes. Wide-character literals, such as L"Hello, world," are similarly confused, being UTF-16 strings in VC++ and UCS-4 in gcc.
To try to clean this up, C++11 introduced two new character types:
char16_t is a character type guaranteed to be at least 16 bits wide, with the literal form u"Hello, world."
char32_t is a character type guaranteed to be at least 32 bits wide, with the literal form U"Hello, world."
However, these have problems of their own; in particular, <iostream> doesn't provide console streams that can handle them (i.e. there is no u16cout or u32cerr).
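For illustration, a minimal sketch of the new literal forms together with the matching std::u16string and std::u32string typedefs from the standard library:
#include <string>

int main()
{
    std::u16string s16 = u"Hello, world.";  // elements are char16_t, UTF-16 encoded
    std::u32string s32 = U"Hello, world.";  // elements are char32_t, UTF-32 encoded

    // There is no u16cout or u32cout, so printing these requires an explicit
    // conversion to a narrow or wide encoding first.
    return (s16.size() == 13 && s32.size() == 13) ? 0 : 1;
}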
To be more specific, I'll provide a normative reference related to the question. N3797 §8.5.2/1 [dcl.init.string] says:
An array of narrow character type (3.9.1), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow string literal, char16_t string literal, char32_t string literal, or wide string literal, respectively, or by an appropriately-typed string literal enclosed in braces (2.14.5). Successive characters of the value of the string literal initialize the elements of the array.
8.5.2/2:
There shall not be more initializers than there are array elements.
In the case of the following program, the array has 28 elements (assuming a UTF-8 execution character set): the nine ASCII characters of "LATINICA_" take one byte each, the nine Cyrillic characters take two bytes each in UTF-8, and one more byte is the terminating null character:
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << sizeof(cp) << std::endl; //28
}
DEMO
For some languages, like English, it's not necessary to use wchar_t, but for other languages, like Chinese, you'd better use wchar_t.
Although char is able to store such a string, like char p[] = "你好", it may display garbled text when you run your program on a different computer, especially one that uses a different default encoding.
If you use wchar_t, you can avoid this.
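If it helps, here is a minimal sketch of the wide-character route. The behaviour depends on the platform and on the terminal's locale settings, so treat it as illustrative rather than guaranteed:
#include <clocale>
#include <iostream>

int main()
{
    // Select the user's locale so that wide output is converted
    // to the terminal's encoding (platform-dependent).
    std::setlocale(LC_ALL, "");
    const wchar_t text[] = L"你好";
    std::wcout << text << L'\n';
}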
A string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
I don't understand the difference between an ordinary string literal and a UTF-8 string literal.
Can someone provide an example of a situation where they are different (i.e. cause different compiler output)?
(I mean from the POV of the standard, not any particular implementation)
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set.
The C and C++ languages allow a huge amount of latitude in their implementations. C was written long before UTF-8 was "the way to encode text in single bytes": different systems had different text encodings.
So the byte values of a string in C and C++ are really up to the compiler. 'A' is whatever the compiler's chosen encoding is for the character A, which may not agree with UTF-8.
C++11 added the requirement that real UTF-8 string literals must be supported by compilers. The value of u8"A"[0] is fixed by the C++ standard through the UTF-8 standard, regardless of the preferred encoding of the platform the compiler is targeting.
Now, much as most platforms C++ targets use 2's complement integers, most compilers have character encodings that are mostly compatible with UTF-8. So for strings like "hello world", "hello world" and u8"hello world" will almost certainly be identical.
For a concrete example, from man gcc
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
is an example of being able to change the execution and input character sets of C/C++.
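As a hedged illustration, this sketch dumps the bytes of an ordinary literal and a u8 literal; the first dump depends on the -fexec-charset in effect, while the second is always the UTF-8 sequence:
#include <cstdio>

int main()
{
    const char ordinary[] = "\u00DA";   // encoded in the execution character set
    const char utf8[]     = u8"\u00DA"; // always UTF-8: C3 9A

    // With gcc's default -fexec-charset=UTF-8 both lines print "C3 9A 00";
    // with e.g. -fexec-charset=LATIN1 the first line prints "DA 00" instead.
    for (unsigned char c : ordinary) std::printf("%02X ", static_cast<unsigned>(c));
    std::printf("\n");
    for (unsigned char c : utf8)     std::printf("%02X ", static_cast<unsigned>(c));
    std::printf("\n");
}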
From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char str[] = "Test String";
or this?
const char str[] = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥ 8, due to the range required for char in the C standard, which is (to quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix: it affects the encoding of the string in the executable, but unfortunately it does not affect the type.
Thus, for both "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever the compiler has selected, e.g. with Latin-1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for an array size of 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” is encoded with two bytes, for a slightly larger array size of 10.
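A minimal sketch of that point, assuming a pre-C++20 compiler (where both literals have element type const char) and, for the first comment, a Latin-1 execution character set:
#include <iostream>
#include <type_traits>

int main()
{
    const char a[] = "tørrfisk";   // execution character set; 9 bytes if that is Latin-1
    const char b[] = u8"tørrfisk"; // UTF-8; the "ø" takes two bytes, so 10 bytes

    // Same element type, possibly different array sizes.
    static_assert(std::is_same<std::remove_extent<decltype(a)>::type,
                               std::remove_extent<decltype(b)>::type>::value,
                  "both are arrays of const char");
    std::cout << sizeof a << ' ' << sizeof b << '\n';
}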
If the execution character set of the compiler is set to UTF-8, it makes no difference whether u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS2015:
#include <string>
#include <locale>
#include <codecvt>
std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on environment's setting for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
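A small sketch of that broken assumption, assuming the literal is stored as UTF-8: strlen counts char objects (bytes), not characters:
#include <cstring>
#include <iostream>

int main()
{
    const char s[] = u8"naïve";           // 5 characters
    std::cout << std::strlen(s) << '\n';  // prints 6: the "ï" occupies two bytes in UTF-8
}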
C++11 introduces a new set of string literal prefixes (and even allows user-defined suffixes). On top of this, you can directly use Unicode escape sequences to code a certain symbol without having to worry about encoding.
const char16_t* s16 = u"\u00DA";
const char32_t* s32 = U"\u00DA";
But can I use the unicode escape sequences in wchar_t string literals as well? It would seem to be a defect if this wasn't possible.
const wchar_t* sw = L"\u00DA";
The integer value of sw[0] would of course depend on what wchar_t is on a particular platform, but to all other effects, this should be portable, no?
It would work, but it may not have the desired semantics. \u00DA will expand into as many target characters as necessary for UTF8/16/32 encoding, depending on the size of wchar_t, but bear in mind that wide strings do not have any documented, guaranteed encoding semantics -- they're simply "the system's encoding", with no attempt made to say what that is, or require the user to know what that is.
So it's best not to mix and match. Use either one, but not both, of the two:
system-specific: char*/"", wchar_t*/L"", \x-literals, mbstowcs/wcstombs
Unicode: char*/u8"", char16_t*/u"", char32_t*/U"", \u/\U literals.
(Here are some related questions of mine on the subject.)
What is the type of string literal in C? Is it char * or const char * or const char * const?
What about C++?
In C the type of a string literal is char[]; it's not const according to the type, but it is undefined behavior to modify the contents. Also, two different string literals that have the same content (or enough of the same content) might or might not share the same array elements.
From the C99 standard 6.4.5/5 "String Literals - Semantics":
In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence; for wide string literals, the array elements have type wchar_t, and are initialized with the sequence of wide characters...
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
In C++, "An ordinary string literal has type 'array of n const char'" (from 2.13.4/1 "String literals"). But there's a special case in the C++ standard that makes pointer to string literals convert easily to non-const-qualified pointers (4.2/2 "Array-to-pointer conversion"):
A string literal (2.13.4) that is not a wide string literal can be converted to an rvalue of type “pointer to char”; a wide string literal can be converted to an rvalue of type “pointer to wchar_t”.
As a side note - because arrays in C/C++ convert so readily to pointers, a string literal can often be used in a pointer context, much as any array in C/C++.
Additional editorializing: what follows is really mostly speculation on my part about the rationale for the choices the C and C++ standards made regarding string literal types. So take it with a grain of salt (but please comment if you have corrections or additional details):
I think that the C standard chose to make string literal types non-const because there was (and is) so much code that expects to be able to use non-const-qualified char pointers that point to literals. When the const qualifier got added (which, if I'm not mistaken, was done around ANSI standardization time, but long after K&R C had been around to accumulate a ton of existing code), if they had made pointers to string literals assignable only to char const* types without a cast, nearly every program in existence would have required changing. Not a good way to get a standard accepted...
I believe the change to C++ that string literals are const qualified was done mainly to support allowing a literal string to more appropriately match an overload that takes a "char const*" argument. I think that there was also a desire to close a perceived hole in the type system, but the hole was largely opened back up by the special case in array-to-pointer conversions.
Annex D of the standard indicates that the "implicit conversion from const to non-const qualification for string literals (4.2) is deprecated", but I think so much code would still break that it'll be a long time before compiler implementers or the standards committee are willing to actually pull the plug (unless some other clever technique can be devised - but then the hole would be back, wouldn't it?).
A C string literal has type char [n] where n equals number of characters + 1 to account for the implicit zero at the end of the string.
The array will be statically allocated; it is not const, but modifying it is undefined behaviour.
If it had pointer type char * or incomplete type char [], sizeof could not work as expected.
Making string literals const is a C++ idiom and not part of any C standard.
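A small sketch of both points, written as C++; the commented-out line is valid C (where the literal is a non-const char[6]) but ill-formed in C++11 and later:
#include <cstdio>

int main()
{
    const char* q = "hello";  // in C++ the literal has type const char[6]
    // char* p = "hello";     // valid C, deprecated in C++98/03, ill-formed since C++11;
                              // writing through p would be undefined behaviour either way

    std::printf("%zu\n", sizeof "hello");  // 6: the literal keeps its array type,
                                           // so sizeof sees the whole array including '\0'
    return q[0] == 'h' ? 0 : 1;
}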
They used to be of type char[]. Now they are of type const char[].
For various historical reasons, string literals were always of type char[] in C.
Early on (in C90), it was stated that modifying a string literal invokes undefined behavior.
They didn't ban such modifications though, nor did they make string literals const char[], which would have made more sense. This was for backwards-compatibility reasons with old code. Some old OSes (most notably DOS) didn't protest if you modified string literals, so there was plenty of such code around.
C still has this defect today, even in the most recent C standard.
C++ inherited the very same defect from C, but later C++ standards made string literals const; the leftover implicit conversion to char* was flagged obsolete in C++03 and finally removed in C++11.