What is the difference between "UTF-16" and "std::wstring"? - c++

Is there any difference between these two string storage formats?

std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.
UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.
Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
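To make the distinction concrete, here is a minimal C++11 sketch showing that the containers only store code units and that the width of wchar_t varies:

#include <iostream>
#include <string>

int main() {
    // Implementation-defined: 2 bytes with MSVC, typically 4 with GCC/Clang on Linux.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';

    // U+1F600 lies outside the BMP. In UTF-16 it needs a surrogate pair,
    // so the u16string holds two code units for a single code point --
    // the container does not merge them back into one "character" for you.
    std::u16string s = u"\U0001F600";
    std::cout << "UTF-16 code units: " << s.size() << '\n';  // 2
}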

UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).
The encoding used with wchar_t does not necessarily have to be UTF-16—it could also be UTF-32 for example.

UTF-16 is a way of representing text as a sequence of 16-bit elements, but an actual textual character may consist of more than one element.
std::wstring is just a collection of these elements; it is a class primarily concerned with their storage.
The elements of a wstring are of type wchar_t, which is at least 16 bits wide but could be 32 bits.

Related

Issue regarding char datatype in c++

I understand that the 'char' datatype is used to store a single character and takes 1 byte, but what are char16_t, char32_t, and wchar_t used for? After all, we only have to store a single character.
Regarding char16_t and char32_t, quoting from a Microsoft article:
The char16_t and char32_t types represent 16-bit and 32-bit wide characters, respectively. Unicode encoded as UTF-16 can be stored in the char16_t type, and Unicode encoded as UTF-32 can be stored in the char32_t type. Strings of these types and wchar_t are all referred to as wide strings, though the term often refers specifically to strings of wchar_t type.
And for wchar_t:
The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems. The wide character versions of the Universal C Runtime (UCRT) library functions use wchar_t and its pointer and array types as parameters and return values, as do the wide character versions of the native Windows API.
So they cannot be described as simply "a character"; which type you pick depends on the encoding, as mentioned above.
For example, the character u (U+0075) stored in a char16_t has the value 0x0075; the FEFF you may see in front of it in a file dump (fe ff 00 75) is just the byte order mark of a big-endian UTF-16 stream, not part of the character itself.
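A small sketch of the widths involved; the wchar_t size printed will differ between Windows and Unix-like systems:

#include <cstdint>
#include <iostream>

int main() {
    char16_t c16 = u'\u00E9';      // U+00E9 (é) fits in one UTF-16 code unit
    char32_t c32 = U'\U0001F600';  // a non-BMP code point needs the 32-bit type
    wchar_t  wc  = L'A';           // width is implementation-defined (16 or 32 bits)

    std::cout << sizeof(c16) << ' ' << sizeof(c32) << ' ' << sizeof(wc) << '\n';
    std::cout << std::hex << static_cast<std::uint32_t>(c16) << '\n';  // e9
}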

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
If I want to write a program that can efficiently perform operations on Unicode text, and can read and write files in UTF-8 or UTF-16 encodings, what is the appropriate way to do this with C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant C++ code that is leak-free. I need clear answers to:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-English characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc.?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and each has its own advantages and disadvantages. Generally, UTF-8 is good for storage and communication purposes and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendor, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
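For illustration, here is a minimal sketch of boundary conversion helpers built on the deprecated standard facility mentioned above (the function names are mine; a production codebase would more likely use ICU, iconv, or platform calls such as MultiByteToWideChar on Windows):

#include <codecvt>
#include <locale>
#include <string>

// std::wstring_convert is deprecated since C++17, so treat this as
// illustrative rather than a recommendation.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}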
Can I use non-English characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc.?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for UTF-16 or UTF-32 (depending on compiler/platform).
u for UTF-16.
U for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So whatever charset you choose to save your files in (UTF-8, say), make sure you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
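For example (a C++17 sketch written with universal character names so the source file's charset doesn't matter; under C++20 the u8 line yields const char8_t* instead of const char*):

// The literal text is the Polish word "żółć".
const char*     a = u8"\u017C\u00F3\u0142\u0107";  // UTF-8 (const char8_t* in C++20)
const wchar_t*  b = L"\u017C\u00F3\u0142\u0107";   // UTF-16 or UTF-32, platform-dependent
const char16_t* c = u"\u017C\u00F3\u0142\u0107";   // UTF-16
const char32_t* d = U"\u017C\u00F3\u0142\u0107";   // UTF-32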
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each char in the string may be a single byte, or part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string and the character being encoded.
Likewise, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF encoded codeunit. UTF-8 encodes a Unicode codepoint as 1-4 codeunits (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 codeunits (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 codeunit (1 char32_t in a std::u32string).
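A small sketch of those counts, assuming the compiler's narrow execution charset is UTF-8 (the default for GCC and Clang; /utf-8 on MSVC):

#include <iostream>
#include <string>

int main() {
    // U+0105 (ą) needs 2 UTF-8 code units; U+1F600 needs 4 in UTF-8,
    // 2 in UTF-16 (a surrogate pair) and 1 in UTF-32.
    std::string    s8  = "\u0105\U0001F600";
    std::u16string s16 = u"\u0105\U0001F600";
    std::u32string s32 = U"\u0105\U0001F600";

    std::cout << s8.size() << ' ' << s16.size() << ' ' << s32.size() << '\n';  // 6 3 2
}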
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
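A quick check of that behaviour; this compiles as C++11/14/17, while under C++20 the first line would need std::u8string because the type of u8 literals changed:

#include <cassert>
#include <string>

int main() {
    std::string s = u8"foo";  // pre-C++20: u8"..." is an array of char
    s += 'x';                 // appends exactly one char (one UTF-8 code unit)
    assert(s == "foox");      // 'x' is ASCII, so the result is still valid UTF-8
}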
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
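A rough sketch of what such a heuristic might look like (the function name is mine, and it only verifies the structural byte pattern; it does not reject overlong encodings or surrogate code points):

#include <cstddef>
#include <string>

// Plain ASCII is also valid UTF-8, so it will always pass this check.
bool looks_like_utf8(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if      (lead < 0x80)           extra = 0;   // ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                           // invalid lead byte

        if (i + 1 + extra > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                        // continuation byte expected
        i += 1 + extra;
    }
    return true;
}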
What are differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

using unicode in a C++ program

I want strings with Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode character has this form: "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
On the internet I see examples using "wide strings" or wchar_t.
So, what is the suitable object to handle Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store JSON that could contain "wide characters", but those characters take more than one byte to represent... I don't understand.
This is the kind of temporary arrangement I found for the moment (unicode->utf8?ascii?, listFolder is a std::string) :
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which turned out to be a failure because Microsoft's definition of wchar_t as two bytes turned out to be insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instances of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that map to more than one replacement character (e.g. the German ß must be expanded to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
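For example, assuming ICU4C is installed and linked (e.g. -licuuc), full case mapping looks like this:

#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // Full-case mapping that basic_string cannot do: U+00DF (ß) uppercases to "SS".
    icu::UnicodeString u = icu::UnicodeString::fromUTF8("stra\xC3\x9F" "e");  // "straße" in UTF-8
    u.toUpper(icu::Locale::getGerman());

    std::string out;
    u.toUTF8String(out);
    std::cout << out << '\n';   // STRASSE
}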
\uxxxx and \UXXXXXXXX are Unicode character escapes (universal character names). The xxxx is a 16-bit hexadecimal number naming a code point in the Basic Multilingual Plane, i.e. one that fits in a single UTF-16 code unit.
The XXXXXXXX is a 32-bit hexadecimal number naming any Unicode code point, on any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding.
u"..." is always a const uchar16_t[] in UTF-16 encoding.
U"..." is always a const uchar32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).
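A concrete illustration of those pitfalls, with the UTF-8 bytes spelled out explicitly so no particular source or execution charset is assumed:

#include <iostream>
#include <string>

int main() {
    std::string s = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";  // "zażółć" as raw UTF-8 bytes
    std::cout << s.length() << '\n';      // 10 code units, not 6 characters
    std::string cut = s.substr(0, 3);     // ends mid-sequence: "za" plus a lone lead byte -- invalid UTF-8
}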
In a Unicode string, I can see that a Unicode character has this form: "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
That is a unicode character escape sequence. It will be interpreted as a unicode character. The u after the escape character is part of the syntax and it's what differentiates it from other escape sequences. Read the documentation for more information.
So, what's the suitable object to handle Unicode characters?
char for utf-8
char16_t for utf-16
char32_t for utf-32
The size of wchar_t is platform-dependent, so you cannot make portable assumptions about which encoding it suits.
we can use const char* to store JSON that could have "wide characters" but those characters take more than a byte to be represented...
If you mean that you can store multi-byte utf-8 characters in a char string, then you're correct.
This is the kind of temporary arrangement I found for the moment (unicode->utf8?ascii?, listFolder is a std::string)
What you're attempting to do there is replace some Unicode characters with characters that have a platform-defined encoding. If you have other Unicode characters besides those, you end up with a string of mixed encoding. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert encodings or to do any other manipulation on encoded strings.

Understanding wchar_t type in c++

The Standard says N3797::3.9.1 [basic.fundamental]:
Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1).
I can't imagine how we can use that type. Could you give an example where plain char doesn't work? I thought it might be helpful if we used two different languages simultaneously. But plain char is OK in the case of Cyrillic and Latin:
#include <iostream>

char cp[] = "LATINICA_КИРИЛЛИЦА";

int main()
{
    std::cout << cp; // LATINICA_КИРИЛЛИЦА
}
In your example, you use Unicode. Indeed you could type not only Latin or Cyrillic but also Thai, Arabic, Chinese, in other words any Unicode symbol; the same example works with additional symbols.
The key point is the encoding. In your example you are using char to store Unicode symbols encoded in UTF-8. The main advantage of UTF-8 is backward compatibility with ASCII; its main disadvantage is that symbols have variable length.
There are other encodings for Unicode symbols. The most common besides UTF-8 are UTF-16 and UTF-32. Be aware that UTF-16 is still a variable-length encoding, although its code unit is now 16 bits. UTF-32 is a fixed-length encoding.
The type wchar_t is usually used to store symbols in UTF-16 or UTF-32 encoding, depending on the system.
It depends what encoding you decide to use. Any single UTF-8 value can be held in an 8-bit char (though one Unicode code-point can take several char values to represent). It's impossible to tell from your question, but I'd guess that your editor and compiler are treating your strings as UTF-8 and that's fine if that's what you want.
Other common encodings include UTF-16, UTF-32, UCS-2 and UCS-4, which have 2-byte, 4-byte, 2-byte and 4-byte values respectively. You can't store these values in an 8-bit char.
The decision of what encoding to use for any given purpose is not straightforward. The main considerations are:
What other systems does your code have to interface to and what encoding do they use?
What libraries do you want to use and what encodings do they use? (eg xerces-c uses UTF-16 throughout)
The tradeoff between complexity and storage size. UTF-32 and UCS-4 have the useful property that every possible displayed character is represented by one value, so you can tell the length of the string from how much memory it takes up without having to look at the values in it (though this assumes that you consider combining diacritic marks as separate characters). However, if all you're representing is ASCII, they take up four times as much memory as UTF-8.
I'd suggest Joel Spolsky's essay on Unicode as a good read.
wchar_t has its own problems, though. The standard didn't specify how big a wchar_t is, so, of course, different compilers have picked different sizes; VC++ uses two bytes and gcc (and most others) uses four. Wide-character literals, such as L"Hello, world," are similarly confused, being UTF-16 strings in VC++ and UCS-4 in gcc.
To try to clean this up, C++11 introduced two new character types:
char16_t is a character guaranteed to be 16-bits, and with a literal form u"Hello, world."
char32_t is a character guaranteed to be 32-bits, and with a literal form U"Hello, world."
However, these have problems of their own; in particular, <iostream> doesn't provide console streams that can handle them (ie there is no u16cout or u32cerr).
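For instance, to get a std::u16string onto std::cout you have to convert it yourself; a sketch using the deprecated facility mentioned elsewhere in this thread:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::u16string hello = u"Hello, world";
    // There is no u16cout, so convert to UTF-8 and use the narrow stream.
    // std::wstring_convert is deprecated since C++17; shown only as a sketch.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::cout << conv.to_bytes(hello) << '\n';
}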
To be more specific, I'll provide a normative reference related to the question. N3797 8.5.2/1 [dcl.init.string] says:
An array of narrow character type (3.9.1), char16_t array, char32_t
array, or wchar_t array can be initialized by a narrow string literal,
char16_t string literal, char32_t string literal, or wide string
literal, respectively, or by an appropriately-typed string literal
enclosed in braces (2.14.5). Successive characters of the value of the
string literal initialize the elements of the array.
8.5.2/2:
There shall not be more initializers than there are array elements.
In the case of the program below, sizeof(cp) is 28: in UTF-8 each of the nine ASCII characters of "LATINICA_" takes one byte, each of the nine Cyrillic characters takes two bytes, and the terminating null adds one more (9 + 18 + 1 = 28).
#include <iostream>

char cp[] = "LATINICA_КИРИЛЛИЦА";

int main()
{
    std::cout << sizeof(cp) << std::endl; // 28
}
For some languages, like English, it's not necessary to use wchar_t, but for other languages, like Chinese, you'd better use wchar_t.
Although char is able to store such a string, like char p[] = "你好",
it may show garbled text when you run your program on a different computer, especially one using a different default language or encoding.
If you use wchar_t, you can avoid this.

Can BSTR's hold characters that take more than 16 bits to represent?

I am confused about Windows BSTR's and WCHAR's, etc. WCHAR is a 16-bit character intended to allow for Unicode characters. What about characters that take more than 16 bits to represent? Some UTF-8 chars require more than that. Is this a limitation of Windows?
Edit: Thanks for all the answers. I think I understand the Unicode aspect. I am still confused on the Windows/WCHAR aspect though. If WCHAR is a 16-bit char, does Windows really use 2 of them to represent code points bigger than 16 bits, or is the data truncated?
UTF-8 is not the encoding used in Windows' BSTR or WCHAR types. Instead, they use UTF-16, which encodes each code point in the Unicode set as either 1 or 2 WCHARs. A surrogate pair of 2 WCHARs covers the same supplementary code points as a 4-byte UTF-8 sequence.
So there is no limitation in Windows character set handling.
UTF8 is an encoding of a Unicode character (codepoint). You may want to read this excellent faq on the subject. To answer your question though, BSTRs are always encoded as UTF-16. If you have UTF-32 encoded strings, you will have to transcode them first.
As others have mentioned, the FAQ has a lot of great information on unicode.
The short answer to your question, however, is that a single Unicode character may require more than one 16-bit code unit to represent it. This is also how UTF-8 works; any Unicode character that falls outside the range that a single byte is able to represent uses two (or more) bytes.
BSTR simply contains 16-bit code units that can contain any UTF-16 encoded data. As for the OS, Windows has supported surrogate pairs since XP. See the Dr. International FAQ.
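A small Windows-only sketch of that behaviour (assumes MSVC, where wchar_t is 16 bits, and linking against OleAut32.lib):

#include <windows.h>
#include <oleauto.h>
#include <iostream>

int main() {
    // U+1F600 is outside the BMP, so the wide literal holds a surrogate pair.
    BSTR b = SysAllocString(L"\U0001F600");
    std::cout << SysStringLen(b) << '\n';   // 2 -- length is counted in WCHAR code units
    SysFreeString(b);
}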
The Unicode standard defines somewhere over a million unique code-points (each code-point represents an 'abstract' character or symbol - e.g. 'E', '=' or '~').
The standard also defines several methods of encoding those million code-points into commonly used fundamental data types, such as 8-bit chars or 16-bit wchars.
The two most widely used encodings are utf-8 and utf-16.
utf-8 defines how to encode unicode code points into 8-bit chars. Each unicode code-point will map to between 1 and 4 8-bit chars.
utf-16 defines how to encode unicode code points into 16-bit words (WCHAR in Windows). Most code-points will map onto a single 16-bit WCHAR, but there are some that require two WCHARs to represent.
I recommend taking a look at the Unicode standard, and especially the FAQ (http://unicode.org/faq/utf_bom.html)
Windows has used UTF-16 as its native representation since Windows 2000; prior to that it used UCS-2. UTF-16 supports any Unicode character; UCS-2 only supports the BMP. i.e. it will do the right thing.
In general, though, it doesn't matter much, anyway. For most applications strings are pretty opaque, and just passed to some I/O mechanism (for storage in a file or database, or display on-screen, etc.) that will do the Right Thing. You just need to ensure you don't damage the strings at all.