conflicts: definition of wchar_t string in C++ standard and Windows implementation? - c++

From C++2003 §2.13
A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below
The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L’\0’.
From C++0x §2.14.5
A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.
The statement in C++2003 is quite vague. But in C++0x, when counting the length of the string, a wide string literal (wchar_t) is to be treated the same as char32_t, and differently from char16_t.
There's a post that clearly states how Windows implements wchar_t: https://stackoverflow.com/questions/402283?tab=votes%23tab-top
In short, wchar_t on Windows is 16 bits wide and encoded using UTF-16. This leaves the standard's wording in apparent conflict with the Windows implementation.
For example,
wchar_t kk[] = L"\U000E0005";
This code point does not fit in 16 bits, and UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.
However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the terminating \0).
But in internal storage, Windows needs three 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair, and 1 wchar_t for the \0. Therefore, by the definition of an array, kk is an array of 3 wchar_t.
The two results apparently conflict with each other.
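To make the conflict concrete, here is a small check one could run (a hedged sketch; the printed value depends on the implementation's wchar_t width):

#include <iostream>

int main() {
    wchar_t kk[] = L"\U000E0005";
    // On a conforming implementation with 32-bit wchar_t this prints 2
    // (one code point plus the terminating L'\0'); with MSVC's 16-bit,
    // UTF-16-encoded wchar_t it prints 3 (surrogate pair plus L'\0').
    std::cout << sizeof(kk) / sizeof(wchar_t) << '\n';
}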
I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t (i.e., to "ban" any Unicode character outside the BMP).
Is there anything wrong with my understanding?
Thanks.

The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.
Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.
The answer you linked to is somewhat misleading as well:
On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes
The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.

Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage, and it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever.)
If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.

Related

What can I store in a char?

I know that letters/one-digit-numbers can be stored in chars,
char letter = 'c';
char number = '1';
but can emojis or foreign letters be stored in a char? If not, how can I store them?
Is this possible without strings?
A char is typically 8 bits. It may be signed or unsigned (it's up to the compiler), so may have any integer value from -128 to 127 (for signed) or 0 to 255 (for unsigned). If a character can be encoded in that range then it can be stored in a single char variable.
There are also wide characters (wchar_t), whose size again depends on the compiler and operating system. They are usually at least 16 bits.
Then there are the explicit Unicode character types: char8_t for UTF-8 encoded characters (added in the C++20 standard, so it might not be widely available yet), char16_t for 16-bit characters in UTF-16 encoding, and char32_t for 32-bit characters in UTF-32 encoding.
For emojis, or just Unicode characters in general, a single char is usually not enough. Use either (multiple) chars/char8_ts in UTF-8 encoding, or (possibly multiple) char16_ts, or char32_t. Or, if you're targeting Windows and using the Windows API, it uses 16-bit wchar_t for UTF-16 encoded characters.
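For instance, here is a small sketch (assuming a C++11 or later compiler) of storing the same emoji as one char32_t code point versus a sequence of UTF-8 bytes held in plain chars:

#include <iostream>

int main() {
    char32_t emoji = U'\U0001F600';          // "grinning face" as a single 32-bit code point
    const char utf8[] = "\xF0\x9F\x98\x80";  // the same character as four UTF-8 bytes
    std::cout << std::hex << "code point: U+" << static_cast<unsigned long>(emoji) << '\n';
    std::cout << std::dec << "UTF-8 bytes: " << sizeof(utf8) - 1 << '\n';   // prints 4
}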

How does conversion between char and wchar_t work in Windows?

In Windows there are functions like mbstowcs to convert between char and wchar_t. There are also C++ facilities such as std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>>::from_bytes to use.
But how does this work behind the scenes, given that char and wchar_t are obviously of different sizes? I assume the system code page is involved in some way? But what happens if a wchar_t can't be mapped to a char (it can, after all, hold many more values)?
Also, what happens if code that has to use char (maybe due to a library) is moved between computers with different code pages? Say it only uses the digits 0-9, which are well within the range of ASCII; would that always be safe?
And finally, what happens on computers where the local language can't be represented in 256 characters? In that case the concept of char seems completely irrelevant, other than for storing, for example, UTF-8.
It all depends on the cvt facet used, as described here
In your case (std::codecvt<wchar_t, char, std::mbstate_t>), it all boils down to mbsrtowcs / wcsrtombs using the global locale (that is, the "C" locale, unless you replace it with the system one).
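As a rough illustration of that path, here is a minimal sketch using the C functions directly (assuming the system locale can represent the input; error handling kept to a bare minimum):

#include <clocale>
#include <cstdlib>
#include <cwchar>

int main() {
    // Switch from the default "C" locale to the user's/system locale,
    // so the conversion honours the system code page (or UTF-8 on most Linux systems).
    std::setlocale(LC_ALL, "");

    const char narrow[] = "hello";
    wchar_t wide[32];
    std::size_t n = std::mbstowcs(wide, narrow, 32);   // returns (size_t)-1 on an invalid sequence
    if (n != static_cast<std::size_t>(-1))
        std::wprintf(L"converted %zu wide characters\n", n);
}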
I don't know about mbstowcs() but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter operates in terms of two types:
A character type cT, which in your code is wchar_t.
A byte type bT which is normally char.
The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state, so any state between calls needs to be kept somewhere. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating your own code conversion facets.
Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF-8, and characters. Originally, each character was meant to be a stand-alone entity, but various reasons (primarily from outside the C++ community, notably from changes made to Unicode) have resulted in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF-8 for char and UTF-16 or UCS-4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).
The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF-8, the incoming bytes are converted to 32-bit code points, which are then stuck into UTF-16 characters by splitting them up into two wchar_t values when necessary (e.g., when wchar_t is 16 bits).
The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.
If what is in the char array is actually a UTF-8 encoded string, then you can convert it to and from a UTF-16 encoded wchar_t array using
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;      // note: deprecated since C++17, but still widely available
std::string narrow = converter.to_bytes(wide_utf16_source_string);     // UTF-16 -> UTF-8
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);   // UTF-8 -> UTF-16
as described in more detail at https://stackoverflow.com/a/18597384/6345

Confusing sizeof(char) by ISO/IEC in different character set encoding like UTF-16

Assume that a program is running on a system with a UTF-16 encoded character set. According to The C++ Programming Language, 4th edition, page 150:
A char can hold a character of the machine’s character set.
→ I think that a char variable would then have a size of 2 bytes.
But according to ISO/IEC 14882:2014:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
or The C++ Programming Language - 4th, page 149:
"[...], so by definition the size of a char is 1"
→ So the size is fixed at 1.
Question: Is there a conflict between the statements above, or is sizeof(char) = 1 just a default (definition) value that is implementation-defined and depends on each system?
The C++ standard (and C, for that matter) effectively define byte as the size of a char type, not as an eight-bit quantity1. As per C++11 1.7/1 (my bold):
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined.
Hence the expression sizeof(char) is always 1, no matter what.
If you want to see whether your baseline char variable (probably the unsigned variant would be best) can actually hold a 16-bit value, the item you want to look at is CHAR_BIT from <climits>. This holds the number of bits in a char variable.
1 Many standards, especially ones related to communications protocols, use the more exact term octet for an eight-bit value.
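For the record, a quick way to inspect these values on any given implementation (a trivial sketch; the numbers are whatever your platform defines):

#include <climits>
#include <cstdio>

int main() {
    std::printf("CHAR_BIT        = %d\n", CHAR_BIT);         // at least 8; 8 on mainstream platforms
    std::printf("sizeof(char)    = %zu\n", sizeof(char));    // always 1, by definition
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); // typically 2 on MSVC, 4 on most Unix-like systems
}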
Yes, there are a number of serious conflicts and problems with the C++ conflation of rôles for char, but also the question conflates a few things. So a simple direct answer would be like answering “yes”, “no” or “don’t know” to the question “have you stopped beating your wife?”. The only direct answer is the buddhist “mu”, unasking the question.
So let's therefore start with a look at the facts.
Facts about the char type.
The number of bits per char is given by the implementation defined CHAR_BIT from the <limits.h> header. This number is guaranteed to be 8 or larger. With C++03 and earlier that guarantee came from the specification of that symbol in the C89 standard, which the C++ standard noted (in a non-normative section, but still) as “incorporated”. With C++11 and later the C++ standard explicitly, on its own, gives the ≥8 guarantee. On most platforms CHAR_BIT is 8, but on some probably still extant Texas Instruments digital signal processors it’s 16, and other values have been used.
Regardless of the value of CHAR_BIT the sizeof(char) is by definition 1, i.e. it's not implementation defined:
C++11 §5.3.3/1 (in [expr.sizeof]):
” sizeof(char), sizeof(signed char) and
sizeof(unsigned char) are 1.
That is, char and its variants are the fundamental unit of addressing of memory, which is the primary meaning of byte, both in common speech and formally in C++:
C++11 §1.7/1 (in [intro.memory]):
” The fundamental storage unit in the C ++ memory model is the byte.
This means that on the aforementioned TI DSPs, there is no C++ way of obtaining pointers to individual octets (8-bit parts). And that in turn means that code that needs to deal with endianness, or in other ways needs to treat char values as sequences of octets, in particular for network communications, needs to do things with char values that are not meaningful on a system where CHAR_BIT is 8. It also means that ordinary C++ narrow string literals, if they adhere to the standard, and if the platform's standard software uses an 8-bit character encoding, will waste memory.
The waste aspect was (or is) directly addressed in the Pascal language, which differentiates between packed strings (multiple octets per byte) and unpacked strings (one octet per byte), where the former is used for passive text storage, and the latter is used for efficient processing.
This illustrates the basic conflation of three aspects in the single C++ type char:
unit of memory addressing, a.k.a. byte,
smallest basic type (it would be nice with an octet type), and
character encoding value unit.
And yes, this is a conflict.
Facts about UTF-16 encoding.
Unicode is a large set of 21-bit code points, most of which constitute characters on their own, but some of which are combined with others to form characters. E.g. a character with accent like “é” can be formed by combining code points for “e” and “´”-as-accent. And since that’s a general mechanism it means that a Unicode character can be an arbitrary number of code points, although it’s usually just 1.
UTF-16 encoding was originally a compatibility scheme for code based on original Unicode’s 16 bits per code point, when Unicode was extended to 21 bits per code point. The basic scheme is that code points in the defined ranges of original Unicode are represented as themselves, while each new Unicode code point is represented as a surrogate pair of 16-bit values. A small range of original Unicode is used for surrogate pair values.
At the time, examples of software based on 16 bits per code point included 32-bit Windows and the Java language.
On a system with 8-bit bytes, UTF-16 is an example of a wide text encoding, i.e. an encoding with an encoding unit wider than the basic addressable unit. Byte-oriented text encodings are then known as narrow text. On such a system C++ char fits the latter, but not the former.
In C++03 the only built-in type suitable for the wide text encoding unit was wchar_t.
However, the C++ standard effectively requires wchar_t to be suitable for a code-point, which for modern 21-bits-per-code-point Unicode means that it needs to be 32 bits. Thus there is no C++03 dedicated type that fits the requirements of UTF-16 encoding values, 16 bits per value. Due to historical reasons the most prevalent system based on UTF-16 as wide text encoding, namely Microsoft Windows, defines wchar_t as 16 bits, which after the extension of Unicode has been in flagrant contradiction with the standard, but then, the standard is impractical regarding this issue. Some platforms define wchar_t as 32 bits.
C++11 introduced new types char16_t and char32_t, where the former is (designed to be) suitable for UTF-16 encoding values.
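To make the surrogate pair mechanism concrete, here is a small sketch of the UTF-16 encoding of one code point above U+FFFF (the constants are the standard UTF-16 ones; this is an illustration, not production code):

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t cp = 0x1F600;                 // a code point outside the BMP
    std::uint32_t v  = cp - 0x10000;            // 20 bits remain after subtracting the offset
    std::uint16_t hi = 0xD800 + (v >> 10);      // high (lead) surrogate
    std::uint16_t lo = 0xDC00 + (v & 0x3FF);    // low (trail) surrogate
    std::printf("U+%04X -> %04X %04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);  // prints U+1F600 -> D83D DE00
}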
About the question.
Regarding the question’s stated assumption of
” a system with UTF-16 encoding character set
this can mean one of two things:
a system with UTF-16 as the standard narrow encoding, or
a system with UTF-16 as the standard wide encoding.
With UTF-16 as the standard narrow encoding CHAR_BIT ≥ 16, and (by definition) sizeof(char) = 1. I do not know of any such system; it appears to be hypothetical. Yet it appears to be the meaning tacitly assumed in the other current answers.
With UTF-16 as the standard wide encoding, as in Windows, the situation is more complex, because the C++ standard is not up to the task. But, to use Windows as an example, one practical possibility is that sizeof(wchar_t) = 2. And one should just note that the standard is in conflict with existing practice and practical considerations for this issue, when the ideal is that standards instead standardize existing practice, where there is such.
Now finally we’re in a position to deal with the question,
” Is there a conflict between these statements above or is the sizeof(char) = 1 just a default (definition) value and will be implementation-defined depends on each system?
This is a false dichotomy. The two possibilities are not opposites. We have
There is indeed a conflict between char as character encoding unit and as a memory addressing unit (byte). As noted, the Pascal language has the keyword packed to deal with one aspect of that conflict, namely storage versus processing requirements. And there is a further conflict between the formal requirements on wchar_t, and its use for UTF-16 encoding in the most widely used system that employs UTF-16 encoding, namely Windows.
sizeof(char) = 1 by definition: it's not system-dependent.
CHAR_BIT is implementation defined, and is guaranteed ≥ 8.
No, there's no conflict. These two statements refer to different definitions of byte.
UTF-16 implies that byte is the same thing as octet - a group of 8 bits.
In C++ language byte is the same thing as char. There's no limitation on how many bits a C++-byte can contain. The number of bits in C++-byte is defined by CHAR_BIT macro constant.
If your C++ implementation decides to use 16 bits to represent each character, then CHAR_BIT will be 16 and each C++-byte will occupy two UTF-16-bytes. sizeof(char) will still be 1 and sizes of all objects will be measured in terms of 16-bit bytes.
A char is defined as being 1 byte. A byte is the smallest addressable unit. This is 8 bits on common systems, but on some systems it is 16 bits, or 32 bits, or anything else (but must be at least 8 for C++).
It is somewhat confusing because in popular jargon byte is used for what is technically known as an octet (8 bits).
So, your second and third quotes are correct. The first quote is, strictly speaking, not correct.
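If code relies on the common case of octet-sized bytes, it can state that assumption explicitly (a one-line sketch):

#include <climits>

static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes (octets)");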
As defined by [intro.memory]/1 in the C++ Standard, char only needs to be able to hold the basic execution character set which is approximately 100 characters (all of which appear in the 0 - 127 range of ASCII), and the octets that make up UTF-8 encoding. Perhaps that is what the author meant by machine character set.
On a system where the hardware is octet addressable but the character set is Unicode, it is likely that char will remain 8-bit. However there are types char16_t and char32_t (added in C++11) which are designed to be used in your code instead of char for systems that have 16-bit or 32-bit character sets.
So, if the system goes with char16_t then you would use std::basic_string<char16_t> instead of std::string, and so on.
Exactly how UTF-16 should be handled will depend on the detail of the implementation chosen by the system. Unicode is a 21-bit character set and UTF-16 is a multibyte encoding of it; so the system could go the Windows-like route and use std::basic_string<char16_t> with UTF-16 encoding for strings; or it could go for std::basic_string<char32_t> with raw Unicode code points as the characters.
Alf's post goes into more detail on some of the issues that can arise.
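A short sketch of the difference between those two choices for one character outside the BMP (counting code units, not characters):

#include <iostream>
#include <string>

int main() {
    std::u16string s16 = u"\U0001F600";   // UTF-16: stored as a surrogate pair -> 2 code units
    std::u32string s32 = U"\U0001F600";   // UTF-32: one code point -> 1 code unit
    std::cout << s16.size() << ' ' << s32.size() << '\n';   // prints "2 1"
}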
Without quoting the standard it is easy to give a simple answer, because:
A byte is not defined as 8 bits. A byte is the smallest addressable unit of memory, whatever its size. Most commonly it is 8 bits, but there is no reason not to have a 16-bit byte.
The C++ standard adds one restriction: a byte must be at least 8 bits.
So there is no problem with sizeof(char) always being 1, no matter what. Sometimes that one byte will be 8 bits, sometimes 16 bits, and so on.

char vs wchar_t vs char16_t vs char32_t (c++11)

From what I understand, a char is safe to house ASCII characters whereas char16_t and char32_t are safe to house characters from unicode, one for the 16-bit variety and another for the 32-bit variety (Should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind the wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee to be bigger than a char? Clarification would be nice!
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that then you could manipulate wchar_t strings just like the simple algorithms used with ASCII.
For example, converting ascii to upper case goes like:
auto loc = std::locale("");
char s[] = "hello";
for (char &c : s) {
c = toupper(c, loc);
}
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
auto loc = std::locale("");
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
c = toupper(c, loc);
}
So every wchar_t is a 'character', and if it has an uppercase version then it can be directly converted. Unfortunately this doesn't really work all the time; for example, there are oddities in some languages, such as the German letter ß, where the uppercase version is actually the two characters SS instead of a single character.
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
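As an illustration of converting at the API boundary on Windows, here is a minimal sketch using the Win32 MultiByteToWideChar function (Windows-only, requires C++17 for the non-const data(), and error handling is omitted):

#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring for use with W-suffixed Win32 APIs.
std::wstring utf8_to_utf16(const std::string& in) {
    int n = MultiByteToWideChar(CP_UTF8, 0, in.data(), static_cast<int>(in.size()), nullptr, 0);
    std::wstring out(static_cast<std::wstring::size_type>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.data(), static_cast<int>(in.size()), out.data(), n);
    return out;
}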
The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode uses more than 16 bits (it needs 21 bits per code point), it was felt that we should have better character types.
The intent for char16_t is to represent UTF-16 and char32_t is meant to directly represent Unicode code points. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it is more work than just instantiating the stream for these types.

What is the use of wchar_t in general programming?

Today I was learning some C++ basics and came to know about wchar_t. I was not able to figure out, why do we actually need this datatype, and how do I use it?
wchar_t is intended for representing text in fixed-width, multi-byte encodings; since wchar_t is usually 2 bytes in size it can be used to represent text in any 2-byte encoding. It can also be used for representing text in variable-width multi-byte encodings of which the most common is UTF-16.
On platforms where wchar_t is 4 bytes in size it can be used to represent any text using UCS-4 (Unicode), but since on most platforms it's only 2 bytes it can only represent Unicode in a variable-width encoding (usually UTF-16). It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
About the only modern operating system to use wchar_t extensively is Windows; this is because Windows adopted Unicode before it was extended past U+FFFF and so a fixed-width 2-byte encoding (UCS-2) appeared sensible. Now UCS-2 is insufficient to represent the whole of Unicode and so Windows uses UTF-16, still with wchar_t 2-byte code units.
wchar_t is a wide character. It is used to represent characters which require more memory to represent them than a regular char. It is, for example, widely used in the Windows API.
However, the size of a wchar_t is implementation-dependant and not guaranteed to be larger than char. If you need to support a specific form of character format greater than 8 bits, you may want to turn to char32_t and char16_t which are guaranteed to be 32 and 16 bits respectively.
wchar_t is used when you need to store characters with codes greater than 255 (values greater than char can store).
char can take 256 different values, which correspond to entries in the ISO Latin tables. On the other hand, a wide char can take 65536 or more values, which correspond to Unicode code points. Unicode is an international standard that allows the encoding of characters for virtually all languages and commonly used symbols.
I know most people have already answered this, but as I was also learning C++ basics and came across wchar_t, I would like to share what I understood after reading about it.
wchar_t is used when you need to store a character whose code is above 255 (beyond the extended ASCII range), because such characters have a greater size than our character type 'char' and hence require more memory.
e.g.:
const wchar_t *var = L"Привет мир\n"; // "hello world" in Russian
It generally has a size greater than an 8-bit char.
The Windows operating system uses it substantially.
It is usually used when there is a foreign language involved.
The wchar_t data type is used for wide characters; depending on the platform it occupies 2 or 4 bytes.
The wchar_t datatype is mostly used when international languages like Japanese are involved.
The wchar_t type is used for characters of extended character sets. It is, among other uses, used with wstring, which is a string that can hold single characters of extended character sets, as opposed to string, which might hold single characters of size char, or use more than one char to represent a single sign (as UTF-8 does).
The size of wchar_t is implementation-dependent; the standard says it must be able to represent all members of the largest extended character set among the supported locales.
wchar_t is specified in the C++ language in [basic.fundamental]/p5 as:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales ([locale]).
In other words, wchar_t is a data type which makes it possible to work with text containing characters from any language without worrying about character encoding.
On platforms that support Unicode above the basic multilingual plane, wchar_t is usually 4 bytes (Linux, BSD, macOS).
Only on Windows is wchar_t 2 bytes, encoded with UTF-16LE, due to historical reasons (Windows initially supported only UCS-2).
In practice, the "1 wchar_t = 1 character" concept becomes even more complicated, due to Unicode supporting combining characters and graphemes (characters represented by sequences of code points).
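For example, a minimal sketch of why "1 code unit = 1 character" (and even "1 code point = 1 character") does not hold:

#include <iostream>
#include <string>

int main() {
    std::u32string precomposed = U"\u00E9";    // "é" as one code point (U+00E9)
    std::u32string combining   = U"e\u0301";   // "e" followed by U+0301 COMBINING ACUTE ACCENT
    // Both render as the same grapheme, yet the code-unit counts differ:
    std::cout << precomposed.size() << ' ' << combining.size() << '\n';   // prints "1 2"
}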