Issue regarding the char datatype in C++

I understand that the char datatype is used to store a single character and takes 1 byte, but what are char16_t, char32_t, and wchar_t used for? After all, we only need to store a single character.

Regarding char16_t and char32_t, quoting from a Microsoft article:
The char16_t and char32_t types represent 16-bit and 32-bit wide characters, respectively. Unicode encoded as UTF-16 can be stored in the char16_t type, and Unicode encoded as UTF-32 can be stored in the char32_t type. Strings of these types and wchar_t are all referred to as wide strings, though the term often refers specifically to strings of wchar_t type.
And for wchar_t:
The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems. The wide character versions of the Universal C Runtime (UCRT) library functions use wchar_t and its pointer and array types as parameters and return values, as do the wide character versions of the native Windows API.
So they cannot be described as simply "a character": as the quotes above show, each type corresponds to a particular encoding.
For example, the character u (U+0075) is stored in a char16_t as the single 16-bit code unit 0x0075; a UTF-16 byte stream may additionally begin with the byte order mark 0xFEFF, which is where a dump such as feff 0075 comes from.
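A quick way to see the differences is to compare the literal prefixes and the object sizes (a minimal sketch; the sizes mentioned in the comments are typical for Windows and for most Unix-like systems, not guarantees):

#include <cstdio>

int main()
{
    char     narrow = 'u';   // ordinary character, 1 byte
    wchar_t  wide   = L'u';  // implementation-defined width: 2 bytes on Windows, 4 on most Unix-like systems
    char16_t c16    = u'u';  // one UTF-16 code unit, at least 16 bits
    char32_t c32    = U'u';  // one UTF-32 code unit, at least 32 bits

    std::printf("%zu %zu %zu %zu\n",
                sizeof narrow, sizeof wide, sizeof c16, sizeof c32);
}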

Related

Is 16-bit wchar_t formally valid for representing full Unicode?

In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where two such values (a "surrogate pair") are sometimes needed for a single Unicode code point, is invalid for representing Unicode.
It's certainly inconvenient and in conflict with the assumption of the C and C++ standard libraries (e.g. character classification) that each code point is represented as a single value, although the Unicode consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does.
And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the amendment that brought wchar_t into the C standard in 1995, the authors maintain that
” The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.
But as it turned out, the C and C++ standards seem to not talk about the largest supported character, but only about the largest extended character sets in the supported locales: that wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.
C99 §7.17/2 (from the N869 draft):
” [the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.
This is almost identically the same wording as in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be smallish indeed, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following is a requirement to be wider than that:
C99 §7.1.1/4:
” A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.
… since it refers to the extended character set, but that term seems to not be further defined anywhere.
And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation setlocale is restricted to character encodings that have at most 2 bytes per character:
MSDN ⁴documentation of setlocale:
” The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
So it seems that contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or any locale with more than 2 bytes per character. But is it really so, is 16-bit wchar_t OK?
Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.
wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.
So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.
Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.
Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
If a compiler wishes to declare that the wide character set consists of the Basic Multilingual Plane of Unicode, then a wide-character literal like L'\U0001F000' will still produce a single wchar_t, but its value is implementation-defined, per [lex.ccon]/2:
The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.
And of course, C++ does not allow a surrogate code point to be used as a c-char; \uD800 is a compile error.
Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.
Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (two surrogate code units, i.e. one surrogate pair, plus the NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?
I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make it clearer what the conversion behavior in this case ought to be.
In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.
Now obviously, you can stick whatever will fit inside a wchar_t. So even on platforms where wchar_t is 32 bits, you could shove UTF-16 data into it, with each 16-bit code unit taking up 32 bits. But you couldn't pass such text to any API function that expects the wide-character encoding unless you knew that this was the expected encoding for that platform.
Basically, never use wchar_t if you want to work with a Unicode encoding.
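As a concrete illustration of the implementation-defined behavior discussed above, here is a minimal sketch; the values in the comments reflect common implementations and are not mandated by the standard:

#include <iostream>

int main()
{
    // Number of wchar_t elements in the literal, including the terminating L'\0'.
    // Typically 3 with a 16-bit wchar_t (Windows: surrogate pair + NUL)
    // and 2 with a 32-bit wchar_t (most Unix-like systems: one code point + NUL).
    std::cout << sizeof(L"\U0001F000") / sizeof(wchar_t) << '\n';
}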
Let's start from first principles:
(§3.7.3) wide character: bit representation that fits in an object of type
wchar_t, capable of representing any character in the current locale
(§3.7) character: 〈abstract〉 member of a set of elements used for the
organization, control, or representation of data
That, right away, discards full Unicode as a character set (a set of elements/characters) representable on 16-bit wchar_t.
But wait, Nicol Bolas quoted the following:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:
(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
and further clarifies in a footnote that not all source characters need to map to the same execution character.
Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates as proper characters themselves, not as mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 cares.
Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates. And you'd get some nonsensical result, but that's fine, because you declared the surrogates themselves to be proper members of the execution character set.
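To make that concrete, here is a minimal sketch; the results are locale- and implementation-dependent, which is exactly the problem being described:

#include <cwctype>
#include <cstdio>

int main()
{
    // U+1F600 encoded as a UTF-16 surrogate pair.
    wint_t high = 0xD83D, low = 0xDE00;
    // The calls are well-formed, but classifying lone surrogates
    // tells you nothing useful about the astral character U+1F600.
    std::printf("%d %d\n", std::iswalpha(high) != 0, std::iswalpha(low) != 0);
}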
Edited after the question was clarified.
Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?
A: Well, let's see. We will start with the definition of wchar_t from the C99 draft:
... largest extended character set specified among the supported locales.
So we should look at what the supported locales are. There are three steps:
We check the documentation for setlocale
We quickly open the documentation for the locale string. We see the format of the string
locale :: "locale_name"
| "language[_country_region[.code_page]]"
| ".code_page"
| "C"
| ""
| NULL
We see the list of supported code pages, which includes UTF-8, UTF-16, UTF-32 and so on. We're at a dead end.
If we start with the C99 definition, it ends with
... corresponds to a member of the extended character set.
The word "character set" is used. But if we say UTF-16 code units are our character set, then all is OK. Otherwise, it's not. It's kinda vague, and one should not care much. The standards were defined many years ago, when Unicode was not popular.
At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.
You need to read about Unicode and you will answer the question yourself.
Unicode is a character set: a set of well over 100,000 characters. More precisely, it is a mapping between numbers (code points) and characters. Unicode by itself does not prescribe any particular bit width.
Then there are the encodings: UTF-8, UTF-16 and UTF-32 (plus the obsolete UTF-7). UTF stands for Unicode Transformation Format.
Each format works in terms of code points and code units. A code point is an actual character from Unicode and may be encoded as one or more code units; only UTF-32 uses exactly one unit per point.
Each code unit, in turn, fits into a fixed-size integer: UTF-8 units are 8 bits, UTF-16 units are 16 bits, and so on.
Therefore, a string of 16-bit wchar_t can hold Unicode text encoded as UTF-16; in UTF-16 each code point takes one or two units.
So the final answer: in a single wchar_t you cannot store every Unicode character, only those that fit in one code unit, but in a string of wchar_t you can store any Unicode text.
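For example (a sketch; the count in the comment assumes a platform where wchar_t is 16 bits, such as Windows):

#include <string>
#include <iostream>

int main()
{
    // U+1F600 does not fit in a single 16-bit wchar_t, but the string
    // holds it as two UTF-16 code units (a surrogate pair).
    std::wstring s = L"\U0001F600";
    std::wcout << s.size() << L'\n';   // 2 with a 16-bit wchar_t, 1 with a 32-bit wchar_t
}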

using unicode in a C++ program

I want strings with Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hexadecimal digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\', and what is its effect?)
On the internet I see examples using "wide strings" or wchar_t.
So, what is the suitable object for handling Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store a JSON document that could contain "wide characters", but those characters take more than one byte to represent... I don't understand...
This is the kind of temporary workaround I have found for the moment (Unicode -> UTF-8? ASCII? listFolder is a std::string):
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which turned out to be a failure because Microsoft's choice of a two-byte wchar_t proved insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instances of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that map to more than one replacement character (e.g. the German ß needs to become SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
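For instance, the multi-character case mapping mentioned above is a one-liner in ICU (a sketch assuming ICU4C is installed and the program is linked against the ICU common library, typically -licuuc, and that the source and execution character sets are UTF-8):

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("straße");
    s.toUpper();                 // full Unicode case mapping: ß becomes SS
    std::string out;
    s.toUTF8String(out);
    std::cout << out << '\n';    // prints STRASSE
}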
\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number naming a code point in the Basic Multilingual Plane (which, within the BMP, is the same as a single UTF-16 code unit).
The XXXXXXXX is a 32-bit hex number naming a Unicode code point in any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding.
u"..." is always a const uchar16_t[] in UTF-16 encoding.
U"..." is always a const uchar32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
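A short sketch of what those literals give you; note that the array sizes are code-unit counts, not character counts (and that C++20 later changed u8 literals to the separate char8_t type):

#include <cstdio>

int main()
{
    // \u00e9 (é) is 2 UTF-8 code units, 1 UTF-16 unit, 1 UTF-32 unit.
    const char     s8[]  = u8"\u00e9";   // valid up to C++17; C++20 uses char8_t
    const char16_t s16[] = u"\u00e9";
    const char32_t s32[] = U"\u00e9";
    std::printf("%zu %zu %zu\n",
                sizeof s8  / sizeof s8[0],    // 3  (2 bytes + NUL)
                sizeof s16 / sizeof s16[0],   // 2  (1 unit + NUL)
                sizeof s32 / sizeof s32[0]);  // 2  (1 unit + NUL)
}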
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hexadecimal digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\', and what is its effect?)
That is a Unicode character escape sequence. It will be interpreted as a Unicode character. The u after the backslash is part of the syntax; it is what differentiates it from other escape sequences. Read the documentation for more information.
So, what is the suitable object for handling Unicode characters?
char for UTF-8
char16_t for UTF-16
char32_t for UTF-32
The size of wchar_t is platform-dependent, so you cannot make portable assumptions about which encoding it suits.
we can use const char* to store a JSON document that could contain "wide characters", but those characters take more than one byte to represent...
If you mean that you can store multi-byte UTF-8 characters in a char string, then you're correct.
This is the kind of temporary workaround I have found for the moment (Unicode -> UTF-8? ASCII? listFolder is a std::string)
What you are attempting to do there is replace some Unicode escape sequences with characters in a platform-defined encoding. If you have other Unicode characters besides those, you end up with a string of mixed encoding. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert between encodings, and for any other manipulation of encoded strings.
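As an example of leaning on a library instead of hand-written replacements, here is a sketch using std::wstring_convert (standard since C++11, deprecated in C++17 but still available); in a real program you would normally let the JSON library decode the \uXXXX escapes for you:

#include <codecvt>
#include <locale>
#include <string>
#include <iostream>

int main()
{
    // Convert UTF-16 text (for example, the code points behind \u00e0 \u00e2 \u00e4)
    // into a UTF-8 encoded std::string.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string utf16 = u"\u00e0\u00e2\u00e4";
    std::string utf8 = conv.to_bytes(utf16);
    std::cout << utf8 << '\n';   // prints àâä in a UTF-8 terminal
}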

Understanding wchar_t type in c++

The Standard says N3797::3.9.1 [basic.fundamental]:
Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1).
I can't imagine how we can use that type. Could you give an example where plain char doesn't work? I thought it might be helpful when using two different languages simultaneously, but plain char seems fine for Cyrillic and Latin:
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << cp; //LATINICA_КИРИЛЛИЦА
}
In your example you are using Unicode. Indeed, you could type not only Latin or Cyrillic but also Thai, Arabic or Chinese, in other words any Unicode symbols; the same example works with symbols from those scripts as well.
The key is the encoding. In your example you are using char to store Unicode symbols encoded in UTF-8. The main advantage of UTF-8 is backward compatibility with ASCII; the main disadvantage is the variable length of encoded characters.
There are other encodings for Unicode. The most common besides UTF-8 are UTF-16 and UTF-32. Be aware that UTF-16 is still a variable-length encoding, although its code unit is 16 bits; UTF-32 is a fixed-length encoding.
The type wchar_t is usually used to store symbols in UTF-16 or UTF-32 encoding depending on the system.
It depends on what encoding you decide to use. Any single UTF-8 code unit can be held in an 8-bit char (though one Unicode code point can take several char values to represent). It's impossible to tell from your question, but I'd guess that your editor and compiler are treating your strings as UTF-8, and that's fine if that's what you want.
Other common encodings include UTF-16, UTF-32, UCS-2 and UCS-4, which have 2-byte, 4-byte, 2-byte and 4-byte values respectively. You can't store these values in an 8-bit char.
The decision of what encoding to use for any given purpose is not straightforward. The main considerations are:
What other systems does your code have to interface to and what encoding do they use?
What libraries do you want to use and what encodings do they use? (eg xerces-c uses UTF-16 throughout)
The tradeoff between complexity and storage size. UTF-32 and UCS-4 have the useful property that every possible displayed character is represented by one value, so you can tell the length of the string from how much memory it takes up without having to look at the values in it (though this assumes that you consider combining diacritic marks as separate characters). However, if all you're representing is ASCII, they take up four times as much memory as UTF-8.
I'd suggest Joel Spolsky's essay on Unicode as a good read.
wchar_t has its own problems, though. The standard didn't specify how big a wchar_t is, so, of course, different compilers have picked different sizes; VC++ uses two bytes and gcc (and most others) uses four bytes. Wide-character literals, such as L"Hello, world", are similarly confused, being UTF-16 strings in VC++ and UCS-4 in gcc.
To try to clean this up, C++11 introduced two new character types:
char16_t is a character type guaranteed to be at least 16 bits wide, with the literal form u"Hello, world."
char32_t is a character type guaranteed to be at least 32 bits wide, with the literal form U"Hello, world."
However, these have problems of their own; in particular, <iostream> doesn't provide console streams that can handle them (ie there is no u16cout or u32cerr).
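For example (a sketch; the behavior described in the comments assumes a pre-C++20 implementation):

#include <iostream>
#include <string>

int main()
{
    std::u16string s = u"Hello, world.";
    // There is no u16cout. The line below does not print the text:
    // before C++20 it selects the const void* overload and prints a
    // pointer value; since C++20 that overload is deleted and it does
    // not compile. In practice you convert to char/UTF-8 first.
    std::cout << s.c_str() << '\n';
}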
To be more specific, I'll provide a normative reference related to the question. N3797 8.5.2/1 [dcl.init.string] says:
An array of narrow character type (3.9.1), char16_t array, char32_t
array, or wchar_t array can be initialized by a narrow string literal,
char16_t string literal, char32_t string literal, or wide string
literal, respectively, or by an appropriately-typed string literal
enclosed in braces (2.14.5). Successive characters of the value of the
string literal initialize the elements of the array.
8.5.2/2:
There shall not be more initializers than there are array elements.
In the case of
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << sizeof(cp) << std::endl; //28
}
sizeof(cp) is 28 rather than 19 because each Cyrillic character in the UTF-8-encoded narrow string literal occupies two bytes, and each of those bytes initializes one element of the char array.
For some languages, like English, it is not necessary to use wchar_t, but for others, like Chinese, you are better off using wchar_t.
Although char is able to store such a string, e.g. char p[] = "你好",
it may display as garbled text when you run your program on a different computer, especially one using a different system language.
If you use wchar_t, you can avoid this.
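A minimal sketch of that wide-character approach (assuming the program runs under a locale, and on a terminal, that can actually display Chinese text; on Windows additional setup such as _setmode is typically needed):

#include <clocale>
#include <iostream>

int main()
{
    std::setlocale(LC_ALL, "");   // adopt the environment's locale for wide output
    wchar_t p[] = L"你好";
    std::wcout << p << L'\n';
}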

How to use Loadstring to load Chinese Characters

I have a string table which defines a string in Chinese like this:
STRINGTABLE
LANGUAGE 0x0C04, 0x03
BEGIN
1000 "检查环境..."
...
END
I am trying to load that string into a wchar_t buffer as follows:
#define UNICODE
#define _UNICODE
#include <windows.h>   // the UNICODE macros must be defined before including <windows.h>

wchar_t buffer[512];
LoadString(DLL_HANDLE, (UINT) msg_num, buffer, 512);
MessageBox(NULL, buffer, NULL, NULL);
However, the string that is loaded into the buffer is different than the one that is in my string table.
It looks like this in my string table:
检查环境...
But this is how it turns out on screen:
環境をãƒã‚§ãƒƒã‚¯ä¸­...
Doesn't the MessageBox function work on narrow strings by default? Wouldn't you need to use MessageBoxW?
Edit:
A couple of things to check. The encoding of L"..." strings is implementation-defined. The standard says nothing about the encoding of wchar_t characters; make sure you're using the same encoding as Windows expects. (If I recall correctly, Windows expects UTF-16, but I may well be wrong on this.)
In C++11, three new string literal types are introduced, with the prefixes "u8", "u" and "U", which specify UTF-8, UTF-16 and UTF-32, respectively. As far as I can tell, C++11 still makes no guarantees about the encoding behind the "L" prefix, other than what is mentioned in §2.14.3:
A character literal that begins with the letter L, such as L’x’, is a wide-character literal. A wide-character
literal has type wchar_t.23 The value of a wide-character literal containing a single c-char has value equal
to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char
has no representation in the execution wide-character set, in which case the value is implementation-defined.
[ Note: The type wchar_t is able to represent all members of the execution wide-character set (see 3.9.1).
—end note ]. The value of a wide-character literal containing multiple c-chars is implementation-defined.
Reference §3.9.1 P5 states:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest
extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying
type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as
uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types.
Again, no mention of encoding. It is possible that Windows is expecting a different encoding than what your resource string is using, hence the discrepancy.
You might check by calling MessageBox with an L"..." string literal that uses "\UXXXXXXXX" escapes for your characters.
The MSDN documentation states that the format should be similar to
IDS_CHINESESTRING L"\x5e2e\x52a9". That's not the most formal of descriptions. I interpret it as stating that Unicode strings must be prefixed with L and encoded using hexadecimal escape codes such as \x5e2e.
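One way to take the UNICODE/_UNICODE macros out of the picture entirely is to call the wide-character API functions explicitly (a sketch; dll_handle and msg_num stand in for the asker's module handle and message number):

#include <windows.h>

void show_resource_string(HINSTANCE dll_handle, UINT msg_num)
{
    wchar_t buffer[512];
    // LoadStringW always fills the buffer with UTF-16 text,
    // regardless of whether UNICODE is defined.
    if (LoadStringW(dll_handle, msg_num, buffer, 512) > 0)
        MessageBoxW(NULL, buffer, L"Message", MB_OK);
}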

What is the difference between "UTF-16" and "std::wstring"?

Is there any difference between these two string storage formats?
std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.
UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.
Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).
The encoding used with wchar_t does not necessarily have to be UTF-16—it could also be UTF-32 for example.
UTF-16 is a way of representing text as a sequence of 16-bit elements, but an actual character may consist of more than one element.
std::wstring is just a collection of wchar_t elements, and is a class primarily concerned with their storage.
The element type of a wstring, wchar_t, is at least 16 bits wide but could be 32 bits.
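A short sketch of the distinction; the sizes are code-unit counts, and the wstring count depends on the platform because wchar_t does:

#include <string>
#include <iostream>

int main()
{
    std::wstring   w = L"\U0001F600";   // one Unicode code point; the wchar_t encoding is implementation-defined
    std::u16string u = u"\U0001F600";   // UTF-16 by definition

    // u.size() is always 2 (a surrogate pair).
    // w.size() is 2 where wchar_t is 16 bits (Windows, holding UTF-16)
    // and 1 where it is 32 bits (typical Unix systems).
    std::cout << w.size() << ' ' << u.size() << '\n';
}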