Transform byte array to string while supporting different encodings - C++

Let's say I have read the binary content of a text file into a std::vector<std::uint8_t> and I want to transform these bytes into a string representation.
As long as the bytes are encoded using a single-byte encoding (ASCII for example), a transformation to std::string is pretty straightforward:
std::string transformToString(std::vector<std::uint8_t> bytes)
{
    std::string str;
    str.assign(
        reinterpret_cast<std::string::value_type*>(const_cast<std::uint8_t*>(bytes.data())),
        bytes.size() / sizeof(std::string::value_type)
    );
    return str;
}
As soon as the bytes are encoded in some unicode format, things get a little bit more complicated.
As far as I know, C++ supports additional string types for unicode strings. These are std::u8string for UTF-8, std::u16string for UTF-16 and std::u32string for UTF-32.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?
Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
Problem 3: Are the previously listed types of unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string templated on char8_t, char16_t and char32_t, I assume, that processing of a byte order mark is not supported.
Clarification: Please note, that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice-versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must be aware of the encoding of the bytes to get the correct representation. For example:
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation was correct, the string is ABC.
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.

Your transformToString is (more or less) correct only if uint8_t is unsigned char, which however is the case on every platform I know.
It is unnecessary to do the multiple casts you are doing. The whole cast sequence is not an aliasing violation only because you are casting from unsigned char* to char* (and char is always the value type of std::string). In particular, there is no const involved, so the const_cast is pointless. I also say "more or less" because, while this is probably supposed to work specifically when casting between signed/unsigned variants of the same element type, the standard currently doesn't actually specify the pointer arithmetic on the resulting pointer (which I guess is a defect).
However there is a much safer way that doesn't involve dangerous casts or potential for length mismatch:
str.assign(std::begin(bytes), std::end(bytes));
You can use exactly the same line as above to convert to any other std::basic_string specialization, but the important point is that it will simply copy individual bytes as individual code units, not considering encoding or endianness in any way.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?
You create the string exactly with the same line I showed above. In this case your approach would be wrong if you just replace str's type because char8_t cannot alias unsigned char and would therefore be an aliasing violation resulting in undefined behavior.
A std::u8string holds a sequence of UTF-8 code units (by convention). To get individual code points you can convert to UTF-32. There is std::mbrtoc32 from the C standard library, which relies on the C locale being set as UTF-8 (and requires conversion back to a char array first) and there is codecvt_utf8<char32_t> from the C++ library, which is however deprecated and no replacement has been decided on yet.
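A hedged sketch of the std::mbrtoc32 route (the locale names tried below are assumptions; availability is platform-dependent, and no such setup is needed for the hand-rolled approaches):

```cpp
#include <clocale>
#include <cstddef>
#include <cuchar>
#include <string>

// Sketch: decode a UTF-8 string to code points with std::mbrtoc32.
// Assumption: one of the locale names below exists and uses UTF-8.
std::u32string toCodePoints(const std::string& utf8)
{
    if (!std::setlocale(LC_ALL, "C.UTF-8"))
        std::setlocale(LC_ALL, "en_US.UTF-8");

    std::u32string out;
    std::mbstate_t state{};
    const char* p = utf8.data();
    const char* end = p + utf8.size();
    char32_t c32;
    while (p < end) {
        std::size_t rc = std::mbrtoc32(&c32, p, end - p, &state);
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2))
            break;                              // invalid or truncated sequence
        if (rc == static_cast<std::size_t>(-3)) // produced from stored state,
        {                                       // no bytes consumed
            out.push_back(c32);
            continue;
        }
        out.push_back(c32);
        p += (rc == 0 ? 1 : rc);                // rc == 0 means a null character
    }
    return out;
}
```

The resulting u32string has one element per code point, so its size() is the code point count.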
There are no functions in the standard library that actually interpret the sequence of code units in u8string as code points. (e.g. .size() is the number of code units, not the number of code points).
Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
There is nothing like that directly in the standard library. A u16string holds 16-bit code units of type char16_t as values. What endianness, or in general what representation, is used for this type is an implementation detail, but you can expect it to be equal to that of other basic types. Since C++20 there is std::endian to indicate the endianness of all scalar types (if applicable), and since C++23 std::byteswap, which can be used to swap byte order if it doesn't match the source endianness. However, you would need to manually iterate over the vector and form char16_ts from pairs of bytes by bitwise operations anyway, so I am not sure whether this is all that helpful.
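Such a manual loop could be sketched as follows (the byte_order enum and the function name are invented for illustration; they are not standard):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a from_bytes-style helper: each pair of bytes is combined
// into one char16_t according to the stated source byte order.
enum class byte_order { big_endian, little_endian };

std::u16string u16FromBytes(const std::vector<std::uint8_t>& bytes,
                            byte_order order)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned hi = (order == byte_order::big_endian) ? bytes[i] : bytes[i + 1];
        const unsigned lo = (order == byte_order::big_endian) ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Note that this ignores any BOM, does not validate surrogate pairs, and silently drops a trailing byte if the input length is odd.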
All of the above assumes that the original data is actually UTF-16 encoded. If that is not the case you need to convert from the original encoding to UTF-16 for which there are equivalent functions as in the UTF-32 case mentioned above.
Problem 3: Are the previously listed types of unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string templated on char8_t, char16_t and char32_t, I assume, that processing of a byte order mark is not supported.
The types simply store sequences of code units. They do not care what they represent (e.g. whether they represent a BOM). Because they store code units, not bytes, the BOM wouldn't have any meaning in processing them anyway.
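If you do want to honor a BOM, you have to inspect the raw bytes yourself before constructing the string; a minimal sketch (names invented for illustration):

```cpp
#include <cstdint>
#include <vector>

// Sketch: BOM detection for UTF-16 has to happen on the raw bytes,
// before any string type is constructed.
enum class detected_order { unknown, big_endian, little_endian };

detected_order detectUtf16Bom(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            return detected_order::big_endian;    // U+FEFF stored as FE FF
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            return detected_order::little_endian; // U+FEFF stored as FF FE
    }
    return detected_order::unknown;
}
```

If a BOM is found, skip those two bytes before forming code units, so the BOM does not end up in the string.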

Related

Does conversion between UTF-8 and UTF-16 require an intermediate code point (e.g. UTF-8 -> code point -> UTF-16), or can it be done directly (e.g. UTF-8 -> UTF-16)?

I have an inquiry about the conversion between UTF-8 and UTF-16: does it require returning the UTF-8/16 sequence to its original code point first and then converting to the target encoding, or is it possible to convert from one encoding to the other directly, e.g. UTF-16 to UTF-8 or vice versa?
For example, I have the character س, whose UTF-8 encoding is 0xD8 0xB3. Is it required to decode that to its code point U+0633 first and then encode it again to UTF-16 as 0x0633?
If your UTF-8 code is less than 128, then you can immediately generate the UTF-16 equivalent. In a very real sense, though, you have decoded the entire UTF-8 character to its codepoint and re-encoded it in UTF-16. So we'd just be debating semantics as to whether that's going directly to the other encoding or not.
UTF-8 encodings up to three bytes have to be completely decoded, and the UTF-16 encoding is just that decoded value as two bytes. So have you actually re-encoded it into UTF-16 or did you convert to UTF-16 directly? It's really just a point of view.
The most complicated version is when the UTF-8 encoding is four bytes, since those represent codepoints beyond the BMP, so the UTF-16 encoding would be a surrogate pair. I don't think there's any computational shortcut to be taken there. If there were, it probably wouldn't be worth it. Such shortcuts could actually run slower on modern processors since you'd need extra conditional branch instructions, which could thwart branch prediction and pipelining.
I think you can make roughly the same argument in the reverse direction as well.
So I'm going to say, yes, you do have to convert to the actual codepoint when transcoding between UTF-8 and UTF-16.
UTF-8's decoding algorithm works like this. You do up to 3 conditional tests against the first byte to figure out how many bytes to process, and then you process that number of bytes into a codepoint.
UTF-16's encoding algorithm works by taking the code point and checking whether it is larger than 0xFFFF. If so, you encode it into a surrogate pair of two 16-bit code units; otherwise, you encode it into a single 16-bit code unit.
Here's the thing though. Every codepoint larger than 0xFFFF is encoded in UTF-8 by 4 code units, and every codepoint 0xFFFF or smaller is encoded by 3 code units or less. Therefore, if you did UTF-8 decoding to produce the codepoint... you don't have to do the conditional test in the UTF-16 encoding algorithm. Based on how you decoded the UTF-8 sequence, you already know if the codepoint needs 1 16-bit code unit or two.
Therefore, in theory, a full UTF-8->utf-16 hand-coded algorithm could involve one less conditional test than using a direct codepoint intermediate. But really, that's the only difference. Even for 4-byte UTF-8 sequences, you have to extract the UTF-8 value into a full 32-bit codepoint before you can do the surrogate pair encoding. So the only real efficiency gain possible is the lack of the condition.
For UTF-16->UTF-8, you know that any surrogate pair encoding requires 4 bytes in UTF-8, and any non-surrogate pair encoding requires 3 or less. And you have to do that test before decoding UTF-16 anyway. But you still basically have to do all of the work to convert the UTF-16 to a codepoint before the UTF-8 encoder can do its job (even if that work is nothing, as is the case for non-surrogate pairs). So again, the only efficiency gain is from losing one conditional test.
These sound like micro-optimizations. If you do a lot of such conversions, and they're performance-critical, it might be worthwhile to hand-code a converter. Maybe.
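To make the decode-then-encode path concrete, here is a sketch for a single UTF-8 sequence (assumes well-formed input; no validation; the function name is mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: decode one (assumed well-formed) UTF-8 sequence to its code
// point, then encode that code point as UTF-16, emitting a surrogate
// pair when it lies beyond the BMP.
std::vector<char16_t> utf8ToUtf16One(const std::vector<std::uint8_t>& in)
{
    std::uint32_t cp;
    std::size_t len;
    const std::uint8_t b = in[0];
    if      (b < 0x80) { cp = b;        len = 1; }  // 0xxxxxxx
    else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
    else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
    else               { cp = b & 0x07; len = 4; }  // 11110xxx
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (in[i] & 0x3F);            // 10xxxxxx continuations

    if (cp <= 0xFFFF)
        return { static_cast<char16_t>(cp) };
    cp -= 0x10000;                                   // 20 bits remain
    return { static_cast<char16_t>(0xD800 | (cp >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 | (cp & 0x3FF)) }; // low surrogate
}
```

Note how the intermediate code point cp is unavoidable in the 4-byte case: the surrogate split operates on the full value.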
Try the top answer to this question:
How to convert UTF-8 std::string to UTF-16 std::wstring?
Ignore the "C++11" answer as the STL calls made are deprecated.
The easier way is to decode to code points and then encode with the wanted encoding. In this manner you manage surrogates and special escapes (which are not really UTF-8, but are sometimes used, e.g. to include code point U+0000 in an ASCIIZ/C-string).
If you write down UTF-8 <-> code point in bit form (and the same for UTF-16; Wikipedia helps), you see that the bits keep their values, so you can just move bits in a direct conversion, without passing through a code point (and so without an intermediate variable). It is just shifts and masks (plus an addition/subtraction for UTF-16). I would not do it unless it is a very performance-critical task.

Why can u8'A' be of type char when UTF-8 can take up to 4 bytes and char is normally 1 byte?

I was reading What is the use of wchar_t in general programming? and found something confusing in the accepted answer:
It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
And I find this from my textbook:
Isn't Unicode encoded with UTF-8 up to 4 bytes per character? char on most platforms is 1 byte. Do I misunderstand something?
Update:
After searching and reading, now I know that:
Code points and code units are different things. A code point is unique, while code units depend on the encoding.
u8'a' (a char, not a string here) is only allowed for the basic character set (ASCII and its control characters), and its value is the code unit value of 'a'; for ASCII characters, code units have the same value as code points. (This is what codekaizer's answer says.)
std::string::size() returns the number of code units.
So editors all deal with code units, right? And if I change my file encoding from UTF-8 to UTF-32, then the size of ə would be 4?
Isn't Unicode encoded with UTF-8 at most 4 bytes per character?
As per lex.ccon/3, emphasis mine:
A character literal that begins with u8, such as u8'w', is a character
literal of type char, known as a UTF-8 character literal. The value of
a UTF-8 character literal is equal to its ISO 10646 code point value,
provided that the code point value is representable with a single
UTF-8 code unit (that is, provided it is in the C0 Controls and Basic
Latin Unicode block). If the value is not representable with a single
UTF-8 code unit, the program is ill-formed. A UTF-8 character literal
containing multiple c-chars is ill-formed.
A single UTF-8 code unit is 1 byte.
You are confusing code points with code units.
In UTF-8 each code unit (≈ data type used by a particular encoding) is one byte (8 bit), so it can be represented in a C++ program by the char type (which the standard guarantees to be at least 8 bit).
Now, of course you cannot represent all Unicode code points (≈ character/glyph) in just a single code unit if it is so small - they are currently well over 1 million, while a byte can have only 256 distinct values. For this reason, UTF-8 uses more code units to represent a single code point (and, to save space and for compatibility, uses a variable length encoding). So, the 😀 code point (U+1F600) will be mapped to 4 code units (f0 9f 98 80).
Most importantly, C++ almost everywhere is concerned just with code units - strings are treated mostly as opaque binary blobs (with the exception of the 0 byte for C strings). For example, strlen and std::string::size() will all report you the number of code units, not of code points.
The u8 cited above is one of the rare exceptions. It's an indication to the compiler that the string enclosed in the literal must be mapped from whatever the encoding the compiler is using to read the source file to an UTF-8 string.
UTF-* is a family of variable-length encodings. In UTF-8, for instance, the minimal size is indeed 1 byte, but some characters require more. These encodings have two advantages:
Compatibility with widespread characters types such as char
Minimal size when the text contains mostly English characters (which occupy 1 byte).
On the downside, variable-length encodings require more work for some operations, e.g. calculating the number of characters in a given string. Since each character can occupy a different number of bytes, you can't just look at the string size (in bytes).
Given that, if you're going to use a variable-length encoding, it usually makes sense to use the most compressed one, which is UTF-8 (under the assumption that your text indeed contains mostly English characters). OTOH, if your text contains a wide range of languages, which would make UTF-8 inefficient, you can opt for the fixed-size Unicode representations. In such cases, you'll need wider character types - 2 or 4 bytes.
The character set is not restricted to the ASCII table, whose entries fit in 1 byte. Characters from other languages, e.g. Japanese, don't reside in the ASCII table, so up to 4 bytes per character are used for those.
In C++ we assume that our character is in the ASCII table, so we give it a size of 1 byte.

How does conversion between char and wchar_t work in Windows?

In Windows there are the functions like mbstowcs to convert between char and wchar_t. There are also C++ functions such as from_bytes<std::codecvt<wchar_t, char, std::mbstate_t>> to use.
But how does this work behind the scenes, as char and wchar_t are obviously of different size? I assume the system codepage is involved in some way? But what happens if a wchar_t can't be correlated to a char (it can after all contain a lot more values)?
Also what happens if code that has to use char (maybe due to a library) is moved between computers with different codepages? Say that it is only using numbers (0-9) which are well within the range of ASCII, would that always be safe?
And finally, what happens on computers where the local language can't be represented in 256 characters? In that case the concept of char seems completely irrelevant other than for storing for example utf8.
It all depends on the cvt facet used, as described here
In your case, (std::codecvt<wchar_t, char, std::mbstate_t>) it all boils down to mbsrtowcs / wcsrtombs using the global locale. (that is the "C" locale, if you don't replace it with the system one)
I don't know about mbstowcs() but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter travels in terms of two types:
A character type cT which is in your code wchar_t.
A byte type bT which is normally char.
The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state and any state between calls needs to be obtained somehow. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating own code conversion facets.
Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF8, and characters. Originally, each character was meant to be a stand-alone entity, but various reasons (primarily from outside the C++ community, notably changes made to Unicode) have resulted in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF8 for char and UTF16 or UCS4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).
The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF8, the incoming bytes are converted to 32-bit code points, which are then stuck into UTF16 characters by splitting them up into two wchar_t when necessary (e.g., when wchar_t is 16 bits).
The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.
If what is in the char array is actually a UTF-8 encoded string, then you can convert it to and from a UTF-16 encoded wchar_t array using
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
as described in more detail at https://stackoverflow.com/a/18597384/6345

char vs wchar_t vs char16_t vs char32_t (c++11)

From what I understand, a char is safe to house ASCII characters whereas char16_t and char32_t are safe to house characters from unicode, one for the 16-bit variety and another for the 32-bit variety (Should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind the wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee to be bigger than a char? Clarification would be nice!
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that then you could manipulate wchar_t strings just like the simple algorithms used with ASCII.
For example, converting ascii to upper case goes like:
auto loc = std::locale("");
char s[] = "hello";
for (char &c : s) {
    c = toupper(c, loc);
}
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
auto loc = std::locale("");
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
    c = toupper(c, loc);
}
So every wchar_t is a 'character', and if it has an uppercase version then it can be directly converted. Unfortunately this doesn't really work all the time; for example, there are oddities in some languages, such as the German letter ß, whose uppercase version is actually the two characters SS instead of a single character.
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode uses more than 16 bits (code points go up to U+10FFFF, which requires 21 bits), it was felt that we should have better character types.
The intent for char16_t is to represent UTF16, and char32_t is meant to directly represent Unicode characters. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it is more work than just instantiating the stream for these types.

How do I use 3 and 4-byte Unicode characters with standard C++ strings?

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF. std::string uses char, so it can store 1-byte characters only. std::wstring uses wchar_t, so it can store characters up to 2-byte width. This is what I know about strings in C++. Please correct me if I said anything wrong up to this point.
I read the article for UTF-8 in Wikipedia, and I learned that some Unicode characters consume up to 4-byte space. For example, the Chinese character 𤭢 has a Unicode code point 0x24B62, which consumes 3-byte space in the memory.
Is there an STL string container for dealing with these kind of characters? I'm looking for something like std::string32. Also, we had main() for ASCII entry point, wmain() for entry point with 16-bit character support; what entry point do we use for 3 and 4-byte Unicode supported code?
Can you please add a tiny example?
(My OS: Windows 7 x64)
First you need a better understanding of Unicode. Specific answers to your questions are at the bottom.
Concepts
You need a more nuanced set of concepts than are required for very simple text handling as taught in introductory programming courses.
byte
code unit
code point
abstract character
user-perceived character
A byte is the smallest addressable unit of memory. Usually 8 bits today, capable of storing up to 256 different values. By definition a char is one byte.
A code unit is the smallest fixed size unit of data used in storing text. When you don't really care about the content of text and you just want to copy it somewhere or calculate how much memory the text is using then you care about code units. Otherwise code units aren't much use.
A code point represents a distinct member of a character set. Whatever 'characters' are in a character set, they all are assigned a unique number, and whenever you see a particular number encoded then you know which member of the character set you're dealing with.
An abstract character is an entity with meaning in a linguistic system, and is distinct from its representation or any code points assigned to that meaning.
User perceived characters are what they sound like; what the user thinks of as a character in whatever linguistic system he's using.
In the old days, char represented all of these things: a char is by definition a byte, in char* strings the code units are chars, the character sets were small so the 256 values representable by char was plenty to represent every member, and the linguistic systems that were supported were simple, so the members of the character sets mostly represented the characters users wanted to use directly.
But this simple system with char representing pretty much everything wasn't enough to support more complex systems.
The first problem encountered was that some languages use far more than 256 characters. So 'wide' characters were introduced. Wide characters still used a single type to represent four of the above concepts, code units, code points, abstract characters, and user perceived characters. However wide characters are no longer single bytes. This was thought to be the simplest method of supporting large character sets.
Code could mostly be the same, except it would deal with wide characters instead of char.
However it turns out that many linguistic systems aren't that simple. In some systems it makes sense not to have every user-perceived character necessarily be represented by a single abstract character in the character set. As a result text using the Unicode character set sometimes represents user perceived characters using multiple abstract characters, or uses a single abstract character to represent multiple user-perceived characters.
Wide characters have another problem. Since they increase the size of the code unit they increase the space used for every character. If one wishes to deal with text that could adequately be represented by single byte code units, but must use a system of wide characters then the amount of memory used is higher than would be the case for single byte code units. As such, it was desired that wide characters not be too wide. At the same time wide characters need to be wide enough to provide a unique value for every member of the character set.
Unicode currently contains about 100,000 abstract characters. This turns out to require wide characters wider than most people care to use. As a result, a system of wide characters, where code units larger than one byte are used to directly store code point values, turns out to be undesirable.
So to summarize, originally there was no need to distinguish between bytes, code units, code points, abstract characters, and user perceived characters. Over time, however, it became necessary to distinguish between each of these concepts.
Encodings
Prior to the above, text data was simple to store. Every user-perceived character corresponded to an abstract character, which had a code point value. There were few enough characters that 256 values was plenty. So one simply stored the code point numbers corresponding to the desired user-perceived characters directly as bytes. Later, with wide characters, the values corresponding to user-perceived characters were stored directly as integers of larger sizes, 16 bits, for example.
But since storing Unicode text this way would use more memory than people are willing to spend (three or four bytes for every character) Unicode 'encodings' store text not by storing the code point values directly, but by using a reversible function to compute some number of code unit values to store for each code point.
The UTF-8 encoding, for example, can take the most commonly used Unicode code points and represent them using a single one-byte code unit. Less common code points are stored using two one-byte code units. Code points that are still less common are stored using three or four code units.
This means that common text can generally be stored with the UTF-8 encoding using less memory than 16-bit wide character schemes, but also that the numbers stored do not necessarily correspond directly to the code point values of abstract characters. Instead if you need to know what abstract characters are stored, you have to 'decode' the stored code units. And if you need to know the user perceived characters you have to further convert abstract characters into user perceived characters.
There are many different encodings, and in order to convert data using those encodings into abstract characters you must know the right method of decoding. The stored values are effectively meaningless if you don't know what encoding was used to convert the code point values into code units.
An important implication of encodings is that you need to know whether particular manipulations of encoded data are valid or meaningful.
For example, if you want to get the 'size' of a string, are you counting bytes, code units, abstract characters, or user-perceived characters? std::string::size() counts code units, and if you need a different count then you have to use another method.
As another example, if you split an encoded string you need to know if you're doing so in such a way that the result is still valid in that encoding and that the data's meaning hasn't unintentionally changed. For example you might split between code units that belong to the same code point, thus producing an invalid encoding. Or you might split between code points which must be combined to represent a user perceived character and thus produce data the user will see as incorrect.
Answers
Today char and wchar_t can only be considered code units. The fact that char is only one byte doesn't prevent it from representing code points that take two, three, or four bytes. You simply have to use two, three, or four chars in sequence. This is how UTF-8 was intended to work. Likewise, platforms that use two-byte wchar_t to represent UTF-16 simply use two wchar_t in a row when necessary. The actual values of char and wchar_t don't individually represent Unicode code points. They represent code unit values that result from encoding the code points. E.g. the Unicode code point U+0400 is encoded into two code units in UTF-8: 0xD0 0x80. The Unicode code point U+24B62 is similarly encoded as four code units: 0xF0 0xA4 0xAD 0xA2.
So you can use std::string to hold UTF-8 encoded data.
On Windows main() supports not just ASCII, but whatever the system char encoding is. Nowadays even Windows supports UTF-8 as the system char encoding and is no longer limited to legacy encodings. You may have to configure Windows for this though; I'm not sure if it's the default yet, and if not, hopefully it becomes the default.
You can also use a Win32 API call to directly access the UTF-16 command line parameters instead of using main()s argc and argv parameters. See GetCommandLineW() and CommandLineToArgvW.
wmain()'s argv parameter fully supports Unicode. The 16-bit code units stored in wchar_t on Windows are UTF-16 code units. The Windows API uses UTF-16 natively, so it's quite easy to work with on Windows. wmain() is non-standard though, so relying on this won't be portable.
Example:
#include <iostream>
#include <string>
int main() {
    std::string s = "CJK UNIFIED IDEOGRAPH-24B62: \U00024B62";
    std::cout << s << '\n';
    auto space = s.rfind(' ');
    std::cout << "Encoded bytes: ";
    for (auto i = space + 1, end = s.size(); i != end; ++i) {
        std::cout << std::hex << static_cast<int>(static_cast<unsigned char>(s[i])) << " ";
    }
}
If the compiler uses UTF-8 as the narrow execution encoding then s will contain UTF-8 data. If the terminal you're using to run the compiled program supports UTF-8, is configured to use it, and uses a font that supports the character 𤭢 then you should see that character printed out by this program.
On Windows I built with the /utf-8 flag cl.exe /EHsc /utf-8 tmp.cpp, and ran the command to set the console to UTF-8 chcp 65001, resulting in this program printing the correct data. Although due to lack of support from the font the character displayed as a box with a question mark. Copying the text from the console and pasting it somewhere with proper support reveals that the correct character was written.
With /utf-8 you can also write UTF-8 data directly in string literals instead of using the \Uxxxxxxxx syntax.
With GCC you can use the flag -fexec-charset=utf-8 to build this program, though it should be the default. -finput-charset=utf-8 allows you to directly write UTF-8 encoded data in your string literals.
Clang doesn't bother to support anything other than UTF-8.
Windows uses UTF-16. Any code point in the range of U+0000 to U+D7FF and U+E000 to U+FFFF will be stored directly; any outside of those ranges will be split into two 16-bit values according to the UTF-16 encoding rules.
For example, U+24B62 will be encoded as the surrogate pair 0xD852 0xDF62.
You may convert the strings to work with them any way you'd like but the Windows API will still want and deliver UTF-16 so that's probably going to be the most convenient.
The size and meaning of wchar_t is implementation-defined. On Windows it's 16 bit as you say, on Unix-like systems it's often 32 bit but not always.
For that matter, a compiler is permitted to do its own thing and pick a different size for wchar_t than what the system says -- it just won't be ABI-compatible with the rest of the system.
C++11 provides std::u32string, which is for representing strings of unicode code points. I believe that sufficiently recent Microsoft compilers include it. It's of somewhat limited use since Microsoft's system functions expect 16-bit wide characters (a.k.a. UTF-16LE), not 32-bit unicode code points (a.k.a. UTF-32, UCS-4).
You mention UTF-8, though: UTF-8 encoded data can be stored in a regular std::string. Of course since it's a variable-length encoding, you can't access unicode code points by index, you can only access the bytes by index. But you'd normally write your code not to need to access code points by index anyway, even if using u32string. Unicode code points don't correspond 1-1 with printable characters ("graphemes") because of the existence of combining marks in Unicode, so many of the little tricks you play with strings when learning to program (reversing them, searching for substrings) don't work so easily with Unicode data no matter what you store it in.
The character 𤭢 is, as you say, U+24B62. It is UTF-8 encoded as a series of four bytes, not three: F0 A4 AD A2. Translating between UTF-8 encoded data and unicode code points is effort (admittedly not a huge amount of effort and library functions will do it for you). It's best to regard "encoded data" and "unicode data" as separate things. You can use whatever representation you find most convenient right up to the point where you need to (for example) render the text to screen. At that point you need to (re-)encode it to an encoding that your output destination understands.
In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF, and wchar_t can store values between 0x0000 and 0xFFFF.
Not quite:
sizeof(char) == 1 so 1 byte per character.
sizeof(wchar_t) == ? Depends on your system
(for Unix usually 4, for Windows usually 2).
Unicode characters consume up to 4-byte space.
Not quite. Unicode is not an encoding. Unicode is a standard that defines what each code point is, and the code points are restricted to 21 bits. The first 16 bits define the character's position on a code plane, while the following 5 bits define which plane the character is on.
There are several Unicode encodings (UTF-8, UTF-16 and UTF-32 being the most common); these define how you store the characters in memory. There are practical differences between the three.
UTF-8: Great for storage and transport (as it is compact)
Bad because it is variable length
UTF-16: Horrible in nearly all regards
It is always large and it is variable length
(anything not on the BMP needs to be encoded as surrogate pairs)
UTF-32: Great for in memory representations as it is fixed size
Bad because it takes 4 bytes for each character which is usually overkill
Personally I use UTF-8 for transport and storage and UTF-32 for in memory representation of text.
char and wchar_t are not the only data types used for text strings. C++11 introduces new char16_t and char32_t data types and respective STL std::u16string and std::u32string typedefs of std::basic_string, to address the ambiguity of the wchar_t type, which has different sizes and encodings on different platforms. wchar_t is 16-bit on some platforms, suitable for UTF-16 encoding, but is 32-bit on other platforms, suitable for UTF-32 encoding instead. char16_t is specifically 16-bit and UTF-16, and char32_t is specifically 32-bit and UTF-32, on all platforms.