UTF-16 codecvt facet - C++

Extending from this question about locales,
and as described in that question: what I really want to do is install a codecvt facet into the locale that understands UTF-16 files.
I could write my own, but I am not a UTF expert, so I am sure I would get it only nearly correct and it would break at the most inconvenient time. So I was wondering: are there any resources (on the web) of pre-built codecvt (or other) facets, usable from C++, that are peer reviewed and tested?
The reason is that the default locale (on my system, Mac OS X 10.6) just converts one byte to one wchar_t when reading a file, with no real conversion. Thus UTF-16 encoded files are converted into wstrings that contain lots of null ('\0') characters.

I'm not sure if by "resources on the Web" you meant available free of cost, but there is the Dinkumware Conversions Library that sounds like it will fit your needs—provided that the library can be integrated into your compiler suite.
The codecvt types are described in the section Code Conversions.

As of C++11, there are additional standard codecvt specialisations and types, intended for converting between various UTF-x and UCSx character sequences; one of these may suit your needs.
In <locale>:
std::codecvt<char16_t, char, std::mbstate_t>: Converts between UTF-16 and UTF-8.
std::codecvt<char32_t, char, std::mbstate_t>: Converts between UTF-32 and UTF-8.
In <codecvt>:
std::codecvt_utf8_utf16<typename Elem>: Converts between UTF-8 and UTF-16, where UTF-16 code units are stored in the specified Elem (note that if char32_t is specified, only one UTF-16 code unit will be stored per char32_t, not a full code point).
Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
std::codecvt_utf8<typename Elem>: Converts between UTF-8 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
std::codecvt_utf16<typename Elem>: Converts between UTF-16 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
codecvt_utf8 and codecvt_utf16 will convert between the specified UTF and either UCS2 or UCS4, depending on the size of Elem. Therefore, wchar_t will specify UCS2 on systems where it's 16- to 31-bit (such as Windows, where it's 16-bit), or UCS4 on systems where it's at least 32-bit (such as Linux, where it's 32-bit), regardless of whether wchar_t strings actually use that encoding; on platforms that use different encodings for wchar_t strings, this will understandably cause problems if you aren't careful.
For more information, see cppreference.com:
std::codecvt
std::codecvt_utf8
std::codecvt_utf16
std::codecvt_utf8_utf16
Note that support for the <codecvt> header was only added to libstdc++ relatively recently; if you are using an older version of GCC (or Clang with libstdc++), you may have to switch to libc++ to use it.
Note that versions of Visual Studio prior to 2015 don't actually support char16_t and char32_t; where these types exist in earlier versions, it is only as typedefs for unsigned short and unsigned int, respectively. Also note that older versions of Visual Studio can sometimes have trouble converting strings between UTF encodings, and that Visual Studio 2015 has a glitch that prevents codecvt from working properly with char16_t and char32_t, requiring the use of same-sized integral types instead.
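As a concrete illustration of the original question (reading a UTF-16 file into a wstring), here is a minimal sketch using codecvt_utf16 imbued into a wifstream. The file name and the consume_header/little_endian flags are assumptions about the input, and note that codecvt_utf16 is deprecated as of C++17:
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream in("utf16.txt", std::ios::binary);  // hypothetical file name
    // Install a facet that decodes UTF-16 bytes (skipping a BOM if present,
    // little-endian otherwise) into wchar_t, instead of the default 1-byte-to-1-wchar_t mapping.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::consume_header | std::little_endian)>));
    std::wstring line;
    while (std::getline(in, line)) {
        // use line...
    }
}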

Related

Transform byte array to string while supporting different encodings

Let's say I have read the binary content of a text file into a std::vector<std::uint8_t> and I want to transform these bytes into a string representation.
As long as the bytes are encoded using a single-byte encoding (ASCII for example), a transformation to std::string is pretty straightforward:
std::string transformToString(std::vector<std::uint8_t> bytes)
{
    std::string str;
    str.assign(
        reinterpret_cast<std::string::value_type*>(const_cast<std::uint8_t*>(bytes.data())),
        bytes.size() / sizeof(std::string::value_type)
    );
    return str;
}
As soon as the bytes are encoded in some unicode format, things get a little bit more complicated.
As far as I know, C++ supports additional string types for unicode strings. These are std::u8string for UTF-8, std::u16string for UTF-16 and std::u32string for UTF-32.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?
Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
Problem 3: Are the previously listed types of Unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string instantiated with char8_t, char16_t and char32_t, I assume that processing of a byte order mark is not supported.
Clarification: Please note, that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice-versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must be aware of the encoding of the bytes to get the correct representation. For example:
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation was correct, the string is ABC.
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.
Your transformToString is (more or less) correct only if uint8_t is unsigned char, which however is the case on every platform I know.
It is unnecessary to do the multiple casts you are doing. The whole cast sequence is not an aliasing violation only if you are casting from unsigned char* to char* (and char is always the value type of std::string). In particular there is no const involved. I also say "more or less", because while this is probably supposed to work specifically when casting between signed/unsigned variants of the same element type, the standard currently doesn't actually specify the pointer arithmetic on the resulting pointer (which I guess is a defect).
However there is a much safer way that doesn't involve dangerous casts or potential for length mismatch:
str.assign(std::begin(bytes), std::end(bytes));
You can use exactly the same line as above to convert to any other std::basic_string specialization, but the important point is that it will simply copy individual bytes as individual code units, not considering encoding or endianness in any way.
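For illustration, a sketch of that copying approach for std::string and (with C++20) std::u8string; the function names here are made up:
#include <cstdint>
#include <string>
#include <vector>

std::string bytesToString(const std::vector<std::uint8_t>& bytes)
{
    // Each byte becomes one code unit; no casts, no aliasing concerns,
    // and the length always matches the number of bytes.
    return std::string(std::begin(bytes), std::end(bytes));
}

std::u8string bytesToU8String(const std::vector<std::uint8_t>& bytes)  // C++20
{
    return std::u8string(std::begin(bytes), std::end(bytes));
}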
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?
You create the string with exactly the same line I showed above. In this case your original approach (with the casts) would be wrong if you just replaced str's type, because char8_t cannot alias unsigned char, so the cast would be an aliasing violation resulting in undefined behavior.
A std::u8string holds a sequence of UTF-8 code units (by convention). To get individual code points you can convert to UTF-32. There is std::mbrtoc32 from the C standard library, which relies on the C locale being set as UTF-8 (and requires conversion back to a char array first) and there is codecvt_utf8<char32_t> from the C++ library, which is however deprecated and no replacement has been decided on yet.
There are no functions in the standard library that actually interpret the sequence of code units in u8string as code points. (e.g. .size() is the number of code units, not the number of code points).
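As a rough sketch of the std::mbrtoc32 route mentioned above (assuming the environment's C locale is UTF-8; the function name is made up and error handling is minimal):
#include <clocale>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

std::u32string codePointsFromUtf8(const std::string& utf8)
{
    std::setlocale(LC_ALL, "");                   // assumption: the environment locale is UTF-8
    std::u32string out;
    std::mbstate_t state{};
    const char* p = utf8.data();
    const char* end = p + utf8.size();
    while (p < end) {
        char32_t c32;
        std::size_t rc = std::mbrtoc32(&c32, p, end - p, &state);
        if (rc == (std::size_t)-1 || rc == (std::size_t)-2)
            break;                                // invalid or truncated sequence
        if (rc == (std::size_t)-3) {              // code point produced without consuming bytes
            out.push_back(c32);
            continue;
        }
        out.push_back(c32);
        p += (rc == 0) ? 1 : rc;                  // rc == 0 means an embedded null was decoded
    }
    return out;
}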
Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
There is nothing like that directly in the standard library. A u16string holds 16-bit code units of type char16_t as values. What endianness, or in general what representation, is used for this type is an implementation detail, but you can expect it to be equal to that of other basic types. Since C++20 there is std::endian to indicate the endianness of all scalar types (if applicable) and, since C++23, std::byteswap, which can be used to swap byte order if it doesn't match the source endianness. However, you would need to manually iterate over the vector and form char16_t values from pairs of bytes with bitwise operations anyway, so I am not sure whether this is all that helpful.
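A minimal sketch of that manual approach; the function name and the byte_order enum are made up to mirror the interface asked for in the question:
#include <cstdint>
#include <string>
#include <vector>

enum class byte_order { big_endian, little_endian };

std::u16string u16FromBytes(const std::vector<std::uint8_t>& bytes, byte_order order)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Combine two bytes into one 16-bit code unit in the requested order.
        char16_t unit = (order == byte_order::big_endian)
            ? static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1])
            : static_cast<char16_t>((bytes[i + 1] << 8) | bytes[i]);
        out.push_back(unit);
    }
    return out;
}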
All of the above assumes that the original data is actually UTF-16 encoded. If that is not the case you need to convert from the original encoding to UTF-16 for which there are equivalent functions as in the UTF-32 case mentioned above.
Problem 3: Are the previously listed types of Unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string instantiated with char8_t, char16_t and char32_t, I assume that processing of a byte order mark is not supported.
The types simply store sequences of code units. They do not care what they represent (e.g. whether they represent a BOM). Because they store code units, not bytes, the BOM wouldn't have any meaning in processing them anyway.
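If the raw bytes may start with a UTF-16 BOM, it has to be detected and stripped before decoding. A small sketch (the function name is made up, and the byte_order enum is the same hypothetical one as in the sketch above):
#include <cstdint>
#include <optional>
#include <vector>

enum class byte_order { big_endian, little_endian };   // same hypothetical enum as above

// Returns the byte order indicated by a UTF-16 BOM, if one is present; the caller
// would then skip the first two bytes before forming char16_t code units.
std::optional<byte_order> detectUtf16Bom(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return byte_order::big_endian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return byte_order::little_endian;
    }
    return std::nullopt;
}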

How does conversion between char and wchar_t work in Windows?

In Windows there are the functions like mbstowcs to convert between char and wchar_t. There are also C++ functions such as from_bytes<std::codecvt<wchar_t, char, std::mbstate_t>> to use.
But how does this work behind the scenes, as char and wchar_t are obviously of different sizes? I assume the system codepage is involved in some way? But what happens if a wchar_t can't be correlated to a char (it can, after all, contain a lot more values)?
Also what happens if code that has to use char (maybe due to a library) is moved between computers with different codepages? Say that it is only using numbers (0-9) which are well within the range of ASCII, would that always be safe?
And finally, what happens on computers where the local language can't be represented in 256 characters? In that case the concept of char seems completely irrelevant other than for storing, for example, UTF-8.
It all depends on the cvt facet used, as described here
In your case (std::codecvt<wchar_t, char, std::mbstate_t>), it all boils down to mbsrtowcs / wcsrtombs using the global locale (that is, the "C" locale, if you don't replace it with the system one).
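For example, a rough sketch of what that amounts to when done by hand with the C API (the helper name is made up and error handling is minimal):
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

std::wstring widen(const std::string& narrow)
{
    std::setlocale(LC_ALL, "");                    // use the user's locale rather than "C"
    std::mbstate_t state{};
    const char* src = narrow.c_str();
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);   // dry run: count wide chars
    if (len == static_cast<std::size_t>(-1))
        return std::wstring();                     // byte sequence invalid in this locale
    std::vector<wchar_t> buf(len + 1);
    src = narrow.c_str();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), len);
}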
I don't know about mbstowcs(), but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter operates in terms of two types:
A character type cT which is in your code wchar_t.
A byte type bT which is normally char.
The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state, so any state carried between calls needs to be kept somewhere. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating your own code conversion facets.
Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF-8, and characters. Originally, each character was meant to be a stand-alone entity, but various reasons (primarily from outside the C++ community, notably from changes made to Unicode) have resulted in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF-8 for char and UTF-16 or UCS4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).
The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF-8, the incoming bytes are converted to 32-bit code points, which are then stored as UTF-16 characters by splitting them into two wchar_t (a surrogate pair) when necessary (e.g., when wchar_t is 16 bits).
The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.
If what is in the char array is actually a UTF-8 encoded string, then you can convert it to and from a UTF-16 encoded wchar_t array using
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
as described in more detail at https://stackoverflow.com/a/18597384/6345

char vs wchar_t vs char16_t vs char32_t (c++11)

From what I understand, a char is safe to house ASCII characters whereas char16_t and char32_t are safe to house characters from Unicode, one for the 16-bit variety and another for the 32-bit variety (should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee of being bigger than char? Clarification would be nice!
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
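For example (illustrative only; note that the type of a u8 literal changed from const char* to const char8_t* in C++20):
auto utf8  = u8"π";   // UTF-8:  2 code units, 0xCF 0x80
auto utf16 = u"π";    // UTF-16: 1 code unit,  0x03C0
auto utf32 = U"π";    // UTF-32: 1 code unit,  0x000003C0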
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that then you could manipulate wchar_t strings just like the simple algorithms used with ASCII.
For example, converting ascii to upper case goes like:
auto loc = std::locale("");
char s[] = "hello";
for (char &c : s) {
    c = toupper(c, loc);
}
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
auto loc = std::locale("");
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
    c = toupper(c, loc);
}
So every wchar_t is a 'character', and if it has an uppercase version then it can be directly converted. Unfortunately this doesn't really work all the time; for example, there exist oddities in some languages, such as the German letter ß, where the uppercase version is actually the two characters SS instead of a single character.
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
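As a sketch of that "convert at the API boundary" approach on Windows, using the Win32 function MultiByteToWideChar with CP_UTF8 (the helper name is made up and error handling is omitted):
#include <windows.h>
#include <string>

std::wstring utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the required length, second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}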
The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor has chosen to make it 16 bits. Since Unicode uses more than 16 bits (21 bits are needed to cover all code points), it was felt that we should have better character types.
The intent is for char16_t to represent UTF-16 and for char32_t to directly represent Unicode code points. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and getting them is more work than just instantiating the stream templates with these types.

Size of wchar_t for unicode encoding

Is there a 32-bit wide character for encoding UTF-32 strings? I'd like to do it via std::wstring, which apparently shows me that the size of a wide character is 16 bits on the Windows platform.
You won't be able to do it with std::wstring on many platforms because it will have 16 bit elements.
Instead you should use std::basic_string<char32_t>, but this requires a compiler with some C++0x support.
The size of wchar_t is platform-dependent and it is independent of UTF-8, UTF-16, and UTF-32 (it can be used to represent unicode data, but there is nothing that says that it represents that).
I strongly recommend using UTF-8 with std::string for internal string representation, and using established libraries such as ICU for complex manipulation and conversion tasks involving unicode.
Just use typedef!
It would look something like this:
typedef std::uint32_t char_32;   // from <cstdint>; a type guaranteed to be 32 bits
And use it like this:
char_32 myChar;
or as a C-style string (a cast is needed, since string literals are never arrays of char_32):
const char_32* string_of_32_bit_char = reinterpret_cast<const char_32*>(U"Hello World");
The modern answer for this is to use char32_t (C++11), which can be used with std::u32string. However, in reality, you should just use std::string with an encoding like UTF-8. Note that before char32_t, the old answer would have been to use templates or macros to determine which unsigned integral type has a size of 4 bytes, and to use that.
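A tiny illustration of the C++11 route:
#include <string>

std::u32string s = U"Hello World";   // 11 char32_t code units, one per code point here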

conflicts: definition of wchar_t string in C++ standard and Windows implementation?

From c++2003 2.13
A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below
The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L’\0’.
From c++0x 2.14.5
A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.
The statement in C++2003 is quite vague. But in C++0x, when counting the length of the string, a wide (wchar_t) string literal is to be treated the same as a char32_t literal, and differently from a char16_t literal.
There's a post that states clearly how Windows implements wchar_t at https://stackoverflow.com/questions/402283?tab=votes%23tab-top
In short, wchar_t on Windows is 16 bits and encoded using UTF-16. This apparently leaves the standard's statement in conflict with the Windows implementation.
for example,
wchar_t kk[] = L"\U000E0005";
This code point does not fit in 16 bits, so UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.
However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the \0).
But in its internal storage, Windows needs three 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair and 1 wchar_t for the \0. Therefore, by the array's definition, kk is an array of 3 wchar_t.
These two results apparently conflict with each other.
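The conflict can be observed directly; a minimal sketch (prints 3 with a 16-bit wchar_t such as MSVC's, 2 with a 32-bit wchar_t such as on Linux or macOS):
#include <iostream>

int main()
{
    wchar_t kk[] = L"\U000E0005";
    // Number of wchar_t elements in the literal, including the terminating L'\0'.
    std::cout << sizeof kk / sizeof(wchar_t) << '\n';
}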
I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t (that is, to "ban" any Unicode outside the BMP).
Is there anything wrong with my understanding?
Thanks.
The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.
Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.
The answer you linked to is somewhat misleading as well:
On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes
The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage, and it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever.)
If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.