Conversion from std::wstring to ISO Latin-1 format in C++

I need to convert from std::wstring to ISO Latin-1. After reading several forums, I ended up confused. wstring supports the Unicode character set, in which each character is two bytes, whereas ISO Latin-1 uses just 1 byte per character. But the first 256 code points are the same for both.
1. Is ISO Latin-1 a multibyte encoding? If so, do I need to use wcstombs() to convert from wstring to ISO Latin-1?
2. Do I need to convert the input wstring to ISO Latin-1, and if so, how do I achieve that?
Please help me in understanding this.

In Windows wchar_t is 16 bits.
When there are no surrogate pairs (character represented as 2 successive wchar_t values) you know that any wchar_t value <256 is Latin-1, and otherwise not.
Surrogate pair values are easily recognizable as such because they are in a value range reserved for such.
Actually, this means that you know that any wchar_t value <256 is Latin-1, and otherwise not, regardless of surrogate pairs.
And no, Latin-1 is not a multibyte encoding. "Multibyte" refers to encodings where the number of bytes per character can vary.
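The range check described above can be sketched as follows. This is a minimal illustration, not a library routine: the helper name `to_latin1` and its bool-plus-output-parameter shape are assumptions made for this example.

```cpp
#include <string>

// Hypothetical helper (not a standard function): copies `in` to `out` as
// Latin-1 bytes. Returns false as soon as an element is 256 or above,
// because such a value has no Latin-1 representation. On failure, `out`
// holds only the characters converted so far.
bool to_latin1(const std::wstring& in, std::string& out) {
    out.clear();
    out.reserve(in.size());
    for (wchar_t wc : in) {
        // wchar_t may be a signed type; a negative value becomes a large
        // unsigned value here and is rejected by the range check.
        const unsigned long value = static_cast<unsigned long>(wc);
        if (value > 0xFF) {
            return false;
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return true;
}
```

Because surrogate values lie in the range 0xD800 to 0xDFFF, they fail the same check, so no separate surrogate handling is needed. A caller that prefers substitution over failure could push a '?' instead of returning false.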

Related

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
Suppose I want to write a program that can efficiently perform operations on Unicode characters and can read and write files in UTF-8 or UTF-16. What is the appropriate way to do this in C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant C++ code that is leak-free. I need clear answers to:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generally, UTF-8 is good for storage and communication purposes and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings instead. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendors, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for UTF-16 or UTF-32 (depending on compiler/platform).
u for UTF-16.
U for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So make sure that whatever charset you choose to save your files in, like UTF-8, that you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
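As a small illustration of the literal prefixes, the sketch below builds the same code points with the u, U, and L prefixes. The variable names are made up for this example, and the characters are written as universal character names (\u and \U escapes) so that the result does not depend on the charset the source file is saved in.

```cpp
#include <string>

// U+0105 (the Polish letter a-with-ogonek) lies inside the BMP.
const std::u16string bmp16 = u"\u0105";   // one UTF-16 code unit
const std::u32string bmp32 = U"\u0105";   // one UTF-32 code unit

// U+1F600 lies outside the BMP.
const std::u16string astral16 = u"\U0001F600";  // surrogate pair: two code units
const std::u32string astral32 = U"\U0001F600";  // still one code unit

// The L prefix yields wchar_t; its encoding and the resulting length for
// characters outside the BMP depend on the compiler/platform.
const std::wstring wide = L"\u0105";
```

The u8 prefix is left out on purpose: the element type of a u8 literal is char up to C++17 and char8_t from C++20, so a declaration that uses it is tied to one language version.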
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each string character may be a single-byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.
Just as std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogates to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF encoded codeunit. UTF-8 encodes a Unicode codepoint as 1-4 codeunits (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 codeunits (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 codeunit (1 char32_t in a std::u32string).
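To make the "1-4 codeunits" statement concrete, here is a minimal sketch of a UTF-8 encoder that appends one code point to a std::string. The name `append_utf8` is hypothetical; production code would normally rely on a tested library instead.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point `cp` to
// `out`. Returns false (and appends nothing) for surrogate values and for
// values above U+10FFFF, which are not valid Unicode scalar values.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {                      // 1 code unit: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {              // 2 code units: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {            // 3 code units
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                              // 4 code units
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}
```

After appending U+00E4, for example, the string's size() grows by 2, which is why size() on a UTF-8 std::string counts code units rather than characters.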
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
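Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string. Note that the first line compiles only up to C++17: since C++20, u8"foo" is an array of char8_t and can no longer initialize a std::string.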
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
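One possible heuristic is to test whether the bytes form well-formed UTF-8. The sketch below is an assumption about what such a check might look like, and the name `looks_like_utf8` is made up for this example. A positive result is not proof: pure ASCII passes the check too, and short Latin-1 strings can pass by coincidence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical heuristic: returns true if `s` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// encoded surrogates, and values above U+10FFFF.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;      // number of continuation bytes expected
        unsigned long cp = 0;       // code point being assembled
        unsigned long min = 0;      // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;          // continuation byte or invalid lead byte

        if (s.size() - i - 1 < extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;     // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += extra + 1;
    }
    return true;
}
```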
What are differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

Understanding unicode codecvt

I have a UTF-16 encoded stream and I'd like to convert it into plain ASCII, i.e. if there's an ASCII character -> print it. If a code unit represents something else that I don't care about (e.g. Chinese characters) -> output garbage.
I'm using this code
typedef std::codecvt_utf16<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
std::string converted = converterX.from_bytes(str);
and it seems to work.. but why?
documentation for codecvt_utf16 states:
std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).
UCS2 is a version of unicode as far as I know.. so this code is converting to a sequence of wchar_t bytes that represent unicode characters right? How come I'm getting ASCII bytes?
The nice thing about unicode is that unicode values 0-127 represent ASCII characters 0-127.
So, you don't even need to waste your time with std::codecvt. All you have to do is scan your UTF-16 sequence, grab all UTF-16 values in the range of 0-127 (see the wikipedia entry for UTF-16 for the simple process of extracting UTF-16 values from the bytestream), and you'll end up with plain ASCII, as if by magic. That's because, by definition, values above 127 are not plain ASCII. You can do whatever you want with all other characters.
And, if you would like to expand your universe to iso-8859-1, rather than US-ASCII, you can expand your range to 0-255. Because unicode values 128-255 are also equivalent to characters 128-255 in the iso-8859-1 codeset.
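The scan described in this answer might look like the sketch below. It assumes the byte stream is little-endian UTF-16 without a byte order mark; the function name `utf16le_to_ascii` and the '?' replacement are choices made for this example only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: `bytes` is assumed to hold UTF-16LE data with no BOM.
// Code units below 128 are copied as ASCII; everything else becomes
// `replacement`. A trailing odd byte is ignored.
std::string utf16le_to_ascii(const std::string& bytes, char replacement = '?') {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned unit = lo | (hi << 8);   // assemble one 16-bit code unit
        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate: its high half already produced a replacement,
            // so a surrogate pair yields one output character, not two.
            // (An unpaired low surrogate is silently dropped.)
            continue;
        }
        out.push_back(unit < 0x80 ? static_cast<char>(unit) : replacement);
    }
    return out;
}
```

For ISO-8859-1 output instead of US-ASCII, the threshold would be 0x100 rather than 0x80, with the code unit's value stored directly as the output byte.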

What is the difference between "UTF-16" and "std::wstring"?

Is there any difference between these two string storage formats?
std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.
UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.
Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).
The encoding used with wchar_t does not necessarily have to be UTF-16—it could also be UTF-32 for example.
UTF-16 is a concept of text represented in 16-bit elements but an actual textual character may consist of more than one element
std::wstring is just a collection of these elements, and is a class primarily concerned with their storage.
The elements of a wstring are of type wchar_t, which is typically 16 bits wide but can be 32 bits.
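The difference between "number of elements" and "number of characters" can be shown with a short sketch. std::u16string is used here instead of std::wstring so the element width is fixed at 16 bits on every platform, and the helper name `count_code_points` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point;
// an unpaired surrogate is counted as one (malformed) unit.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

For the string u"a\U0001F600", size() reports 3 code units while this function reports 2 code points, since the container only stores elements and knows nothing about surrogate pairs.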

c++: getting ascii value of a wide char

let's say i have a char array like "äa".
is there a way to get the ascii value (e.g 228) of the first char, which is a multibyte?
even if i cast my array to a wchar_t * array, i'm not able to get the ascii value of "ä", because its 2 bytes long.
is there a way to do this, im trying for 2 days now :(
i'm using gcc.
thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in 8 bit character sets ISO-8859-1, CP1252 and some others. It also is the UCS value of ä in the Unicode system. If you use string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8 and you may wish to parse the UTF-8 coding to acquire Unicode UCS values.
More likely what you really want to do is converting from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want? A std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ facility for that conversion: the narrow() member function of the std::ctype facet, which is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
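A minimal sketch of calling the facet follows. The wrapper name `narrow_with_default` is made up for this example; the facet call itself is the range overload of std::ctype<wchar_t>::narrow. Which wide characters actually have a mapping depends on the locale passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical wrapper around std::ctype<wchar_t>::narrow. Every wide
// character without a char equivalent in `loc` is replaced by `fallback`.
std::string narrow_with_default(const std::wstring& in,
                                char fallback = '?',
                                const std::locale& loc = std::locale()) {
    const std::ctype<wchar_t>& facet = std::use_facet<std::ctype<wchar_t> >(loc);
    std::string out(in.size(), '\0');
    // Range overload: narrow(first, last, default_char, destination).
    facet.narrow(in.data(), in.data() + in.size(), fallback, &out[0]);
    return out;
}
```

With the default-constructed locale (the classic "C" locale unless the program has changed the global locale), plain ASCII letters map to themselves; what happens to characters such as ä is locale-dependent.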
Depends on the encoding used in your char array.
If your char array is Latin 1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific unicode-to-wchar_t conversion routine. On linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
// handle error
}
If it's SHIFT-JIS then this doesn't work...
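For illustration only, this is roughly what the "implement it yourself" route looks like for a single character. The name `decode_utf8` is hypothetical, and the sketch deliberately skips checks a production decoder needs (overlong forms, surrogates, values above U+10FFFF).

```cpp
#include <cstddef>

// Hypothetical helper: decodes the first UTF-8 sequence in p[0..len).
// On success stores the code point in `cp` and returns the number of bytes
// consumed; returns 0 if the input is empty, truncated, or malformed.
// Simplified: overlong forms and surrogates are NOT rejected.
std::size_t decode_utf8(const char* p, std::size_t len, char32_t& cp) {
    if (len == 0) return 0;
    const unsigned char lead = static_cast<unsigned char>(p[0]);
    std::size_t need = 0;
    if (lead < 0x80)                { cp = lead; return 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; need = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; need = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; need = 4; }
    else return 0;                  // continuation byte or invalid lead byte
    if (len < need) return 0;       // sequence is truncated
    for (std::size_t k = 1; k < need; ++k) {
        const unsigned char c = static_cast<unsigned char>(p[k]);
        if ((c & 0xC0) != 0x80) return 0;   // expected 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);        // append six payload bits
    }
    return need;
}
```

Applied to the three-byte array from this answer (11000011, 10100100, 0x61), the function consumes 2 bytes and yields 228 (0xE4), the value the question was looking for.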
what you want is called transliteration - converting letters of one language to another. it has nothing about unicode and wchars. you need to have a table of mapping.

Can BSTR's hold characters that take more than 16 bits to represent?

I am confused about Windows BSTR's and WCHAR's, etc. WCHAR is a 16-bit character intended to allow for Unicode characters. What about characters that take more than 16 bits to represent? Some UTF-8 chars require more than that. Is this a limitation of Windows?
Edit: Thanks for all the answers. I think I understand the Unicode aspect. I am still confused on the Windows/WCHAR aspect though. If WCHAR is a 16-bit char, does Windows really use 2 of them to represent code-points bigger than 16-bits or is the data truncated?
UTF-8 is not the encoding used in Windows' BSTR or WCHAR types. Instead, they use UTF-16, which defines each code point in the Unicode set using either 1 or 2 WCHARs. 2 WCHARs gives exactly the same amount of code points as 4 bytes of UTF-8.
So there is no limitation in Windows character set handling.
UTF8 is an encoding of a Unicode character (codepoint). You may want to read this excellent faq on the subject. To answer your question though, BSTRs are always encoded as UTF-16. If you have UTF-32 encoded strings, you will have to transcode them first.
As others have mentioned, the FAQ has a lot of great information on unicode.
The short answer to your question, however, is that a single unicode character may require more than one 16bit character to represent it. This is also how UTF-8 works; any unicode character that falls outside the range that a single byte is able to represent uses two (or more) bytes.
BSTR simply contains 16 bit code units that can contain any UTF-16 encoded data. As for the OS, Windows has supported surrogate pairs since XP. See the Dr International FAQ
The Unicode standard defines somewhere over a million unique code-points (each code-point represents an 'abstract' character or symbol - e.g. 'E', '=' or '~').
The standard also defines several methods of encoding those million code-points into commonly used fundamental data types, such as 8-bit characters, or 16-bit wchars.
The two most widely used encodings are utf-8 and utf-16.
utf-8 defines how to encode unicode code points into 8-bit chars. Each unicode code-point will map to between 1 and 4 8-bit chars.
utf-16 defines how to encode unicode code points into 16-bit words (WCHAR in Windows). Most code-points will map onto a single 16-bit WCHAR, but there are some that require two WCHARs to represent.
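The one-or-two-WCHAR rule can be sketched as below. char16_t stands in for WCHAR so the example is not tied to Windows headers, and the name `append_utf16` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-16 encoding of code point `cp` to
// `out`. Returns false (and appends nothing) for surrogate values and for
// values above U+10FFFF.
bool append_utf16(char32_t cp, std::u16string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x10000) {
        // BMP code point: stored directly in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: subtract 0x10000, then split the
        // remaining 20 bits into a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return true;
}
```

For U+1F600 this produces the two units 0xD83D and 0xDE00, so nothing is truncated; the code point is simply spread across a surrogate pair.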
I recommend taking a look at the Unicode standard, and especially the FAQ (http://unicode.org/faq/utf_bom.html)
Windows has used UTF-16 as its native representation since Windows 2000; prior to that it used UCS-2. UTF-16 supports any Unicode character; UCS-2 only supports the BMP. i.e. it will do the right thing.
In general, though, it doesn't matter much, anyway. For most applications strings are pretty opaque, and just passed to some I/O mechanism (for storage in a file or database, or display on-screen, etc.) that will do the Right Thing. You just need to ensure you don't damage the strings at all.