Is a wstring character Unicode? What happens during conversion? - c++

Recently I have been dealing with conversions from UTF-8 to string and vice versa. I understand that UTF-8 can represent almost every character in the world, whereas char, the built-in type underlying string, can only store ASCII values. A character in UTF-8 takes between 1 and 4 bytes in memory, but a char is usually 1 byte.
My question is: what happens in a conversion from wstring to string, or from wchar_t to char?
Are characters that require more than one byte skipped? It seems to depend on the implementation, but I want to know the correct way of doing it.
Also, is wchar_t required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar_t?

It depends on how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.
Now, when people say "Unicode" in a Windows context, they usually mean UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains there.
So, the right way to convert from UTF8 to UTF16:
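#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );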
And the other way around:
std::wstring utf16String = L"blah blah";  // wide literals need the L prefix
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );  // the result is UTF-8
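To make the UTF-16 to UTF-8 direction less of a black box, here is a minimal, dependency-free sketch of what such a conversion does. The function name utf16_to_utf8 is hypothetical (it is not part of the standard library or of the answer above), and it takes a std::u16string so that the code unit size does not depend on the platform's wchar_t.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not from the answer): UTF-16 -> UTF-8.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        assert(cp <= 0x10FFFF);  // holds for any decoded surrogate pair

        // Emit 1 to 4 UTF-8 code units depending on the code point's size.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Nothing is skipped in a conversion like this: a character that does not fit in one char is simply spread over several of them.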
And to add to the confusion:
When you use std::string on a Windows platform (for example in a multibyte build), it is NOT UTF-8. It holds ANSI text.
More specifically, it uses the default code page configured for your Windows installation.
The Windows API functions expect these formats (a Unicode build selects the W variants):
CommandA - multibyte - ANSI
CommandW - Unicode - UTF16

Make your source files UTF-8 encoded, set the character encoding to UNICODE in your IDE.
Use std::string and widen them for WindowsAPI calls.
std::string somestring = "こんにちは";
WindowsApiW(widen(somestring).c_str());
I know it sounds kind of hacky, but a more thorough explanation of this approach can be found at utf8everywhere.org. Note that widen is a helper you have to supply yourself; it is not a standard function.
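The widen helper used above is not defined in the answer. As a rough, portable sketch of what it might look like, the version below decodes UTF-8 by hand and assumes that wchar_t holds UTF-16 when it is 2 bytes wide and UTF-32 otherwise. On Windows you would more likely wrap MultiByteToWideChar with CP_UTF8 instead; this sketch also does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the undefined widen(): UTF-8 -> wchar_t string.
// Assumption: 2-byte wchar_t means UTF-16, anything else means UTF-32.
std::wstring widen(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        assert(i <= utf8.size());
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            // UTF-16: code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

With a helper like this in place, a call such as WindowsApiW(widen(somestring).c_str()) keeps UTF-8 as the internal format and converts only at the API boundary.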

Related

Problems converting std::wstring to UnicodeString using "UTF8ToUnicodeString"

In my project (using Embarcadero C++Builder), I am trying to convert a std::wstring to a UnicodeString using UTF8ToUnicodeString() from <system.hpp>.
The result shows some replacement characters (U+FFFD) for some Russian and Vietnamese characters. But most characters are shown correctly.
Does anybody know what the problem could be? Is it a problem with codepages?
First, neither std::wstring nor UnicodeString use UTF-8, so you should not be using UTF8ToUnicodeString() at all in this situation. UnicodeString uses UTF-16 on all platforms. std::wstring uses UTF-16 on Windows, and UTF-32 on most other platforms.
Second, std::wstring is a wchar_t-based string. UnicodeString uses wchar_t on Windows, and char16_t on other platforms. It has constructors that accept C-style wchar_t* string pointers as input, and will convert the data to UTF-16 if needed.
So, you can simply use the std::wstring::c_str() method to convert std::wstring to UnicodeString, eg:
std::wstring w = ...;
UnicodeString u = w.c_str();
Alternatively:
std::wstring w = ...;
UnicodeString u(w.c_str(), w.size());
If you try to assign a wchar_t* string to a RawByteString, such as for the input to UTF8ToUnicodeString(), the RTL will perform a Unicode->ANSI conversion to the default system ANSI codepage specified by System::DefaultSystemCodePage, which is not universally set to UTF-8 on all platforms (especially on Windows). That is why you may lose characters, or even end up with Mojibake.

Difference between "codecvt_utf8_utf16" and "codecvt_utf8" for converting from UTF-8 to UTF-16

I came across two code snippets
std::wstring str = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().from_bytes("some utf8 string");
and,
std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some utf8 string");
Are they both correct ways to convert utf-8 stored in std::string to utf-16 in std::wstring?
codecvt_utf8_utf16 does exactly what it says: converts between UTF-8 and UTF-16, both of which are well-understood and portable encodings.
codecvt_utf8 converts between UTF-8 and UCS-2/4 (depending on the size of the given type). UCS-2 and UTF-16 are not the same thing.
So if your goal is to store genuine, actual UTF-16 in a wchar_t, then you should use codecvt_utf8_utf16. However, if you're trying to do cross-platform coding with wchar_t as some kind of Unicode-ish thing or whatever, you can't. The UTF-16 facet always converts to UTF-16, whereas wchar_t on non-Windows platforms is expected to generally be UTF-32/UCS-4. By contrast, codecvt_utf8 only converts to UCS-2/4, but on Windows, wchar_t strings are "supposed" to be full UTF-16.
So you can't write code to satisfy all platforms without some #ifdef or template work. On Windows, you should use codecvt_utf8_utf16; on non-Windows, you should use codecvt_utf8.
Or better yet, just use UTF-8 internally and find APIs that directly take strings in a specific format rather than platform-dependent wchar_t stuff.

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
I want to write a program that can efficiently perform operations on Unicode characters and can read and write files in UTF-8 or UTF-16 encodings. What is the appropriate way to do this in C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant C++ code that is leak-free. I need clear answers to the following:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generically, UTF-8 is good to use for storage and communication purposes, and is backwards compatible with ASCII. Whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings instead. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendors, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for UTF-16 or UTF-32 (depending on compiler/platform).
u for UTF-16.
U for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So make sure that whatever charset you choose to save your files in, like UTF-8, that you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
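As a small illustration of the prefixes, the sketch below counts the code units each kind of literal produces. The helper code_units and the constant names are made up for this example. Universal character names (\u and \U escapes) are used so that the result does not depend on the charset of the source file: U+0105 is ą, and U+1F600 is a character outside the BMP.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// u8: UTF-8, so U+0105 takes two code units (bytes).
constexpr std::size_t utf8_units = code_units(u8"\u0105");

// u: UTF-16, so a character outside the BMP takes a surrogate pair.
constexpr std::size_t utf16_units = code_units(u"\U0001F600");

// U: UTF-32, always one code unit per code point.
constexpr std::size_t utf32_units = code_units(U"\U0001F600");

// L: depends on the platform's wchar_t (UTF-16 or UTF-32).
constexpr std::size_t wide_units = code_units(L"\U0001F600");
```

Because the helper is a template over the character type, it accepts the u8 literal both in C++17 (where it is an array of char) and in C++20 (where it is an array of char8_t).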
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each string character may be a single byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.
In the same way, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF encoded codeunit. UTF-8 encodes a Unicode codepoint as 1-4 codeunits (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 codeunits (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 codeunit (1 char32_t in a std::u32string).
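The difference between codeunits and codepoints can be made concrete with a short sketch. The function count_code_points is a hypothetical helper that assumes its input is already valid UTF-8; it relies on the fact that every byte of the form 10xxxxxx is a continuation byte, so every other byte starts a new codepoint.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts codepoints in a string assumed to be valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        // Continuation bytes look like 10xxxxxx; all other bytes start a codepoint.
        if ((c & 0xC0) != 0x80) ++n;
    }
    assert(n <= utf8.size());  // never more codepoints than codeunits
    return n;
}
```

For a string holding the four bytes F0 9F 98 80 (U+1F600), size() reports 4 codeunits while this helper reports 1 codepoint.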
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
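One possible form of such a heuristic is a structural check for well-formed UTF-8. The sketch below uses a hypothetical function name, looks_like_utf8, and rejects bad lead bytes, truncated sequences, overlong forms, encoded surrogates and values above U+10FFFF. It can only say that the bytes are consistent with UTF-8: pure ASCII passes as well, and text in another encoding may occasionally pass by accident.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical heuristic: true if the bytes form well-formed UTF-8.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;  // total length of this sequence in bytes
        char32_t cp = 0;      // decoded code point
        char32_t min_cp = 0;  // smallest value that legitimately needs len bytes
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // encoded surrogate
        i += len;
    }
    assert(i == s.size());
    return true;
}
```

A lone byte E9 (é in Latin-1) fails this check, while the two-byte sequence C3 A9 (é in UTF-8) passes.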
What are differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is a wchar_t character or wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

std::string conversion to char32_t (unicode characters)

I need to read a file using fstream in C++ that has ASCII as well as Unicode characters using the getline function.
But the function works only with std::string, and the characters of these simple strings cannot be converted into char32_t so that I can compare them with Unicode characters. Could anyone suggest a fix?
char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So generally, universally encoded files are in UTF-8. They use a variable number of bytes to encode each character, from 1 (for ASCII characters) to 4. This means you cannot directly test individual characters by indexing into a std::string.
To do that, you need to convert the UTF-8 string to a wchar_t string, stored in a std::wstring.
Define a converter like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual Unicode characters. Don't forget to put an L before each character or string literal, to make it a wide literal. For instance:
if(unicodeString[i]==L'仮')
{
info("this is some japanese character");
}
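As an alternative sketch (not the approach of the answer above), you can also stay in UTF-8 and search each line for the UTF-8 encoding of the character you are interested in. This works because a well-formed UTF-8 sequence can never match in the middle of another character. The helper names encode_utf8 and count_lines_containing are hypothetical, and the input stream is assumed to contain valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: UTF-8 encoding of a single code point.
std::string encode_utf8(char32_t cp) {
    // Precondition: cp is a Unicode scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Counts the lines of a UTF-8 stream that contain the given code point.
std::size_t count_lines_containing(std::istream& in, char32_t cp) {
    const std::string needle = encode_utf8(cp);
    std::size_t hits = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(needle) != std::string::npos) ++hits;
    }
    return hits;
}
```

The same function works with a std::ifstream opened on the file, since it only requires a std::istream.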

How to convert from std::string to SQLWCHAR*

I'm trying to convert std::string to SQLWCHAR*, but I couldn't find how.
Any brilliant suggestion, please?
One solution would be to simply use a std::wstring in the first place, rather than std::string. With Unicode Character set you can define a wide string literal using the following syntax:
std::wstring wstr = L"hello world";
However, if you would like to stick with std::string then you will need to convert the string to a different encoding. This will depend on how your std::string is encoded. The default encoding in Windows is ANSI (however the most common encoding when reading files or downloading text from websites is usually UTF8).
This answer shows a function for converting a std::string to a std::wstring on Windows (using the MultiByteToWideChar function).
https://stackoverflow.com/a/27296/966782
Note that the answer uses CP_ACP as the input encoding (i.e. ANSI). If your input string is UTF8 then you can change to use CP_UTF8 instead.
Once you have a std::wstring you should be able to easily retrieve a SQLWCHAR* using:
std::wstring::c_str()