Problems converting std::wstring to UnicodeString using "UTF8ToUnicodeString" - c++

In my project (using Embarcadero C++Builder), I am trying to convert a std::wstring to a UnicodeString using UTF8ToUnicodeString() from <system.hpp>.
The result shows some replacement characters (U+FFFD) for some Russian and Vietnamese characters. But most characters are shown correctly.
Does anybody know what the problem could be? Is it a problem with codepages?

First, neither std::wstring nor UnicodeString use UTF-8, so you should not be using UTF8ToUnicodeString() at all in this situation. UnicodeString uses UTF-16 on all platforms. std::wstring uses UTF-16 on Windows, and UTF-32 on most other platforms.
Second, std::wstring is a wchar_t-based string. UnicodeString uses wchar_t on Windows, and char16_t on other platforms. It has constructors that accept C-style wchar_t* string pointers as input, and will convert the data to UTF-16 if needed.
So, you can simply use the std::wstring::c_str() method to convert std::wstring to UnicodeString, eg:
std::wstring w = ...;
UnicodeString u = w.c_str();
Alternatively:
std::wstring w = ...;
UnicodeString u(w.c_str(), w.size());
If you assign a wchar_t* string to a RawByteString, such as for the input to UTF8ToUnicodeString(), the RTL performs a Unicode->ANSI conversion to the default system ANSI codepage specified by System::DefaultSystemCodePage. That codepage is not set to UTF-8 on all platforms (especially not on Windows), which is why you may lose characters, or even end up with Mojibake.
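To illustrate what "convert the data to UTF-16 if needed" means, here is a minimal portable sketch using only standard C++ types. It is not C++Builder's implementation, and the helper name wide_to_utf16 is made up for this example. It assumes the input holds valid code points.

```cpp
#include <string>

// Hypothetical helper for illustration: re-encode a wchar_t string as UTF-16.
// With a 16-bit wchar_t (Windows) the data is already UTF-16 and is copied
// unit by unit; with a 32-bit wchar_t every code point above U+FFFF is split
// into a surrogate pair. No validation of the input is attempted.
std::u16string wide_to_utf16(const std::wstring& w) {
    std::u16string out;
    out.reserve(w.size());
    for (wchar_t wc : w) {
        char32_t cp = static_cast<char32_t>(wc);
        if (sizeof(wchar_t) == 2 || cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
        }
    }
    return out;
}
```

Note that no codepage is involved anywhere in this path, which is why going directly from wchar_t data to UTF-16 does not lose Russian or Vietnamese characters.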

Related

Difference between "codecvt_utf8_utf16" and "codecvt_utf8" for converting from UTF-8 to UTF-16

I came across two code snippets
std::wstring str = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().from_bytes("some utf8 string");
and,
std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some utf8 string");
Are they both correct ways to convert utf-8 stored in std::string to utf-16 in std::wstring?
codecvt_utf8_utf16 does exactly what it says: converts between UTF-8 and UTF-16, both of which are well-understood and portable encodings.
codecvt_utf8 converts between UTF-8 and UCS-2/4 (depending on the size of the given type). UCS-2 and UTF-16 are not the same thing.
So if your goal is to store genuine, actual UTF-16 in a wchar_t, then you should use codecvt_utf8_utf16. However, if you're trying to do cross-platform coding with wchar_t as some kind of Unicode-ish thing or whatever, you can't. The UTF-16 facet always converts to UTF-16, whereas wchar_t on non-Windows platforms is expected to generally be UTF-32/UCS-4. By contrast, codecvt_utf8 only converts to UCS-2/4, but on Windows, wchar_t strings are "supposed" to be full UTF-16.
So you can't write code to satisfy all platforms without some #ifdef or template work. On Windows, you should use codecvt_utf8_utf16; on non-Windows, you should use codecvt_utf8.
Or better yet, just use UTF-8 internally and find APIs that directly take strings in a specific format rather than platform-dependent wchar_t stuff.
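The "template work" mentioned above could look roughly like the following sketch, which picks the facet from the size of wchar_t instead of using an #ifdef. The names WideFacet, utf8_to_wide and wide_to_utf8 are made up for this example. Keep in mind that std::wstring_convert and the <codecvt> facets were deprecated in C++17, so compilers may emit deprecation warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Choose the facet that matches what wchar_t holds on this platform:
// 16-bit wchar_t (e.g. Windows) -> UTF-16, 32-bit wchar_t -> UCS-4.
using WideFacet = std::conditional<
    sizeof(wchar_t) == 2,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>>::type;

// UTF-8 bytes -> platform-native wide string.
// from_bytes throws std::range_error on invalid input.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<WideFacet> conv;
    return conv.from_bytes(utf8);
}

// Platform-native wide string -> UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<WideFacet> conv;
    return conv.to_bytes(wide);
}
```

With this, a character outside the BMP takes two wchar_t units where wchar_t is 16 bits and one unit where it is 32 bits, which is exactly the platform difference described above.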

Visual C++ - UTF-8 - CA2W followed by CW2T with MBCS - Possibly a bad idea?

I'm using a library that produces UTF-8 null-terminated strings in the const char* type. Examples include:
MIGUEL ANTÓNIO
DONA ESTEFÂNIA
I'd like to convert those two const char* types to CString so that they read:
MIGUEL ANTÓNIO
DONA ESTEFÂNIA
To that effect, I'm using the following function I made:
CString Utf8StringToCString(const char * s)
{
CStringW ws = CA2W(s, CP_UTF8);
return CW2T(ws);
}
The function seems to do what I want (at least for those 2 cases). However, I'm wondering: is it a good idea at all to use the CA2W macro, followed by CW2T? Am I doing some sort of lossy conversion by doing this? Are there any side-effects I should worry about?
Some other details:
I'm using Visual Studio 2015
My application is compiled using Use Multi-Byte Character Set
Even if your application is compiled as MBCS, you can still use Unicode strings, buffers, and Windows Unicode APIs without any issue.
Pass your strings around as UTF-8 either with a raw pointer (const char*) or in a string class such as CString or std::string. When you actually need to render the string for display, convert to Unicode and use the W API explicitly.
For example:
void UpdateDisplayText(const char* s)
{
CStringW ws = CA2W(s, CP_UTF8);
SetDlgItemTextW(m_hWnd, IDC_LABEL1, ws);
}

Is a wstring character Unicode? What happens during conversion?

Recently I have been coming across conversions from UTF-8 to string and vice versa. I understand that UTF-8 can hold almost all the characters in the world, while char, the built-in data type used for strings, can only store ASCII values. A character in UTF-8 needs between one and four bytes in memory, but a 'char' is usually 1 byte.
My question is: what happens in a conversion from wstring to string, or from wchar to char?
Are the characters that require more than one byte skipped? It seems to depend on the implementation, but I want to know the correct way of doing it.
Also, is wchar required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar?
Depends how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.
Now, usually when one says "Unicode", one means UTF-16, which is what Microsoft Windows uses, and that is usually what wstring contains there.
So, the right way to convert from UTF8 to UTF16:
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = L"blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );
And to add to the confusion:
When you use std::string on a Windows platform (like when you use a multibyte compilation), it's NOT UTF-8. It's ANSI.
More specifically, it's the default ANSI code page your Windows installation is using.
When compiling in Unicode the windows API commands expect these formats:
CommandA - multibyte - ANSI
CommandW - Unicode - UTF16
Make your source files UTF-8 encoded, set the character encoding to UNICODE in your IDE.
Use std::string and widen them for WindowsAPI calls.
std::string somestring = "こんにちは";
WindowsApiW(widen(somestring).c_str());
I know it sounds kind of hacky, but a more thorough explanation of this issue can be found at utf8everywhere.org.
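As for what actually happens to the multi-byte characters during such a conversion: nothing is skipped. Each group of 1 to 4 UTF-8 bytes is decoded into one code point, which is then re-encoded in the target format. A minimal hand-rolled sketch of the decoding step follows. The name decode_utf8 is made up for this example, and it is deliberately simplified: it rejects truncated or malformed sequences but does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration: decode UTF-8 bytes into Unicode
// code points, one char32_t per character.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > s.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the three bytes E3 81 93 decode to the single code point U+3053 (こ). In real code, prefer a library or the OS conversion functions over a hand-written decoder like this.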

How to convert from std::string to SQLWCHAR*

I'm trying to convert std::string to SQLWCHAR*, but I couldn't find how.
Any brilliant suggestion, please?
One solution would be to simply use a std::wstring in the first place, rather than std::string. With Unicode Character set you can define a wide string literal using the following syntax:
std::wstring wstr = L"hello world";
However, if you would like to stick with std::string then you will need to convert the string to a different encoding. This will depend on how your std::string is encoded. The default encoding in Windows is ANSI (however the most common encoding when reading files or downloading text from websites is usually UTF8).
This answer shows a function for converting a std::string to a std::wstring on Windows (using the MultiByteToWideChar function).
https://stackoverflow.com/a/27296/966782
Note that the answer uses CP_ACP as the input encoding (i.e. ANSI). If your input string is UTF8 then you can change to use CP_UTF8 instead.
Once you have a std::wstring you should be able to easily retrieve a SQLWCHAR* using:
std::wstring::c_str()

BSTR to CString conversion for Arabic text

My VC++ (VS2008) project uses Multi-byte Character set.
I've the following code to convert a date string to COleDateTime
_bstr_t bstr_tDate = bstrDate; //bstrDate is populated by a COM function
const CString szStartDateTime = bstr_tDate.operator const char *();
bool bParseOK = oleDateTime.ParseDateTime(szStartDateTime);
This code works well in all regional settings, but fails in Arabic regional settings, where the input date is this format: 21/05/2012 11:50:31م
After conversion, the CString contains junk characters and parsing fails: 01/05/2012 11:50:28ã
Is there a BSTR to CString conversion that works in Arabic settings?
A BSTR is a string of UTF-16-encoded Unicode code units (wide "chars", 16-bit):
typedef WCHAR OLECHAR;
typedef OLECHAR* BSTR;
which means that special characters like 'م' are represented by a single WCHAR. In a multi-byte string (C-style char* or std::string), these special characters can be represented by more than one char (hence the name "multi-byte").
The reason your CString contains junk characters is that you retrieve a char* directly from the _bstr_t. You need to convert the wide-char string to a multi-byte string first. There are several ways to do that; one of them is to use the WideCharToMultiByte function.
This question will also help you: How do you properly use WideCharToMultiByte
What you are trying to do is possible with CString despite the MBCS setting, but it will only support Arabic.
It is probably much easier to start supporting all Unicode. This can be done without much damage to existing code (you can keep the std::string and char*) if you follow the instructions at the Windows section of utf8everywhere.org.