BSTR to CString conversion for Arabic text - c++

My VC++ (VS2008) project uses Multi-byte Character set.
I've the following code to convert a date string to COleDateTime
_bstr_t bstr_tDate = bstrDate; //bstrDate is populated by a COM function
const CString szStartDateTime = bstr_tDate.operator const char *();
bool bParseOK = oleDateTime.ParseDateTime(szStartDateTime);
This code works well in all regional settings, but fails in Arabic regional settings, where the input date is this format: 21/05/2012 11:50:31م
After conversion, the CString contains junk characters and parsing fails: 01/05/2012 11:50:28ã
Is there a BSTR to CString conversion that works in Arabic settings?

A BSTR is a string consisting of UTF-16 code units (wide "chars", 16 bits each):
typedef WCHAR OLECHAR;
typedef OLECHAR* BSTR;
which means that special characters like 'م' are represented by a single WCHAR. In a multi-byte string (C-style char* or std::string), these special characters are represented by more than one char (which is why it's called "multi-byte").
The reason your CString contains junk characters is that you retrieve a char* directly from the _bstr_t. You need to convert the wide-char string to a multi-byte string first. There are several ways to do that; one of them is the WideCharToMultiByte function.
This question will also help you: How do you properly use WideCharToMultiByte

What you are trying to do is possible with CString despite the MBCS setting, but it will only support Arabic.
It is probably much easier to start supporting all Unicode. This can be done without much damage to existing code (you can keep the std::string and char*) if you follow the instructions at the Windows section of utf8everywhere.org.

Related

Visual C++ - UTF-8 - CA2W followed by CW2T with MBCS - Possibly a bad idea?

I'm using a library that produces UTF-8 null-terminated strings in the const char* type. Examples include:
MIGUEL ANTÓNIO
DONA ESTEFÂNIA
I'd like to convert those two const char* types to CString so that they read:
MIGUEL ANTÓNIO
DONA ESTEFÂNIA
To that effect, I'm using the following function I made:
CString Utf8StringToCString(const char * s)
{
CStringW ws = CA2W(s, CP_UTF8);
return CW2T(ws);
}
The function seems to do what I want (at least for those 2 cases). However, I'm wondering: is it a good idea at all to use the CA2W macro, followed by CW2T? Am I doing some sort of lossy conversion by doing this? Are there any side-effects I should worry about?
Some other details:
I'm using Visual Studio 2015
My application is compiled using Use Multi-Byte Character Set
Even if your application is compiled as MBCS, you can still use Unicode strings, buffers, and Windows Unicode APIs without any issue.
Pass your strings around as UTF-8 either with a raw pointer (const char*) or in a string class such as CString or std::string. When you actually need to render the string for display, convert to Unicode and use the W API explicitly.
For example:
void UpdateDisplayText(const char* s)
{
CStringW ws = CA2W(s, CP_UTF8);
SetDlgItemTextW(m_hWnd, IDC_LABEL1, ws);
}

Is wstring character is Unicode ? What happens during conversion?

Recently I have been coming across conversions between UTF-8 encoding and string, and vice versa. I understand that UTF-8 encoding can hold almost all the characters in the world, while char, the built-in data type used by string, can only store ASCII values. For a character in UTF-8 encoding the number of bytes required in memory varies from one byte to 4 bytes, but for the char type it is usually 1 byte.
My question is: what happens in a conversion from wstring to string, or from wchar_t to char?
Are the characters that require more than one byte skipped? It seems to depend on the implementation, but I want to know what the correct way of doing it is.
Also, is wchar_t required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar_t?
Depends how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.
Now, usually when one says "Unicode" one means UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains.
So, the right way to convert from UTF8 to UTF16:
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = L"blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );
And to add to the confusion:
When you use std::string on a Windows platform (as in a multi-byte compilation), it's NOT UTF-8. It's ANSI.
More specifically, the default ANSI code page your Windows installation is using.
When compiling in Unicode the windows API commands expect these formats:
CommandA - multibyte - ANSI
CommandW - Unicode - UTF16
Make your source files UTF-8 encoded, set the character encoding to UNICODE in your IDE.
Use std::string and widen them for WindowsAPI calls.
std::string somestring = "こんにちは";
WindowsApiW(widen(somestring).c_str());
I know it sounds kind of hacky but a more profound explanation of this issue can be found at utf8everywhere.org.
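The widen helper used above is not a standard function; a minimal sketch using the same std::wstring_convert machinery as the earlier snippets (deprecated since C++17 but still available) could look like:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 encoded std::string to a std::wstring
// (UTF-16 on Windows, suitable for the W-suffixed API calls).
std::wstring widen(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    return convert.from_bytes(utf8);
}
```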

std::string conversion to char32_t (unicode characters)

I need to read a file using fstream in C++ that has ASCII as well as Unicode characters using the getline function.
But the function only uses std::string, and a simple string's characters cannot be converted into char32_t so that I can compare them with Unicode characters. Could anyone suggest a fix?
char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So generally, universally encoded files are in UTF-8. UTF-8 uses a variable number of bytes per character, from 1 (ASCII characters) to 4. This means you cannot directly test the individual characters using a std::string.
For this, you need to convert the UTF-8 string to a wchar_t string, stored in a std::wstring.
For this, use a converter defined like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual Unicode characters. Don't forget to put an L before each string literal, to make it a wide-string literal. For instance:
if(unicodeString[i]==L'仮')
{
info("this is some japanese character");
}
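If you do want to work with char32_t directly, as the question asks, a sketch of converting a UTF-8 std::string to std::u32string (the function name is made up for this sketch; std::wstring_convert is deprecated since C++17, and older MSVC releases had known issues with the char32_t specializations):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 encoded std::string to a std::u32string so that each
// element is one full Unicode code point (char32_t, UTF-32).
std::u32string Utf8ToUtf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(utf8);
}
```

You can then compare individual elements against U-prefixed character literals, e.g. `if (u32str[i] == U'仮')`.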

How to convert from std::string to SQLWCHAR*

I'm trying to convert std::string to SQLWCHAR*, but I couldn't find how.
Any brilliant suggestion, please?
One solution would be to simply use a std::wstring in the first place, rather than std::string. With Unicode Character set you can define a wide string literal using the following syntax:
std::wstring wstr = L"hello world";
However, if you would like to stick with std::string then you will need to convert the string to a different encoding. This will depend on how your std::string is encoded. The default encoding in Windows is ANSI (however the most common encoding when reading files or downloading text from websites is usually UTF8).
This answer shows a function for converting a std::string to a std::wstring on Windows (using the MultiByteToWideChar function).
https://stackoverflow.com/a/27296/966782
Note that the answer uses CP_ACP as the input encoding (i.e. ANSI). If your input string is UTF8 then you can change to use CP_UTF8 instead.
Once you have a std::wstring you should be able to easily retrieve a SQLWCHAR* using:
std::wstring::c_str()
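A sketch of the full chain, assuming the input std::string is UTF-8 and that SQLWCHAR is the usual 16-bit character type (the typedef is repeated here only so the sketch is self-contained; in real code it comes from sqltypes.h):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Normally provided by sqltypes.h; repeated here to keep the sketch standalone.
typedef unsigned short SQLWCHAR;

// Convert a UTF-8 std::string to a UTF-16 buffer usable as a SQLWCHAR*.
std::u16string Utf8ToSqlWchar(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

// Usage: keep the u16string alive for as long as the pointer is needed.
// std::u16string buf = Utf8ToSqlWchar(myString);
// SQLWCHAR* p = reinterpret_cast<SQLWCHAR*>(&buf[0]);
```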

How do I convert a "pointer to const TCHAR" to a "std::string"?

I have a class which returns a typed pointer to a "const TCHAR". I need to convert it to a std::string but I have not found a way to make this happen.
Can anyone provide some insight on how to convert it?
Depending on your compiling settings, TCHAR is either a char or a WCHAR (or wchar_t).
If you are using the multi byte character string setting, then your TCHAR is the same as a char. So you can just set your string to the TCHAR* returned.
If you are using the unicode character string setting, then your TCHAR is a wide char and needs to be converted using WideCharToMultiByte.
If you are using Visual Studio, which I assume you are, you can change this setting in the project properties under Character Set.
Do everything Brian says. Once you get it in the codepage you need, then you can do:
std::string s(myTchar, myTchar+length);
or
std::wstring s(myTchar, myTchar+length);
to get it into a string.
You can also use the handy ATL text conversion macros for this, e.g.:
std::wstring str = CT2W(_T("A TCHAR string"));
CT2W = Const Text To Wide.
You can also specify a code page for the conversion, e.g.
std::wstring str = CT2W(_T("A TCHAR string"), CP_SOMECODEPAGE);
These macros (in their current form) have been available to Visual Studio C++ projects since VS2005.
It depends. If you haven't defined _UNICODE or UNICODE then you can make a string containing the character like this:
const TCHAR example = _T('Q');
std::string mystring(1, example);
If you are using _UNICODE and UNICODE then this approach may still work, but the character may not be convertible to a char. In this case you will need to convert the character to a string. Typically you would use a call like wcstombs or WideCharToMultiByte, which gives you fuller control over the encoding.
Either way you will need to allocate a buffer for the result and construct the std::string from this buffer, remembering to deallocate the buffer once you're done (or use something like vector<char> so that this happens automatically).
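A sketch of the wide case using wcstombs with a std::vector<char> buffer, as described above (the helper name is made up; this assumes every character is representable in the current locale's narrow encoding):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string (what TCHAR* is in a UNICODE build) to std::string
// using the current C locale's narrow encoding.
std::string WideToNarrow(const wchar_t* wide)
{
    // First call with a null buffer returns the required length in bytes.
    size_t len = std::wcstombs(NULL, wide, 0);
    if (len == (size_t)-1)
        return std::string(); // a character was not convertible in this locale
    std::vector<char> buf(len + 1); // the vector deallocates automatically
    std::wcstombs(buf.data(), wide, buf.size());
    return std::string(buf.data(), len);
}
```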