C++ string encoding UTF-8 / Unicode - c++

I am trying to send the character "Т" (not a normal capital T; Unicode decimal value 1058) from C++ to VB.
However, with the method below, Message is returned to VB and appears as "Ð¢", which is the above character's UTF-8 bytes misread as ANSI.
#if defined(_MSC_VER) && _MSC_VER > 1310
# define utf8(str) ConvertToUTF8(L##str)
// Converts a wide (UTF-16) string to UTF-8 in a static buffer.
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
# define utf8(str) str
#endif
BSTR _stdcall chatTest()
{
    BSTR Message;
    CString temp("temp test");
    temp += utf8("\u0422");
    int len = temp.GetLength();
    Message = SysAllocStringByteLen((LPCTSTR)temp, len + 1);
    return Message;
}
If I just do temp += "\u0422"; without the utf8 function, it sends the data as "?", and it actually is a question mark. (Sometimes Unicode characters show up as question marks in VB but still have the correct Unicode decimal value; that is not the case here: it really changes it to a question mark.)
In VB, if I output the String variable holding Message to a text file while it displays as "Ð¢", the file shows the proper "Т".
So as far as I can tell it is UTF-8 in C++, somehow gets treated as ANSI in VB (or before it is sent?), and then when written to a file it is read back as UTF-8.
I just need to keep the "Т" intact when sending from C++ to VB. I know VB strings can hold that character, because from another source within VB I am able to store it (it appears as a "?" but has the proper Unicode decimal value).
Any help is greatly appreciated.
Thanks

A BSTR is not UTF-8, it's UTF-16, which is what you get with the L"" prefix. Take out the UTF-8 conversion and use CStringW. And use LPCWSTR instead of LPCTSTR.
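A minimal sketch of what that might look like (assuming ATL/MFC's CStringW and <windows.h>; the function name is taken from the question):
BSTR _stdcall chatTest()
{
    CStringW temp(L"temp test");
    temp += L"\u0422"; // stay UTF-16 end to end; no UTF-8 step
    // A BSTR holds UTF-16, so allocate by character count, not byte count.
    return SysAllocStringLen(temp, temp.GetLength());
}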

Related

convert MFC's CString to int for both ASCII and UNICODE

Converting CString to an int in ASCII mode is as simple as
CString s("123");
int n = atoi(s);
However, that doesn't work for projects in UNICODE mode, as CString becomes a wide-char string.
How do I write my code to cover both ASCII and UNICODE modes without extra if statements?
Turns out there's a _ttoi() available just for that purpose:
CString s( _T("123") );
int n = _ttoi(s);
This works for both modes with no extra effort.
If you need to convert hexadecimal (or other-base) numbers you can resort to a more generic strtol() variant:
CString s( _T("0xFA3") );
int n = _tcstol(s, nullptr, 16);
There's a special version of CString that uses multibyte characters even if your build is specified for wide characters - CStringA. It will also convert from wide characters automatically.
CString s(_T("123"));
CStringA sa = s;
int n = atoi(sa);
There's a corresponding CStringW that only uses wide characters.
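A correspondingly minimal sketch for the wide-only side (assuming the CRT's _wtoi, the wide counterpart of atoi):
CStringW sw(L"123");
int n = _wtoi(sw); // parses the wide string as a decimal integer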

Storing unicode UTF-8 string in std::string

In response to discussion in
Cross-platform strings (and Unicode) in C++
How to deal with Unicode strings in C/C++ in a cross-platform friendly way?
I'm trying to assign a UTF-8 string to a std::string variable in a Visual Studio 2010 environment:
std::string msg = "महसुस";
However, when I view the string in the debugger, I only see "?????".
I have the file saved as Unicode (UTF-8 with signature),
and I'm using the "Use Unicode Character Set" project setting.
"महसुस" is Nepali text; it contains 5 characters and should occupy 15 bytes, but the Visual Studio debugger shows the size of msg as 5.
My question is:
How do I use std::string to just store the utf-8 without needing to manipulate it?
If you were using C++11 then this would be easy:
std::string msg = u8"महसुस";
But since you are not, you can use escape sequences and not rely on the source file's charset to manage the encoding for you; this way your code is more portable (in case you accidentally save it in a non-UTF-8 format):
std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
Otherwise, you might consider doing a conversion at runtime instead:
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // First call computes the required UTF-8 length; second call converts.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}
std::string msg = toUtf8(L"महसुस");
You can write msg.c_str(),s8 in the Watch window to see the UTF-8 string correctly.
If you have C++11, you can write u8"महसुस". Otherwise, you'll have to write the actual byte sequence, using \xxx for each byte in the UTF-8 sequence.
Typically, you're better off reading such text from a configuration file.
There is a way to display the right values thanks to the 's8' format specifier. If we append ,s8 to the variable name, Visual Studio reparses the text as UTF-8 and renders it correctly.
If you are using Microsoft Visual Studio 2008 Service Pack 1, you also need to apply this hotfix:
http://support.microsoft.com/kb/980263

Chinese character in source code when UTF-8 settings can't be used [duplicate]

This question already has an answer here:
PHP and C++ for UTF-8 code unit in reverse order in Chinese character
This is the scenario:
I can only use the char* data type for the string, not wchar_t *
My MS Visual C++ compiler has to be set to MBCS, not UNICODE, because the third-party source code that I have uses MBCS; setting it to UNICODE will cause data type issues.
I am trying to print Chinese characters on a printer, which needs to receive a character string so it can print correctly.
What should I do with this line to make the code correct: char * str = "你好";
Convert it to a hex sequence perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if (!ExtTextOutW(resource->dc, 1, 1, ETO_OPAQUE, NULL, wstr, convertedSize, NULL))
{
    return 0;
}
UPDATE: Let's put the question another way.
I have this: char * str contains a sequence of UTF-8 code units for the two Chinese characters 你好, yet ExtTextOutW still does not display wstr correctly, because I think my mbstowcs_s call is still not working correctly. Any idea why?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if (!ExtTextOutW(resource->dc, 1, 1, ETO_OPAQUE, NULL, wstr, len, NULL))
{
    return 0;
}
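A hedged aside on why this fails: mbstowcs_s converts using the ANSI code page of the current C locale, not UTF-8, so UTF-8 input comes out garbled. A minimal sketch of an explicit UTF-8 conversion instead (assuming <windows.h> and a valid device context hdc):
const char * utf8 = "\xE4\xBD\xA0\xE5\xA5\xBD"; // "你好" in UTF-8
int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0); // length incl. null
wchar_t * wstr = new wchar_t[wlen];
MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wstr, wlen);
ExtTextOutW(hdc, 1, 1, ETO_OPAQUE, NULL, wstr, wlen - 1, NULL); // exclude the null
delete[] wstr;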
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
As an adaptation of what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
P.S. This probably isn't C: cout << convertedSize;.

How to convert ANSI byte to Unicode string?

I have a vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string. The current code assumes that the characters in the vector<BYTE> are Unicode rather than ASCII. This works fine for standard ASCII but fails for extended ASCII characters. These characters need to be interpreted using the current code page retrieved via GetACP(). How would I go about creating a Unicode (UTF-16) string from these ASCII characters?
EDIT: I believe the solution should have something to do with the macros discussed here: http://msdn.microsoft.com/en-us/library/87zae4a3(v=vs.80).aspx I'm just not exactly sure how the actual implementation would go.
int ExtractByteArray(CATLString* pszResult, const CByteVector* pabData)
{
    // place the data into the output string
    pszResult->Empty();
    for (int iIndex = 0; iIndex < pabData->GetSize(); iIndex++)
        *pszResult += (TCHAR)pabData->GetAt(iIndex);
    return RC_SUCCESS;
}
You should use MultiByteToWideChar to convert that string to Unicode.
Since you're using MFC, let CString do the job.
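A minimal sketch combining both suggestions (assuming <windows.h>, <vector>, and <string>; ansiToWide is a hypothetical helper name, and the use of GetACP() follows the question):
// Convert ANSI bytes to UTF-16 using the current ANSI code page.
std::wstring ansiToWide(const std::vector<BYTE>& bytes)
{
    std::wstring out;
    int len = MultiByteToWideChar(GetACP(), 0,
                                  reinterpret_cast<const char*>(bytes.data()),
                                  static_cast<int>(bytes.size()), NULL, 0);
    if (len > 0)
    {
        out.resize(len);
        MultiByteToWideChar(GetACP(), 0,
                            reinterpret_cast<const char*>(bytes.data()),
                            static_cast<int>(bytes.size()), &out[0], len);
    }
    return out;
}
With MFC, CStringW's converting constructor should do the same (it converts from the ANSI code page):
CStringW wide(reinterpret_cast<LPCSTR>(bytes.data()), static_cast<int>(bytes.size()));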
I have a vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string
You should use std::vector<BYTE> only when you are working with binary data. When working with strings, use std::string instead. Note that this std::string object will contain special characters encoded as sequences of one or more bytes (hence called multi-byte characters), but these are not ASCII characters.
Once you use std::string, you can use MultiByteToWideChar to create your own function that converts a std::string (containing multi-byte UTF-8 characters) into a std::wstring containing UTF-16 encoded code points:
// multi-byte to wide char:
std::wstring s2ws(const std::string& str)
{
    // First call computes the required length; second call converts.
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
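Usage might look like this (the input literal is a hypothetical example):
std::string utf8 = "\xE0\xA4\xAE"; // "म" encoded as UTF-8
std::wstring wide = s2ws(utf8);    // UTF-16 on Windows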

How to find if a character belongs to a particular codepage using c++ or calling winapi

How can we find out whether a character belongs to a particular code page?
Or: how can we determine whether a character fits into the currently active IME for an application?
First, convert your UTF-8 string of characters to UTF-16 using MultiByteToWideChar.
Now, reverse the process using WideCharToMultiByte, passing the desired code page as the first parameter.
Use the WC_ERR_INVALID_CHARS flag and WideCharToMultiByte will fail outright if any invalid characters are used. If you want to know which characters are not representable in the target code page, use the lpDefaultChar and lpUsedDefaultChar parameters.
LPCWSTR pszUtf16 = /* converted from UTF-8 source character */;
UINT nTargetCP = CP_ACP;
BOOL fBadCharacter = FALSE;
if (WideCharToMultiByte(nTargetCP, WC_NO_BEST_FIT_CHARS,
                        pszUtf16, -1, // null-terminated input
                        NULL, 0,      // size query only, no output buffer
                        NULL, &fBadCharacter) != 0)
{
    if (fBadCharacter)
    {
        // at least one character in the string was not represented in nTargetCP
    }
}
The two previous answers have correctly suggested using MultiByteToWideChar, then WideCharToMultiByte, to translate your UTF-8 character to UTF-16 and then to the current Windows code page (CP_ACP). Check the result of WideCharToMultiByte to see whether the conversion succeeded.
What wasn't clear from the original question, is that you are having a particular issue with Hindi. For this language, your question is meaningless because there is no Windows ANSI codepage for Hindi, as Chris Becke pointed out. Therefore, you can never convert a Hindi character to CP_ACP, and WideCharToMultiByte will always fail.
To use Hindi on Windows, as far as I understand it, you must be a Unicode app that calls Unicode APIs.
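For illustration, a minimal sketch of the Unicode route (assuming <windows.h> and a valid device context hdc):
const wchar_t hindi[] = L"\u0930"; // 'र', DEVANAGARI LETTER RA
TextOutW(hdc, 10, 10, hindi, 1);   // the W APIs accept UTF-16 directly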
Using the Windows functions WideCharToMultiByte and MultiByteToWideChar, you can convert between UTF-8 and 16-bit Unicode characters. The functions have arguments to specify the code page and the behavior when an invalid character is encountered.
Thanks Chris. I am running the following code:
#include <windows.h>
#include <stdio.h>

#define CP_HINDI 0       // note: code page 0 is CP_ACP, the system default ANSI code page
#define CP_JAPANESE 932
#define CP_ENGLISH 1252

wchar_t wcsStringJapanese = L'あ';
wchar_t wcsStringHindi = L'र';
wchar_t wcsStringEnglish = L'A';

int main()
{
    BOOL usedDefaultCharacter = FALSE;

    /* Test for ENGLISH */
    WideCharToMultiByte(CP_ENGLISH, 0, &wcsStringEnglish, 1, // one character, not null-terminated
                        NULL, 0, NULL, &usedDefaultCharacter);
    printf("usedDefaultCharacters for English? %d \n", usedDefaultCharacter);
    usedDefaultCharacter = FALSE;

    /* Test for JAPANESE */
    WideCharToMultiByte(CP_JAPANESE, 0, &wcsStringJapanese, 1,
                        NULL, 0, NULL, &usedDefaultCharacter);
    printf("usedDefaultCharacters for Japanese? %d \n", usedDefaultCharacter);
    usedDefaultCharacter = FALSE;

    /* Test for HINDI */
    WideCharToMultiByte(CP_HINDI, 0, &wcsStringHindi, 1,
                        NULL, 0, NULL, &usedDefaultCharacter);
    printf("usedDefaultCharacters for Hindi? %d \n", usedDefaultCharacter);
    return 0;
}
The above code returns:
usedDefaultCharacters for English? 0
usedDefaultCharacters for Japanese? 0
usedDefaultCharacters for Hindi? 1
The third line is incorrect: the code page passed for Hindi is 0, and the string passed consists of a Hindi character, yet usedDefaultChar is still set to 1, which should not be the case.