Convert MFC's CString to int for both ASCII and UNICODE - C++

Converting CString to an int in ASCII mode is as simple as
CString s("123");
int n = atoi(s);
However, that doesn't work for projects in UNICODE mode, as CString becomes a wide-character string.
How do I write my code to cover both ASCII and UNICODE modes without extra if statements?

Turns out there's a _ttoi() available just for that purpose:
CString s( _T("123") );
int n = _ttoi(s);
This works for both modes with no extra effort: <tchar.h> maps _ttoi to atoi() in MBCS builds and to _wtoi() when _UNICODE is defined.
If you need to convert hexadecimal (or other-base) numbers, you can resort to _tcstol(), the TCHAR mapping of the more generic strtol():
CString s( _T("0xFA3") );
int n = _tcstol(s, nullptr, 16);
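One caveat: the atoi() family returns 0 on failure, which is indistinguishable from a parsed "0". A minimal sketch of validating the input via _tcstol()'s end pointer (the sample string here is made up):
CString s( _T("123abc") );
LPCTSTR begin = s;
LPTSTR end = nullptr;
long n = _tcstol(begin, &end, 10); // parses "123" and leaves end pointing at "abc"
if (end == begin)
{
    // no digits were consumed: the string did not start with a number
}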

There's a variant of CString that uses multibyte (narrow) characters even if your build is set to wide characters: CStringA. It will also convert from wide characters automatically.
CString s(_T("123"));
CStringA sa = s;
int n = atoi(sa);
There's a corresponding CStringW that only uses wide characters.
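For completeness, a minimal sketch of the mirror-image direction with CStringW and _wtoi(), the wide-character counterpart of atoi():
CStringA sa("123");
CStringW sw = sa; // narrow-to-wide conversion happens in the constructor
int n = _wtoi(sw);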

Related

How to convert an integer to a unicode character?

So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this:
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or in other words, how do I convert an int to the corresponding Unicode character?
EDIT: I am expecting the output to be the Unicode character for an integer value, for example:
cout << (wchar_t) 1570; // This should print the Unicode character for 1570 (which is: آ)
I am using Visual Studio 2013 Community with its default compiler, Windows 10 64-bit Pro
Cheers
L'آ' will work okay as a single wide character, because its code point is below 0x10000 and fits in one UTF-16 code unit. But in general UTF-16 includes surrogate pairs, so a Unicode code point cannot always be represented with a single wide character; you need a wide string instead.
Your problem is also partly to do with printing a UTF-16 character in the Windows console. If you use MessageBoxW to view the wide string, it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if (utf32 < 0x10000) // BMP code points fit in a single UTF-16 code unit
{
    str.push_back((wchar_t)utf32);
}
else // encode as a surrogate pair
{
    utf32 -= 0x10000;
    int hi = (utf32 >> 10) & mask; // upper 10 bits
    int lo = utf32 & mask;         // lower 10 bits
    hi += 0xD800;                  // high (lead) surrogate
    lo += 0xDC00;                  // low (trail) surrogate
    str.push_back((wchar_t)hi);
    str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF-16 in the Windows console.
The key here is setlocale(LC_ALL, "en_US.UTF-8"). en_US is the locale part of the string, which you may want to set to a different value, for example zh_CN for Chinese.
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    // This does not work without setlocale(LC_ALL, "en_US.UTF-8");
    for (int ch = 30000; ch < 30030; ch++) {
        wprintf(L"%lc", ch);
    }
    printf("\n");
    return 0;
}
Things to notice here are the use of wprintf and how the format string is given: L"%lc" tells wprintf to treat the string and the character as wide characters.
If you want to use this method to print some variables, use the type wchar_t.
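For instance, a minimal sketch printing a single wchar_t variable the same way:
wchar_t ch = (wchar_t)30000; // U+7530, one of the CJK characters from the loop above
wprintf(L"%lc\n", ch);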
Useful links:
setlocale
wprintf

Chinese character in source code when UTF-8 settings can't be used [duplicate]

This is the scenario:
I can only use the char* data type for the string, not wchar_t*.
My MS Visual C++ compiler has to be set to MBCS, not UNICODE, because the third-party source code that I have uses MBCS; setting it to UNICODE will cause data type issues.
I am trying to print Chinese characters on a printer, which needs to get a character string so it can print correctly.
What should I do with this line to make the code correct: char * str = "你好";
Convert it to a hex sequence, perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if (!ExtTextOutW(resource->dc, 1, 1, ETO_OPAQUE, NULL, wstr, convertedSize, NULL))
{
    return 0;
}
UPDATE: Let's put the question another way.
I have this: the char * str contains a sequence of UTF-8 code units for the two Chinese characters 你好, yet ExtTextOutW still does not render wstr correctly, because I think my call to mbstowcs_s is still not working correctly. Any idea why?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if (!ExtTextOutW(resource->dc, 1, 1, ETO_OPAQUE, NULL, wstr, len, NULL))
{
    return 0;
}
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
As an adaptation of what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
ps. This probably isn't C: cout << convertedSize;.
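On the UPDATE above: mbstowcs_s converts using the current locale's code page, not UTF-8, which is why the escaped UTF-8 bytes come out wrong. A minimal sketch of the same conversion done with MultiByteToWideChar instead (resource->dc is carried over from the question):
const char * str = "\xE4\xBD\xA0\xE5\xA5\xBD"; // "你好" as UTF-8 code units
int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0); // count includes the terminator
wchar_t * wstr = new wchar_t[wlen];
MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);
ExtTextOutW(resource->dc, 1, 1, ETO_OPAQUE, NULL, wstr, wlen - 1, NULL); // don't draw the null
delete[] wstr;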

How to convert ANSI byte to Unicode string?

I have a vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string. The current code assumes that the characters in the vector<BYTE> are Unicode rather than ASCII. This works fine for standard ASCII, but fails for extended ASCII characters. These characters need to be interpreted using the current code page retrieved via GetACP(). How would I go about creating a Unicode (UTF-16) string with these ASCII characters?
EDIT: I believe the solution should have something to do with the macros discussed here: http://msdn.microsoft.com/en-us/library/87zae4a3(v=vs.80).aspx I'm just not exactly sure how the actual implementation would go.
int ExtractByteArray(CATLString* pszResult, const CByteVector* pabData)
{
    // place the data into the output cstring
    pszResult->Empty();
    for (int iIndex = 0; iIndex < pabData->GetSize(); iIndex++)
        *pszResult += (TCHAR)pabData->GetAt(iIndex);
    return RC_SUCCESS;
}
You should use MultiByteToWideChar to convert that string to Unicode.
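A minimal sketch of that approach, assuming the bytes should be interpreted in the current ANSI code page (CP_ACP, the page GetACP() reports); AnsiBytesToUtf16 is a made-up helper name:
#include <windows.h>
#include <string>
#include <vector>

std::wstring AnsiBytesToUtf16(const std::vector<BYTE>& bytes)
{
    if (bytes.empty())
        return std::wstring();
    const char* src = reinterpret_cast<const char*>(bytes.data());
    // The first call computes the required length, the second call converts.
    int needed = MultiByteToWideChar(CP_ACP, 0, src, (int)bytes.size(), NULL, 0);
    std::wstring out(needed, L'\0');
    MultiByteToWideChar(CP_ACP, 0, src, (int)bytes.size(), &out[0], needed);
    return out;
}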
Since you're using MFC, let CString do the job.
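For instance, a short sketch of that route; CString's conversion constructor widens narrow text using the ANSI code page by default:
std::vector<BYTE> bytes = { 'H', 'i', 0xE9 }; // 0xE9 is 'é' in code page 1252
CStringA narrow(reinterpret_cast<const char*>(bytes.data()), (int)bytes.size());
CStringW wide(narrow); // narrow-to-wide conversion happens in this constructor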
I have a vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string
You should use std::vector<BYTE> only when you are working with binary data. While working with strings, use std::string instead. Note that this std::string object will contain special characters that are encoded by sequences of one or more bytes (thus called multi-byte characters), but these are not ASCII characters.
Once you use std::string, you can use MultiByteToWideChar to create your own function that will convert a std::string (which contains multi-byte UTF-8 characters) into a std::wstring containing UTF-16 encoded code points:
// multi byte to wide char:
std::wstring s2ws(const std::string& str)
{
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
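A usage sketch; the literal below is the UTF-8 encoding of 你好:
std::wstring w = s2ws("\xE4\xBD\xA0\xE5\xA5\xBD");
MessageBoxW(NULL, w.c_str(), L"s2ws demo", MB_OK);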

C++ string encoding UTF8 / unicode

I am trying to send the character "Т" (not a normal capital T; Unicode decimal value 1058) from C++ to VB.
However, with the method below, Message is returned to VB and appears as "Ð¢", which is the above character's UTF-8 bytes displayed as ANSI.
#if defined(_MSC_VER) && _MSC_VER > 1310
#   define utf8(str) ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
#   define utf8(str) str
#endif
BSTR _stdcall chatTest()
{
    BSTR Message;
    CString temp("temp test");
    temp += utf8("\u0422");
    int len = temp.GetLength();
    Message = SysAllocStringByteLen((LPCTSTR)temp, len + 1);
    return Message;
}
If I just do temp += "\u0422"; without the utf8 function, it sends the data as "?", and it is actually a question mark (sometimes Unicode characters show up as question marks in VB but still have the correct Unicode decimal value; that is not the case here: it really changes it to a question mark).
In VB, if I output the String variable holding Message's data to a text file while it is "Ð¢", it appears in the file as "Т".
So as far as I can tell it's in UTF-8 in C++, then somehow gets converted to ANSI in VB (or before it's sent?), and then when output to a file it's changed back to UTF-8?
I just need to keep the "Т" intact when sending from C++ to VB. I know VB strings can hold that character because from another source within VB I am able to store it (it appears as a "?", but has the proper Unicode decimal value).
Any help is greatly appreciated.
Thanks
A BSTR is not UTF-8; it's UTF-16, which is what you get with the L"" prefix. Take out the UTF-8 conversion and use CStringW. And use LPCWSTR instead of LPCTSTR.
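A minimal sketch of what that correction might look like (not verbatim from the answer):
BSTR _stdcall chatTest()
{
    CStringW temp(L"temp test");
    temp += L"\u0422"; // U+0422, Cyrillic capital Te, stays UTF-16 throughout
    // SysAllocStringLen takes a length in wide characters, not bytes
    return SysAllocStringLen(temp, temp.GetLength());
}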

Recognizing a Tamil string and processing it using C or C++, and the use of Unicode

The input is given in a language whose script is not the Roman alphabet. A program in C or C++ must recognize it.
How do I take input in Tamil and split it into letters so that I can recognize each Tamil character?
How do I use wchar_t and locale?
The C++ standard libraries do not handle Unicode completely, and neither does C; you'd be better off using a library like Boost, which is cross-platform.
Including windows.h and using the WinAPI allows you to use Unicode, but only in Win32 programs.
See here for a previous rant of mine on this subject.
Assuming that your platform is capable of handling Tamil characters, I suggest the following sequence of events:
I. Get the input string into a wide string:
#include <clocale>
#include <cstdlib>

int main()
{
    setlocale(LC_CTYPE, "");
    const char * s = getInputString(); // e.g. from the command line
    const size_t wl = mbstowcs(NULL, s, 0); // length in wide chars, excluding the terminator
    wchar_t * ws = new wchar_t[wl + 1];     // + 1 for the terminating null
    mbstowcs(ws, s, wl + 1);
    //...
II. Convert the wide string into a string with definite encoding:
#include <iconv.h>
// ...
iconv_t cd = iconv_open("UTF-32", "WCHAR_T"); // "UTF-32" may emit a leading BOM; "UTF-32LE" avoids it
size_t iin = (wl + 1) * sizeof(wchar_t); // iconv counts bytes, not characters
size_t iout = 4 * (wl + 1);
uint32_t * us = new uint32_t[wl + 1];
char * pin = reinterpret_cast<char*>(ws);
char * pout = reinterpret_cast<char*>(us);
iconv(cd, &pin, &iin, &pout, &iout); // iconv takes char** and advances the pointers
iconv_close(cd);
// ...
Finally, us holds an array of Unicode code points that make up your input text. You can now process this array, e.g. by looking each code point up in a list and checking whether it comes from the Tamil script, and do with it whatever you see fit.
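For instance, a minimal sketch of that lookup; the Tamil Unicode block spans U+0B80 through U+0BFF:
#include <cstdint>

// Returns true if the code point lies in the Tamil block (U+0B80..U+0BFF).
bool isTamil(uint32_t codepoint)
{
    return codepoint >= 0x0B80 && codepoint <= 0x0BFF;
}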