Visual Studio multibyte chars to single bytes - c++

Is there a simple way to convert multibyte UTF8 data (from Google Contacts API via https://www.google.com/m8/feeds/) to single bytes? I know the extended ASCII set is non-standard but, for example, my program which will display the info in an MFC CListBox is quite happy to show 'E acute' as 0xE9. I only need it to cope with a few similar European symbols. I've discovered I can convert everything with MultiByteToWideChar() but don't want to have to change lots of functions to accept wide characters if possible.
Thanks.

If you need to convert char * from UTF8 to ANSI, try the following function:
// change encoding from UTF8 to ANSI
char* change_encoding_from_UTF8_to_ANSI(char* szU8)
{
int wcsLen = ::MultiByteToWideChar(CP_UTF8, NULL, szU8, strlen(szU8), NULL, 0);
wchar_t* wszString = new wchar_t[wcsLen + 1];
::MultiByteToWideChar(CP_UTF8, NULL, szU8, strlen(szU8), wszString, wcsLen);
wszString[wcsLen] = '\0';
int ansiLen = ::WideCharToMultiByte(CP_ACP, NULL, wszString, wcslen(wszString), NULL, 0, NULL, NULL);
char* szAnsi = new char[ansiLen + 1];
::WideCharToMultiByte(CP_ACP, NULL, wszString, wcslen(wszString), szAnsi, ansiLen, NULL, NULL);
szAnsi[ansiLen] = '\0';
delete []wszString;
return szAnsi;
}

Utf8 has a 1-to-1 mapping with Ascii characters so if you are receiving Ascii characters as utf8 ones, AFAIK you can directly read them as Ascii. If you have non-Ascii chars then there's no way you can express them in Ascii (any byte > 0x80)

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it's in UCS-2 LE, when I want it to be in UTF-8 format. I believe ToUnicodeEx() is returning it in UCS-2 LE format, it also only works with wide chars. Is there any way to do this using either fstream or wfstream by somehow converting into UTF-8 first? Thanks!
You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);
Windows API generally refers to UTF-16 as unicode which is a little confusing. This means most unicode Win32 function calls operate on or give utf-16 strings.
So ToUnicodeEx returns a utf-16 string.
If you need this as utf 8 you'll need to convert it using WideCharToMultiByte
Thank you for all the help, I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return u8"";
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, const_cast<char*>(newBuffer.c_str()), nChars, NULL, NULL);
return newBuffer;
}

Chinese character in source code when UTF-8 settings can't be used [duplicate]

This question already has an answer here:
PHP and C++ for UTF-8 code unit in reverse order in Chinese character
(1 answer)
Closed 9 years ago.
This is the scenario:
I can only use the char* data type for the string, not wchar_t *
My MS Visual C++ compiler has to be set to MBCS, not UNICODE because the third party source code that I have is using MBCS; Setting it to UNICODE will cause data type issues.
I am trying to print chinese characters on a printer which needs to get a character string so it can print correctly
What should I do with this line to make the code correct: char * str = "你好";
Convert it to hex sequence perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , convertedSize, NULL))
{
return 0;
}
UPDATE : Let's put the question in another way
I have this, the char * str contain sequence of UTF-8 code units, for the 2 chinese character 你好 , the ExtTextOutW still cannot execute the wstr correctly, because I think the my code for mbstowcs_s could still not working correctly. Any idea why ?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , len, NULL))
{
return 0;
}
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
As an adaptation from what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
ps. This probably isn't C: cout << convertedSize;.

How to convert ANSI byte to Unicode string?

I have an vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string. The current code assumes that the characters in the vector<BYTE> are Unicode rather than ASCII. This works fine for standard ASCII, but fails for extended ASCII characters. These characters need to be interpreted using the current code page retrieved via GetACP(). How would I go about creating a Unicode (UTF-16) string with these ASCII characters?
EDIT: I believe the solution should have something to do with the macros discussed here: http://msdn.microsoft.com/en-us/library/87zae4a3(v=vs.80).aspx I'm just not exactly sure how the actual implementation would go.
int ExtractByteArray(CATLString* pszResult, const CByteVector* pabData)
{
// place the data into the output cstring
pszResult->Empty();
for(int iIndex = 0; iIndex < pabData->GetSize(); iIndex++)
*pszResult += (TCHAR)pabData->GetAt(iIndex);
return RC_SUCCESS;
}
You should use MultibyteToWideChar to convert that string to unicode
Since you're using MFC, let CString do the job.
I have a vector<BYTE> that represents characters in a string. I want to interpret those characters as ASCII characters and store them in a Unicode (UTF-16) string
You should use std::vector<BYTE> only when you are working with binary data. While working with strings use std::string instead. Note that this std::string object will contain special characters that will be encoded by sequences of one or more bytes (thus called multi-byte characters), but these are not ASCII characters.
Once you use std::string, you can use MultiByteToWideChar to create own function that will convert a std::string (which contains multi-byte UTF-8 characters) into std::wstring containing UTF-16 encoded points:
// multi byte to wide char:
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}

MultiByteToWideChar or WideCharToMultiByte and txt files

I'm trying to write a universal text editor which can open and display ANSI and Unicode in EditControl. Do I need to repeatedly call ReadFile() if I determine that the text is ANSI? Can't figure out how to perform this task. My attempt below does not work, it displays '?' characters in EditControl.
LARGE_INTEGER fSize;
GetFileSizeEx(hFile,&fSize);
int bufferLen = fSize.QuadPart/sizeof(TCHAR)+1;
TCHAR* buffer = new TCHAR[bufferLen];
buffer[0] = _T('\0');
DWORD wasRead = 0;
ReadFile(hFile,buffer,fSize.QuadPart,&wasRead,NULL);
buffer[wasRead/sizeof(TCHAR)] = _T('\0');
if(!IsTextUnicode(buffer,bufferLen,NULL))
{
CHAR* ansiBuffer = new CHAR[bufferLen];
ansiBuffer[0] = '\0';
WideCharToMultiByte(CP_ACP,0,buffer,bufferLen,ansiBuffer,bufferLen,NULL,NULL);
SetWindowTextA(edit,ansiBuffer);
delete[]ansiBuffer;
}
else
SetWindowText(edit,buffer);
CloseHandle(hFile);
delete[]buffer;
There are a few buffer length errors and oddities, but here's your big problem. You call WideCharToMultiByte incorrectly. That is meant to receive UTF-16 encoded text as input. But when IsTextUnicode returns false that means that the buffer is not UTF-16 encoded.
The following is basically what you need:
if(!IsTextUnicode(buffer,bufferLen*sizeof(TCHAR),NULL))
SetWindowTextA(edit,(char*)buffer);
Note that I've fixed the length parameter to IsTextUnicode.
For what it is worth, I think I'd read in to a buffer of char. That would remove the need for the sizeof(TCHAR). In fact I'd stop using TCHAR altogether. This program should be Unicode all the way - TCHAR is what you use when you compile for both NT and 9x variants of Windows. You aren't compiling for 9x anymore I imagine.
So I'd probably code it like this:
char* buffer = new char[filesize+2];//+2 for UTF-16 null terminator
DWORD wasRead = 0;
ReadFile(hFile, buffer, filesize, &wasRead, NULL);
//add error checking for ReadFile, including that wasRead == filesize
buffer[filesize] = '\0';
buffer[filesize+1] = '\0';
if (IsTextUnicode(buffer, filesize, NULL))
SetWindowText(edit, (wchar_t*)buffer);
else
SetWindowTextA(edit, buffer);
delete[] buffer;
Note also that this code makes no allowance for the possibility of receiving UTF-8 encoded text. If you want to handle that you'd need to take your char buffer and send to through MultiByteToWideChar using CP_UTF8.

How to find if a character belongs to a particular codepage using c++ or calling winapi

How can we find if a character belongs to a particular codepage?
or How can we determine whether a charcter fits into currently active IME for an application.
First, Convert your UTF-8 string of characters to UTF-16 using MultiByteToWideChar
Now, reverse the process using WideCharToMultiByte passing the desired codepage as the first parameter.
Use the WC_ERR_INVALID_CHARS flag and WideCharToMultiByte will fail outright if any invalid characters are used. If you want to know which characters are not represented in the target codepage, use the lpDefaultChar, and lpUsedDefaultChar parameters.
LPCWSTR pszUtf16; // converted from utf8 source character
UINT nTargetCP = CP_ACP;
BOOL fBadCharacter = FALSE;
if(WideCharToMultiByte(nTargetCP,WC_NO_BEST_FIT_CHARS,pszUtf16,NULL,0,NULL,&fBadCharacter)
{
if(fBadCharacter)
{
// at least one character in the string was not represented in nTargetCP
}
}
The two previous answers have correctly suggested using MultiByteToWideChar then WideCharToMultiByte to translate your UTF-8 character to UTF-16, then to the current Windows codepage (CP_ACP). Check the result of WideCharToMultiByte to see if the conversion was successful.
What wasn't clear from the original question, is that you are having a particular issue with Hindi. For this language, your question is meaningless because there is no Windows ANSI codepage for Hindi, as Chris Becke pointed out. Therefore, you can never convert a Hindi character to CP_ACP, and WideCharToMultiByte will always fail.
To use Hindi on Windows, as far as I understand it, you must be a Unicode app that calls Unicode APIs.
Using the windows functions WideCharToMultiByte and MultiByteToWideChar you can convert between UTF-8 and 16-bit Unicode characters. The functions have arguments to specify the code page and to specify the behavior if an invalid character is encountered.
Thanks Chris..I am running the following code
#define CP_HINDI 0
#define CP_JAPANESE 932
#define CP_ENGLISH 1252
wchar_t wcsStringJapanese = 'あ';
wchar_t wcsStringHindi = 'र';
wchar_t wcsStringEnglish = 'A';
int main()
{
BOOL usedDefaultCharacter = FALSE;
/* Test for ENGLISH */
WideCharToMultiByte( CP_ENGLISH,
0, &wcsStringEnglish,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for English? %d \n",usedDefaultCharacter);
usedDefaultCharacter = FALSE;
/*TEST FOR JAPANESE */
WideCharToMultiByte( CP_JAPANESE,
0,
&wcsStringJapanese,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for Japanese? %d \n",usedDefaultCharacter);
//TEST FOR HINDI
usedDefaultCharacter = FALSE;
WideCharToMultiByte( CP_HINDI,
0,
&wcsStringHindi,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for Hindi? %d \n",usedDefaultCharacter);
}
The above code returns:
usedDefaultCharacters for English? 0
usedDefaultCharacters for Japanese? 0
usedDefaultCharacters for Hindi? 1
The third line is incorrect as the Codepage for Hindi is 0 , and the string passed consists of Hindi Character and still the usedDefaultChar is set to 1 .. which should not be the case.