converting array of bytes to UTF-8 unicode

converting array of bytes to UTF-8 unicode - c++

I have a file saved as UTF-8, and I'm reading it like this:
ReadFile(hFile, pContents, pFile->nFileSize, &dwRead, NULL);
(pContents is a BYTE* of size nFileSize)
It's just a small file of 100 bytes or so. It contains text that I want to read into memory in wchar_t* format, so I can set the text of edit and static controls with the Unicode text.
How can I convert the bytes to UTF-8?
Edit: I don't want to use fstream or wfstream.

Use MultiByteToWideChar to convert from UTF-8 to UTF-16 (wchar_t).
Use WideCharToMultiByte to convert from UTF-16 to UTF-8.

If the file is in UTF-8 and you read it into an array, then it is still in UTF-8 format and you don't need to do anything.

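// The sixth argument is the capacity of the output buffer (in bytes for the first call,
// in wchar_t units for the second), which is not necessarily the input string length.
int res2 = WideCharToMultiByte(CP_UTF8, 0, tempBuf.c_str(), -1,
multiByteBuf, multiByteBufSize, NULL, NULL);
int res = MultiByteToWideChar(CP_UTF8, 0, buf, -1, wcharBuf, wcharBufSize);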
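
To make concrete what a UTF-8 to UTF-16 conversion involves, here is a portable sketch that decodes a byte buffer by hand. It is only an illustration, not how MultiByteToWideChar is implemented: the name Utf8ToUtf16Sketch is hypothetical, std::u16string stands in for the wchar_t buffer, malformed input is replaced with U+FFFD, and overlong encodings are not rejected. In a real Win32 program the API call above is the better choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Win32 API): decodes UTF-8 bytes into UTF-16 code units.
// Simplifications: malformed input becomes U+FFFD; overlong forms are not rejected.
std::u16string Utf8ToUtf16Sketch(const unsigned char* bytes, std::size_t size)
{
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < size) {
        std::uint32_t cp = bytes[i];
        std::size_t extra = 0;                     // continuation bytes expected
        if (cp < 0x80)                { extra = 0; }
        else if ((cp & 0xE0) == 0xC0) { cp &= 0x1F; extra = 1; }
        else if ((cp & 0xF0) == 0xE0) { cp &= 0x0F; extra = 2; }
        else if ((cp & 0xF8) == 0xF0) { cp &= 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; }   // invalid lead byte

        if (i + extra >= size) {                   // sequence cut off by end of buffer
            out.push_back(kReplacement);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = bytes[i + k];
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                                   // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Called as Utf8ToUtf16Sketch(pContents, dwRead), it would return the text as UTF-16 code units; note that the number of output units is generally not the same as the number of input bytes.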

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it is in UCS-2 LE, but I want it to be in UTF-8. I believe ToUnicodeEx() returns UCS-2 LE, and it only works with wide chars. Is there any way to do this with either fstream or wfstream by converting to UTF-8 first? Thanks!

You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);

The Windows API generally refers to UTF-16 as "Unicode", which is a little confusing. This means most Unicode Win32 functions take or return UTF-16 strings.
So ToUnicodeEx returns a UTF-16 string.
If you need it as UTF-8, you'll need to convert it using WideCharToMultiByte.
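
As an illustration of what the UTF-16 to UTF-8 step does, here is a portable sketch that encodes UTF-16 code units by hand. It is not the Win32 implementation: the name Utf16ToUtf8Sketch is hypothetical, std::u16string stands in for the wchar_t buffer, and unpaired surrogates are replaced with U+FFFD as a simplifying assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Win32 API): encodes UTF-16 code units as UTF-8 bytes.
std::string Utf16ToUtf8Sketch(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                           // unpaired surrogate
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The surrogate branch is the reason a single typed character can occupy two UTF-16 code units, and the length-dependent branches are why the UTF-8 result has to be sized separately from the input.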

Thank you for all the help. I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8.
This function converts wide char arrays to a UTF-8 string:
// Takes a pointer to a wide char array and the number of wchar_t units in it
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
    // First call: ask how many bytes the UTF-8 output needs
    int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
    if (nChars == 0)
    {
        return std::string();
    }
    std::string newBuffer;
    newBuffer.resize(nChars);
    // Second call: write directly into the string's storage (no const_cast needed)
    WideCharToMultiByte(CP_UTF8, 0, buffer, len, &newBuffer[0], nChars, NULL, NULL);
    return newBuffer;
}

Reading csv file having japanese text.(C++)

A CSV file contains Japanese text.
When I open it in Notepad, it reports the encoding as UTF-8.
I read on Stack Overflow that for UTF-8 you should first read the file as a single-byte stream and then convert it to a wstring.
I am using the code below to convert the string to a wstring.
wstring stow(const std::string& str){
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo( size_needed, 0 );
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
But I still get junk in the returned wstring for the Japanese text.
Note:
I can only use streams to read the CSV.
No static memory allocation is allowed.
How can I read the Japanese text successfully?

You are missing a check for a UTF-8 BOM at the start of the string. If one is present, skip it.
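
A minimal sketch of that BOM check, assuming the file content has already been read into a std::string. The helper name SkipUtf8Bom is hypothetical. The UTF-8 BOM is the three bytes EF BB BF; if it is left in place, the conversion turns it into a U+FEFF character at the start of the wide string.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `s` without a leading UTF-8 BOM (EF BB BF).
std::string SkipUtf8Bom(const std::string& s)
{
    if (s.size() >= 3 && s.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        return s.substr(3);
    }
    return s;
}
```

It would be applied before the conversion, for example stow(SkipUtf8Bom(fileContents)), where fileContents is an assumed variable holding the raw bytes.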

It works when CP_ACP is used as the code page instead.

LPWSTR to wstring c++

I would like to read UTF-8 text from a .dll string table,
something like this:
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
After that I would like to convert the LPWSTR nnW to a std::wstring nnWstring.
I tried it this way:
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
const int length = MultiByteToWideChar(CP_UTF8,
0, // no flags required
(LPCSTR)nnW,
-1, // automatically determine length
NULL,
0);
std::wstring nnWstring(length, L'\0');
if (!MultiByteToWideChar(CP_UTF8,
0,
(LPCSTR)nnW,
-1,
&nnWstring[0],
length))
MessageBoxW(NULL, (LPCWSTR)nnWstring.c_str(), L"wstring", MB_OK | MB_ICONERROR);
After that, the MessageBoxW shows only the first letter.

No conversion or copying needed.
std::wstring nnWString(MAX_PATH, 0);
nnWString.resize(LoadStringW(hMod, id, &nnWString[0], (int)nnWString.size()));
Note: Your original code causes undefined behavior, because it writes using an uninitialized pointer. Surely not what you wanted.
Here's another variation:
http://msmvps.com/blogs/gdicanio/archive/2010/01/05/stl-strings-loading-from-resources.aspx

"I would like to read utf-8 text from a .dll string table. something like this"
Generally, string tables in Windows are UTF-16. You're trying to put UTF-8 data into one. The UTF-8 data is being treated like "extended" ASCII, so each byte is being expanded to two bytes with zero bytes between them.
You should probably put UTF-16 data in the string table directly.
If you must store UTF-8 data in the resources, you can put it into an RCDATA resource and use the lower-level resource functions to get the data out. Then you'll have to convert from UTF-8 to UTF-16 to store it in a wstring.
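
The "only the first letter" symptom in the question can be reproduced without any Windows calls. The sketch below lays out UTF-16 text as little-endian bytes, the way wchar_t text sits in memory on Windows, and then measures it the way a narrow-string function does when handed that memory through a cast such as (LPCSTR)nnW. Both helper names are hypothetical, and the example assumes plain ASCII text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: UTF-16 code units as little-endian bytes plus a 2-byte terminator.
std::vector<unsigned char> ToUtf16LeBytes(const std::u16string& s)
{
    std::vector<unsigned char> bytes;
    for (char16_t unit : s) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));   // low byte first
        bytes.push_back(static_cast<unsigned char>(unit >> 8));     // then high byte
    }
    bytes.push_back(0);
    bytes.push_back(0);
    return bytes;
}

// Hypothetical helper: the length a narrow (char*) string function would see,
// i.e. the number of bytes before the first zero byte.
std::size_t NarrowLength(const std::vector<unsigned char>& bytes)
{
    std::size_t n = 0;
    while (n < bytes.size() && bytes[n] != 0) {
        ++n;
    }
    return n;
}
```

For ASCII text every second byte is zero, so NarrowLength(ToUtf16LeBytes(u"Hello")) is 1: a narrow-string reader stops right after the first letter.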

How to output utf8 encoded characters normally in c/c++ console application?

Here's what I'm getting now by wprintf:
1胩?鳧?1敬爄汯?瑳瑡獵猆慴畴??
Is UTF-8 just not supported by Windows?

No, Windows doesn't support printing UTF-8 to the console.
When Windows says "Unicode", it means UTF-16. You need to use MultiByteToWideChar to convert from UTF-8 to UTF-16. Something like this:
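const char* text = "My UTF-8 text\n";
// First call returns the required size in wchar_t units, including the terminator
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, NULL, 0);
wchar_t* unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
wprintf(L"%ls", unicode_text);
delete[] unicode_text;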

wprintf is supposed to receive a UTF-16 encoded string.
Use MultiByteToWideChar with the CP_UTF8 code page to do the conversion (and don't blindly cast from char* to wchar_t*).
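
To show why a blind cast from char* to wchar_t* produces garbage, the portable sketch below mimics the cast on a little-endian machine by pairing up bytes into 16-bit units. The helper name is hypothetical. The garbled wprintf output in the question appears consistent with this kind of reinterpretation of ASCII bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: pairs up bytes into little-endian 16-bit units, which is
// what reading a char buffer through a wchar_t* effectively does on Windows.
// A trailing odd byte is ignored in this sketch.
std::u16string ReinterpretBytesAsUtf16Le(const std::string& bytes)
{
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}
```

For the six ASCII bytes of "status" this yields the three units 0x7473, 0x7461 and 0x7375, which fall in the CJK range instead of spelling the word.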

Wrong reading file in UNICODE (fread) on C++

I'm trying to load the content of a file saved on disk into a string. The file is .CS code created in Visual Studio, so I suppose it's saved in UTF-8 encoding. I'm doing this:
FILE *fConnect = _wfopen(connectFilePath, _T("r,ccs=UTF-8"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPTSTR lpContent = (LPTSTR)malloc(sizeof(TCHAR) * lSize + 1);
fread(lpContent, sizeof(TCHAR), lSize, fConnect);
But the result is strange: the first half of the string is the content of the .CS file, and then strange symbols like 췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍 appear.
So I think I am reading the content the wrong way. How do I do this properly?
Thank you so much!

ftell(), fseek(), and fread() all operate on bytes, not on characters. In a Unicode environment, TCHAR is at least 2 bytes, so you are allocating and reading twice as much memory as you should be.
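
The gap between bytes and characters is easy to see in UTF-8 itself. The sketch below counts code points in a well-formed UTF-8 string by skipping continuation bytes. The helper name is hypothetical and no validation is done.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t CountUtf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

A string holding "a" followed by one Japanese character encoded as three bytes has size() == 4 but only 2 code points, so a byte count from ftell() is an upper bound on the character count, not the count itself.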
The "ccs" flag is a Microsoft-specific extension to fopen() and _wfopen(), not standard C. It is simpler to use "rb" as the reading mode, read the raw bytes into memory, and then decode them once you have them all available, i.e.:
FILE *fConnect = _wfopen(connectFilePath, L"rb");
if (!fConnect)
    return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPBYTE lpContent = (LPBYTE) malloc(lSize);
fread(lpContent, 1, lSize, fConnect);
fclose(fConnect); // close the FILE*, not the buffer
.. decode lpContent as needed ...
free(lpContent);

Does the string contain all the contents of the cs file and then additional funny characters? Probably it's just not correctly null-terminated since fread will not automatically do that. You need to set the character following the string content to zero:
lpContent[lSize] = 0;
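
One way to sidestep manual termination is to read the bytes into a std::string, whose c_str() is always null-terminated. The sketch below uses only the standard C file functions. The name ReadAllBytes is hypothetical, and it takes a narrow path for portability, whereas the question uses _wfopen with a wide path.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: reads a whole file as raw bytes into `out`.
// Returns false if the file cannot be opened or fully read.
bool ReadAllBytes(const char* path, std::string& out)
{
    std::FILE* f = std::fopen(path, "rb");        // binary mode: no translation
    if (!f) {
        return false;
    }
    bool ok = false;
    if (std::fseek(f, 0, SEEK_END) == 0) {
        long size = std::ftell(f);                // size in bytes, not characters
        if (size >= 0 && std::fseek(f, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            std::size_t got = 0;
            if (size > 0) {
                got = std::fread(&out[0], 1, out.size(), f);
            }
            out.resize(got);                      // keep only what was actually read
            ok = (got == static_cast<std::size_t>(size));
        }
    }
    std::fclose(f);
    return ok;
}
```

The resulting string can then be passed to a UTF-8 to UTF-16 conversion using its size(), so no terminator has to be written by hand.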

For the ".. decode lpContent as needed ..." step, this s2ws function converts a string to a wstring:
std::wstring s2ws(const std::string& str)
{
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
Add a null terminator at the end of the buffer (allocate lSize + 1 bytes so that no file content is overwritten):
lpContent[lSize] = 0;
Then initialize a wstring from the buffer content:
std::wstring replyStr = s2ws((char*)lpContent);
