Called ReadFile on a text file, got weird (Japanese?) characters - c++

I use the following code to read the entire contents of a file, given a working handle hFile and the file's size sizeOfFile obtained with GetFileSize(hFile, NULL):
_TCHAR* text = (_TCHAR*)malloc(sizeOfFile * sizeof(_TCHAR));
DWORD numRead = 0;
BOOL didntFail = ReadFile(hFile, text, sizeOfFile, &numRead, NULL);
After the operation, text contains something strange in Japanese (or some such), not the contents of the file.
What did I do wrong?
Edit:
I understand it's an encoding problem, but then how do I convert text to an LPCWSTR so I can use functions like WriteConsoleOutputCharacter?

Modern IDEs default to Unicode applications, meaning _TCHAR is actually wchar_t. ReadFile() works with raw bytes, and if you use it to fill a _TCHAR array directly, you'll get 8-bit characters interpreted as UTF-16. These usually show up as CJK (Chinese/Japanese/Korean) glyphs.
You have three options:
- convert your program to non-Unicode,
- use a file containing Unicode text (in UTF-16 encoding), or
- read from the file into a char array and then use MultiByteToWideChar() to convert the text to Unicode (see the sketch at the end of this answer).
If you mix Unicode and non-Unicode be careful to calculate the correct buffer sizes (number of bytes vs. number of characters).
Note that you can still use narrow chars with Windows in your Unicode program if you call the ANSI version of the Windows function (e.g. WriteConsoleOutputCharacterA).
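A minimal sketch of the third option, reusing the question's hFile and sizeOfFile and assuming the file contains ANSI text in the active code page (pass CP_UTF8 instead of CP_ACP if the file is UTF-8):
char* bytes = (char*)malloc(sizeOfFile + 1);
DWORD numRead = 0;
ReadFile(hFile, bytes, sizeOfFile, &numRead, NULL);
bytes[numRead] = '\0'; // ReadFile does not null-terminate

// Ask for the required length, then convert to UTF-16 so wide APIs
// such as WriteConsoleOutputCharacterW can consume the text.
int wideLen = MultiByteToWideChar(CP_ACP, 0, bytes, (int)numRead, NULL, 0);
wchar_t* wide = (wchar_t*)malloc((wideLen + 1) * sizeof(wchar_t));
MultiByteToWideChar(CP_ACP, 0, bytes, (int)numRead, wide, wideLen);
wide[wideLen] = L'\0'; // wide can now be passed as an LPCWSTR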

You got the type of the string wrong. Text from a file that was encoded in an 8-bit encoding will look like Chinese when you look at it through a character type, like TCHAR with UNICODE defined, that uses a 16-bit encoding. Fix:
char* text = (char*)malloc(...);
You normally do have to fret a lot more about the encoding that was used to write the text; it could be UTF-8, for example. You can convert from the 8-bit encoding to a TCHAR (wchar_t, really) with MultiByteToWideChar(). Its first argument, the source code page, is the one to fret about.

You have read an ANSI or UTF-8 text file into a UTF-16 string.

char AnsiBuff[1024];   // raw 8-bit bytes from the file
wchar_t ReadBuff[1024];
memset(AnsiBuff, 0, sizeof(AnsiBuff));
HANDLE hFile = CreateFile(szPathFileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
DWORD NumberOfBytesRead = 0;
ReadFile(hFile, AnsiBuff, sizeof(AnsiBuff) - 1, &NumberOfBytesRead, NULL);
wsprintf(ReadBuff, L"%S", AnsiBuff); // %S converts the narrow ANSI string to wide
ReadBuff is now in readable form.

Related

Convert Japanese wstring to std::string

Can anyone suggest a good method to convert a Japanese std::wstring to std::string?
I used the code below; Japanese strings are not converted properly on an English OS.
std::string WstringTostring(std::wstring str)
{
    size_t size = 0;
    _locale_t lc = _create_locale(LC_ALL, "ja.JP.utf8");
    errno_t err = _wcstombs_s_l(&size, NULL, 0, &str[0], _TRUNCATE, lc);
    std::string ret = std::string(size, 0);
    err = _wcstombs_s_l(&size, &ret[0], size, &str[0], _TRUNCATE, lc);
    _free_locale(lc);
    ret.resize(size - 1);
    return ret;
}
The wstring is "C\\files\\ブ種別.pdf".
The converted string is "C:\\files\\ブ種別.pdf".
It actually looks right to me.
That is the UTF-8-encoded version of your input (which presumably was UTF-16 before conversion), but shown in its ASCII-decoded form due to a mistake somewhere in your toolchain.
You just need to calibrate your file/terminal/display to render text output as if it were UTF-8 (which it is).
Also, remember that std::string is just a container of bytes, and does not inherently specify or imply any particular encoding. So your question is rather "how can I convert UTF-16 (containing Japanese characters) into UTF-8 in Windows" or, as it turns out, "how do I configure my terminal to display UTF-8?".
If your display for this string is the Visual Studio locals window (which you suggest is the case with your comment "I observed the value of the "ret" string in local window while debugging") you are out of luck, because VS has no idea what encoding your string is in (nor does it attempt to find out).
For other aspects of Visual Studio, though, such as the console output window, there are various approaches to work around this; one is sketched below.
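For the console specifically, a minimal sketch (assuming the literal's bytes really are UTF-8, e.g. a UTF-8 source file compiled with MSVC's /utf-8 switch, and a console font that has the glyphs):

#include <windows.h>
#include <cstdio>

int main()
{
    SetConsoleOutputCP(CP_UTF8);            // tell the console to decode output as UTF-8
    std::printf("C:\\files\\ブ種別.pdf\n"); // the UTF-8 bytes now render as Japanese text
    return 0;
}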
EDIT: some things first. Windows has the notion of the ANSI codepage. It's the default codepage of non-Unicode strings that Windows assumes. Every program that uses non-Unicode versions of Windows API, and doesn't specify the codepage explicitly, uses the ANSI codepage.
The ANSI codepage is driven by the "System default locale" setting in Control Panel. As of Windows 10 May 2020, it's under Region/Administrative/Change system locale. It takes admin rights to change that setting.
By default, Windows with the system default locale set to English uses codepage 1252 as the ANSI codepage. That codepage doesn't contain the Japanese characters. So using Japanese in Unicode unaware programs in that situation is hard or impossible.
It looks like the OP wants or has to use a piece of third-party C++ code that uses multibyte strings (std::string and/or char*). That doesn't necessarily mean the code is Unicode-unaware, but it might be. Whether what the OP is trying to do is possible depends entirely on how that third-party library is coded; it might not be possible at all.
It looks like the problem is that some piece of third-party code expects a file name in ANSI and uses ANSI functions to open the file. On an English system with the default system locale, Japanese can't be converted to ANSI, because the ANSI codepage (CP1252 in practice) doesn't contain the Japanese characters.
What I think you should do is get a short (8.3) file name instead using GetShortPathNameW, convert that path to ANSI, and pass that string. Like this:
std::string WstringFilenameTostring(std::wstring str)
{
    wchar_t ShortPath[MAX_PATH + 1];
    DWORD dw = GetShortPathNameW(str.c_str(), ShortPath, _countof(ShortPath));
    char AnsiPath[MAX_PATH + 1];
    int n = WideCharToMultiByte(CP_ACP, 0, ShortPath, -1, AnsiPath, _countof(AnsiPath), 0, 0);
    return std::string(AnsiPath);
}
This code is for filenames only. For any other Japanese string, it will return nonsense. In my test, it converted "日本語.txt" to something unreadable but alphanumeric :)
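Hypothetical usage (the path is made up; this also relies on 8.3 short-name generation being enabled on the volume):

// The 8.3 short name contains only ASCII characters, so it survives
// the CP_ACP round trip even on an English system.
std::string ansi = WstringFilenameTostring(L"C:\\files\\日本語.txt");
FILE* f = fopen(ansi.c_str(), "rb");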

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

Why do some UTF-16 encoded wide strings, when converted to UTF-8 encoded narrow strings using this commonly found conversion function, produce hex values that don't appear to be correct?
std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}
Hello. I have a C++ app on Windows which takes some user input on the command line. I'm using the wide character main entry point to get the input as a utf16 string which I'm converting to a utf8 narrow string using the above function.
This function can be found in many places online and works in almost all cases. I have however found a few examples where it doesn't seem to work as expected.
For example, if I input an emoji character "🤢" as a string literal (in my UTF-8 encoded cpp file) and write it to disk, the file (FILE-1) contains the following data (which are the correct UTF-8 hex values specified here: https://www.fileformat.info/info/unicode/char/1f922/index.htm):
0xF0 0x9F 0xA4 0xA2
However, if I pass the emoji to my application on the command line and convert it to a UTF-8 string using the conversion function above and then write it to disk, the file (FILE-2) contains different raw bytes:
0xED 0xA0 0xBE 0xED 0xB4 0xA2
While the second file seems to indicate the conversion has produced the wrong output, if you copy and paste the hex values (in Notepad++ at least) it produces the correct emoji. Also, WinMerge considers the two files to be identical.
So, to conclude, I would really like to know the following:
- how the incorrect-looking converted hex values map correctly to the right UTF-8 character in the example above
- why the conversion function converts some characters to this form, while almost all other characters produce the expected raw bytes
- as a bonus, whether it is possible to modify the conversion function to stop it from outputting these rare characters in this form
I should note that I already have a workaround function below which uses WinAPI calls; however, using standard library calls only is the dream :)
std::string convert_string(const std::wstring& wstr)
{
    if (wstr.empty())
        return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in both UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes: technically, it has converted each half of the UTF-16 surrogate pair into a separate UTF-8 sequence.
You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
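A minimal corrected version of the question's function (note that <codecvt> is deprecated since C++17, but it is the standard-library-only route the question asks for):

#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    // codecvt_utf8_utf16 treats the wide input as UTF-16, so surrogate
    // pairs such as the emoji are combined into one code point before
    // being encoded to UTF-8.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}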
There is already an accepted answer here, but for the record, here is some additional information.
The encoding of the nauseated face emoji was introduced in Unicode in 2016. It is 4 UTF-8 bytes (0xF0 0x9F 0xA4 0xA2) or 2 UTF-16 words (0xD83E 0xDD22).
The surprising byte sequence 0xED 0xA0 0xBE 0xED 0xB4 0xA2 is in fact the UTF-8-style encoding of a UTF-16 surrogate pair (a form known as CESU-8):
- 0xED 0xA0 0xBE is the UTF-8-style encoding of the high surrogate 0xD83E,
- 0xED 0xB4 0xA2 is the UTF-8-style encoding of the low surrogate 0xDD22.
So basically, your first encoding is direct UTF-8. The second is the UTF-8 encoding of the two UCS-2 code units that make up the UTF-16 encoding of the desired character; see the derivation sketch below.
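For the record, here is how those two surrogate values fall out of the code point U+1F922; a small self-contained sketch:

#include <cstdio>

int main()
{
    unsigned cp = 0x1F922;              // the nauseated face code point
    unsigned v  = cp - 0x10000;         // 20-bit offset: 0x0F922
    unsigned hi = 0xD800 + (v >> 10);   // high surrogate: 0xD83E
    unsigned lo = 0xDC00 + (v & 0x3FF); // low surrogate:  0xDD22
    std::printf("%04X %04X\n", hi, lo); // prints "D83E DD22"
    return 0;
}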
As the accepted answer rightly pointed out, the std::codecvt_utf8<wchar_t> is the culprit, because it's about UCS-2 and not UTF-16.
It's quite astonishing nowadays to find this obsolete encoding in standard libraries, but I suspect it is a remnant of Microsoft's lobbying in the standards committee, dating back to the old Windows support for Unicode with UCS-2.

Convert wide CString to char*

This question has been asked many times, with as many answers - none of which work for me and, it seems, many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not a specific instance.
void Dosomething(CString csFileName)
{
    char cLocFileNamestr[1024];
    char cIntFileNamestr[1024];
    // Convert from whatever version of CString is supplied
    // to an 8 bit char string
    cIntFileNamestr = ConvertCStochar(csFileName);
    sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt");
    m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, it will contain garbage characters if the Unicode string you are translating from contains characters that are not in the default code page.
Instead of using fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To create your file name, you would use swprintf_s(); a minimal sketch follows.
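A sketch of that all-wide route, keeping the question's Dosomething and m_KFile names (m_KFile is assumed to be a FILE* member, as in the question); no narrowing conversion is needed:

void Dosomething(CString csFileName)
{
    wchar_t wLocFileNamestr[1024];
    // Build the name entirely in UTF-16, then open with the wide CRT call.
    swprintf_s(wLocFileNamestr, L"%s_%s", csFileName.GetString(), L"pling.txt");
    m_KFile = _wfopen(wLocFileNamestr, L"wt");
}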
an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is unicode.
If you need to upload data to webpage or something, then use CW2A and CA2W for utf-8 and utf-16 conversion.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

WriteFile() Function for Win32 Applications

I am facing a problem with the WriteFile() function in a Win32 C++ application. The second argument asks for a pointer to the buffer that is storing the information. My information is text from input text boxes. What syntax do I use to create a pointer to that?
Here is a snippet of the code I am using:
case IDC_BUTTON_ONE:
{
    // GENERIC_WRITE (not GENERIC_READ) is needed for WriteFile, and the
    // handle is opened without FILE_FLAG_OVERLAPPED so plain synchronous
    // writes work.
    HANDLE hFile = CreateFile(TEXT("C:\\test.txt"), GENERIC_WRITE,
        0, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
}
To write a control's text to a file you'll also need these lines:
char TextBuffer[256]; // ASCII/ANSI text
DWORD SizeOut = 0;
GetDlgItemTextA(hDlg, IDC_YOUR_CONTROL_ID, TextBuffer, _countof(TextBuffer));
WriteFile(hFile, TextBuffer, (DWORD)strlen(TextBuffer), &SizeOut, NULL); // NULL: the handle was opened without FILE_FLAG_OVERLAPPED
That'll just write plain old ASCII. If you want to use unicode and TCHARs (instead of chars) then you'll need to choose your encoding and write more than "just the bytes" from the text buffer, as sketched below.
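A minimal sketch of that Unicode route, writing UTF-8 bytes (hDlg, hFile, and IDC_YOUR_CONTROL_ID are the same placeholders as above):

wchar_t WideBuffer[256];
GetDlgItemTextW(hDlg, IDC_YOUR_CONTROL_ID, WideBuffer, _countof(WideBuffer));

char Utf8Buffer[1024];
DWORD SizeOut = 0;
// With a length of -1 the conversion includes the null terminator in its
// byte count, so subtract 1 to write only the text itself.
int bytes = WideCharToMultiByte(CP_UTF8, 0, WideBuffer, -1,
                                Utf8Buffer, sizeof(Utf8Buffer), NULL, NULL);
if (bytes > 1)
    WriteFile(hFile, Utf8Buffer, bytes - 1, &SizeOut, NULL);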

Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file which contains a string table, and I would like to use some special characters: ö ü ó ú ő ű á é. So I saved the string with UTF-8 encoding.
But when I call in my cpp file, something like this:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result (garbled characters).
How do I solve this problem?
As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag for the first parameter is the magic that enables this conversion. Example:
std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
    // Determine the size required for the destination buffer.
    const int length = MultiByteToWideChar(CP_UTF8,
                                           0,              // no flags required
                                           pszUtf8String,
                                           -1,             // automatically determine length
                                           nullptr,
                                           0);

    // Allocate a buffer of the appropriate length.
    std::wstring utf16String(length, L'\0');

    // Call the function again to do the conversion.
    if (!MultiByteToWideChar(CP_UTF8,
                             0,
                             pszUtf8String,
                             -1,
                             &utf16String[0],
                             length))
    {
        // Uh-oh! Something went wrong.
        // Handle the failure condition, perhaps by throwing an exception.
        // Call the GetLastError() function for additional error information.
        throw std::runtime_error("The MultiByteToWideChar function failed");
    }

    // length counts the terminating null character; drop it so the
    // std::wstring does not end with an embedded L'\0'.
    utf16String.resize(length - 1);

    // Return the converted UTF-16 string.
    return utf16String;
}
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.
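A minimal sketch following those rules, using the DLL name and string ID that appear in the question's code (hu.dll and 12):

#include <windows.h>

int main()
{
    // LOAD_LIBRARY_AS_DATAFILE is enough when only resources are needed.
    HMODULE hMod = LoadLibraryExW(L"hu.dll", nullptr, LOAD_LIBRARY_AS_DATAFILE);
    wchar_t nn[MAX_PATH] = L"";
    if (hMod && LoadStringW(hMod, 12, nn, MAX_PATH))
    {
        MessageBoxW(nullptr, nn, L"Resource string", MB_OK);
    }
    return 0;
}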