Get strings in the right encoding in C++

I asked a similar question before, but I am still having trouble with encodings in C++. Let me describe the problem as precisely as I can.
I have a C++ client communicating with a C# service over TCP. I need to display the messages from the service in a message box (Win32 API).
The bytes sent by the C# service are UTF-8 encoded.
Important to know: the C++ client will only run on Windows systems.
This is the code that receives the bytes and displays the text:
char buffer[1024];
int receivedBytes = recv(socketHandle, buffer, sizeof(buffer) - 1, 0);
buffer[receivedBytes] = '\0'; // recv() does not null-terminate
MessageBox(mainWindow, (LPCTSTR)buffer, (LPCTSTR)"Fehler", MB_OK | MB_ICONERROR);
If the text contains characters like üäö, they are not shown correctly in the message box.
What can I do to receive the message as a UTF-8 string in C++?
Is there a way to convert the char[] to a UTF-8 string?
Thanks for helping,
Tobi

If you want to display Unicode characters in Windows, you need to convert the UTF-8 string to UTF-16 (formerly UCS-2), because that is the Unicode encoding Windows handles natively. You do that with the MultiByteToWideChar function.
Also make sure that #define UNICODE is set before you include the Windows headers, so that MessageBox resolves to MessageBoxW, or call MessageBoxW explicitly.
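For example, a minimal sketch of that conversion, reusing the buffer, receivedBytes and mainWindow variables from the question (error handling omitted; needs <windows.h> and <string>):

buffer[receivedBytes] = '\0'; // recv() does not null-terminate

// First call computes the required UTF-16 length (includes the terminator because of -1).
int wideLen = MultiByteToWideChar(CP_UTF8, 0, buffer, -1, nullptr, 0);
std::wstring wide(wideLen, L'\0');

// Second call performs the actual UTF-8 -> UTF-16 conversion.
MultiByteToWideChar(CP_UTF8, 0, buffer, -1, &wide[0], wideLen);

MessageBoxW(mainWindow, wide.c_str(), L"Fehler", MB_OK | MB_ICONERROR);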

Related

Write strings received over a socket to the input of a process

I have an application on the Windows platform that receives remote commands from applications running on the Linux platform.
The Linux applications have difficulty accessing directories or files whose names contain accented characters: they send the command to access such a file or directory, and the reply is always "directory/file not found".
I think the two applications are using different code pages. I say this because I previously had a similar problem: directories and files with accented names printed as strange symbols in std::cout on the Linux side, and after I added SetConsoleOutputCP(CP_UTF8) to the Windows application the paths became readable. Does this mean the Linux application uses code page 65001? In any case, the problem persists when sending strings containing paths: whenever the Linux application tries to access a path containing accented words, it fails.
I'll try to show how the two applications communicate.
Windows Side:
In short, this is the part where the client receives the message from the Linux application and then writes what was received to the process's input. When the path written here contains accented characters, the process reports that it cannot find the file.
BYTE buffer[4096];
DWORD BytesWritten;

int ret = SSL_read(stI->ssl, (char*)buffer, sizeof(buffer));
if (ret <= 0)
    break;

if (!WriteFile(stI->hStdIn, buffer, ret, &BytesWritten, NULL))
    break;
And then it reads the output of the process and sends the content to the Linux application.
BYTE buffer[4096];
DWORD BytesAvailable, BytesRead;

// (BytesAvailable is presumably set by an elided PeekNamedPipe call.)
if (!ReadFile(stI->hStdOut, buffer, min(sizeof(buffer), BytesAvailable), &BytesRead, NULL))
    break;

ret = SSL_write(stI->ssl, (char*)buffer, BytesRead); // send what was actually read
if (ret <= 0)
    break;
Linux Side:
This part is very basic: the application reads user input and then sends it to the Windows application.
std::string inputBuffer;
ZH->console_input(inputBuffer, 33); // this function only handles the input and output of data with termios

inputBuffer += '\n'; // to simulate pressing Enter in the Windows application

// Send the typed path to the Windows application
SSL_write(session_data.ssl, inputBuffer.c_str(), inputBuffer.size());
The receiving part is basically the same as in the Windows application: it receives the data into a char buffer and then prints it to the screen with std::cout.
The only difference is that the socket is set to non-blocking and I use the select function.
Any suggestions on how to solve this problem?
Your best bet is to use proper Unicode encodings consistently. Windows natively uses UTF-16 (two bytes per code unit), while Linux uses UTF-8, which encodes ASCII characters in a single byte and uses multi-byte sequences for everything else. If you do a proper conversion from Windows UTF-16 to UTF-8 before sending (and the reverse on receipt), things should work correctly.
C++11 and Boost provide some Unicode support, but for gold-standard support, take a look at ICU.
Sockets, however, just transmit bytes, so they have nothing to do with Unicode conversions.
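For example, a hedged sketch of a pair of helpers the Windows side could use, converting to UTF-8 just before SSL_write and back to UTF-16 right after SSL_read (error handling omitted; needs <windows.h> and <string>):

// UTF-16 -> UTF-8, for sending to the Linux peer.
std::string ToUtf8(const std::wstring& w)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), nullptr, 0, nullptr, nullptr);
    std::string s(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), &s[0], len, nullptr, nullptr);
    return s;
}

// UTF-8 -> UTF-16, for data received from the Linux peer.
std::wstring FromUtf8(const std::string& s)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(), nullptr, 0);
    std::wstring w(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(), &w[0], len);
    return w;
}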

Capture spawned process stdout as unicode

In my C++/WinAPI code, I want to run some commands and capture their output. To test non-ASCII output, I renamed my network connection to Ethérnét אבג БбГгДд and ran ipconfig. When run in a command prompt, the output comes out correctly (visible when using a supporting font like Courier New):
C:\>ipconfig
Windows IP Configuration
Ethernet adapter Ethérnét אבג БбГгДд:
(...)
I tried to redirect the output to a pipe, following the example in this answer. But the byte array returned by ReadFile() is not Unicode: it is encoded in CP_OEMCP (CP437 in my case), so the Hebrew and Russian characters come out as '?'s. Since those characters are already lost, no further handling can restore them.
Obviously it's possible, since cmd in a console window does it. How can I do it?
It would seem that ipconfig produces Unicode output when it detects that the output device is the console, and ANSI output otherwise. This is likely to be a backwards-compatibility measure.
Most other built-in command-line tools are likely to either be ANSI-only or to behave in the same way as ipconfig, for the same reason. In Windows, command-line tools are meant, well, for use on the command line; programmers are discouraged from shelling out to them and parsing the output. Instead, you should use the corresponding APIs.
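For instance, the adapter names that ipconfig prints can be obtained losslessly (as UTF-16) from the GetAdaptersAddresses API instead. A minimal sketch, using a fixed 16 KB buffer for brevity (real code should retry on ERROR_BUFFER_OVERFLOW):

#include <winsock2.h>
#include <iphlpapi.h>
#include <cstdio>
#include <cstdlib>
#pragma comment(lib, "iphlpapi.lib")

int main()
{
    ULONG size = 16 * 1024;
    IP_ADAPTER_ADDRESSES* list = (IP_ADAPTER_ADDRESSES*)malloc(size);

    if (GetAdaptersAddresses(AF_UNSPEC, 0, nullptr, list, &size) == ERROR_SUCCESS)
    {
        for (IP_ADAPTER_ADDRESSES* a = list; a != nullptr; a = a->Next)
            wprintf(L"%s\n", a->FriendlyName); // FriendlyName is already UTF-16
    }

    free(list);
    return 0;
}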
If you know which language you are expecting, you might be able to choose a code page that will preserve the content.
Added by #Jonathan: Undocumented: Turns out you can control the encoding of built-in commands using the environment variable OutputEncoding. I tested with ipconfig, but presumably it works with other built-in tools as well:
> for %e in ("" Unicode Ansi UTF8) do (set OutputEncoding=%~e& ipconfig >ipconfig-%~e.txt)
> (set OutputEncoding= & ipconfig 1>ipconfig-.txt )
> (set OutputEncoding=Unicode & ipconfig 1>ipconfig-Unicode.txt )
> (set OutputEncoding=Ansi & ipconfig 1>ipconfig-Ansi.txt )
> (set OutputEncoding=UTF8 & ipconfig 1>ipconfig-UTF8.txt )
And indeed, the ipconfig-*.txt files are encoded as expected! Note that this is undocumented, but it does work for me.
Addendum: as of Windows 10 v1809, another alternative is to create a pseudoconsole.
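A rough sketch of that pseudoconsole route (Windows 10 v1809+, error handling and cleanup omitted). The child believes it is writing to a real console, so ipconfig takes its Unicode path, and the host reads the output, encoded as UTF-8 plus VT escape sequences, from outRead:

#include <windows.h>

HANDLE inRead, inWrite, outRead, outWrite;
CreatePipe(&inRead, &inWrite, nullptr, 0);
CreatePipe(&outRead, &outWrite, nullptr, 0);

HPCON hPC;
COORD size = { 120, 30 };
CreatePseudoConsole(size, inRead, outWrite, 0, &hPC);

// Attach the pseudoconsole to the child via the startup attribute list.
SIZE_T bytes = 0;
InitializeProcThreadAttributeList(nullptr, 1, 0, &bytes);
STARTUPINFOEXW si = {};
si.StartupInfo.cb = sizeof(si);
si.lpAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), 0, bytes);
InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &bytes);
UpdateProcThreadAttribute(si.lpAttributeList, 0, PROC_THREAD_ATTRIBUTE_PSEUDOCONSOLE,
                          hPC, sizeof(hPC), nullptr, nullptr);

PROCESS_INFORMATION pi = {};
wchar_t cmd[] = L"ipconfig";
CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE,
               EXTENDED_STARTUPINFO_PRESENT, nullptr, nullptr, &si.StartupInfo, &pi);

// ReadFile(outRead, ...) now yields the output without the OEM code page loss.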
A console application can produce output in several different ways.

For a console handle, we can use WriteConsoleW with text that is already UNICODE (UTF-16).

If we want to use WriteConsoleA or WriteFile on a console handle, we first need to convert the UNICODE text to multi-byte with WideCharToMultiByte, using CodePage = GetConsoleOutputCP().

If the text is not UNICODE to begin with (say UTF-8 or ANSI), we first need to convert it to UNICODE with MultiByteToWideChar (with CP_UTF8 or CP_ACP) and then convert it back to multi-byte with WideCharToMultiByte(GetConsoleOutputCP(), ..).

Usually (by default) GetConsoleOutputCP() returns the same value as GetOEMCP(), so it has the same effect in MultiByteToWideChar and WideCharToMultiByte as CP_OEMCP (that constant value is translated to GetOEMCP()).

When the output handle is redirected to a file, the application can only use WriteFile. However, it may write the data to the file in any format: UNICODE, ANSI (CP_ACP), UTF-8 (CP_UTF8), etc. Which format is used depends entirely on the concrete application; you cannot fully control this. Usually you will receive multi-byte output in the CP_OEMCP encoding. You then need to decide how to process it; most simply, you first convert it to UNICODE and work with the wide form. If you need ANSI, you need one more conversion.

Say you pass pipe output in the CP_OEMCP encoding straight to OutputDebugStringA: you get unreadable output for non-English text. But after the two conversions CP_OEMCP -> UNICODE -> CP_ACP, the text displays correctly with OutputDebugStringA. And because OutputDebugStringW exists, converting to UNICODE alone is enough there.

Some applications also have special options that control the format of output to a file. For example, ipconfig.exe looks for the "OutputEncoding" environment variable and, depending on its string value ("Unicode", "Ansi", "UTF-8"), produces different output. By default (if the variable does not exist or has an unknown value), CP_OEMCP is used.
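If you rely on that, a short hedged sketch: set the variable in your own environment before CreateProcess so the child inherits it (the value spellings reported above vary between "UTF8" and "UTF-8"; this is undocumented, so verify on your system):

// Inherited by any child we spawn afterwards with CreateProcess.
SetEnvironmentVariableW(L"OutputEncoding", L"UTF-8");
// ... now launch ipconfig.exe with redirected stdout as usual.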
An example of a pipe-read procedure, assuming the input data is in the CP_OEMCP encoding:
void OnRead(PVOID buf, ULONG cbTransferred)
{
    if (cbTransferred)
    {
        // First call: get the required UTF-16 length.
        if (int len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, 0, 0))
        {
            PWSTR pwz = (PWSTR)alloca((1 + len) * sizeof(WCHAR));

            // Second call: do the actual CP_OEMCP -> UTF-16 conversion.
            if (len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, pwz, len))
            {
                if (g_bUseAnsi)
                {
                    // Extra hop for ANSI output: UTF-16 -> CP_ACP.
                    if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, 0, 0, 0, 0))
                    {
                        PSTR psz = (PSTR)alloca(cbTransferred + 1);

                        if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, psz, cbTransferred, 0, 0))
                        {
                            DoPrint(psz, cbTransferred, OutputDebugStringA);
                        }
                    }
                }
                else
                {
                    DoPrint(pwz, len, OutputDebugStringW);
                }
            }
        }
    }
}
// The debugger may print an over-long buffer incompletely, so split it into small chunks.
template<typename T> void DoPrint(T* p, ULONG len, void (WINAPI* fnOutput)(const T*))
{
    ULONG cb;
    T* q = p;

    do
    {
        cb = min(len, 256);
        q = p + cb;

        // Temporarily null-terminate the chunk, print it, then restore.
        T c = *q;
        *q = 0;
        fnOutput(p);
        *q = c;

        p = q;
    } while (len -= cb);
}
About your concrete case: ipconfig.exe uses WriteConsoleW for console output. As a result it does not depend on the current system locale and can display multilingual text correctly. But other tools, such as route.exe, use WriteFile for output (both to console and to file) and first convert the UNICODE text to multi-byte with WideCharToMultiByte(CP_OEMCP, ..). There will be problems here if you try to display characters that do not exist in the CP_OEMCP code page (the current system locale). If you have CP437, the Hebrew and Russian characters are lost by the UNICODE -> CP_OEMCP step; the only remedy is direct UNICODE output to the console or file. Whether that is possible depends on the concrete application: for route.exe it is not possible; for ipconfig.exe it is, because it always writes to the console in UNICODE and can also write a file in UNICODE or UTF-8 if you set "OutputEncoding" to "Unicode" or "UTF-8".

Convert wide CString to char*

This question has been asked many times, with as many answers, none of which work for me and, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not in a specific instance.
void Dosomething(CString csFileName)
{
    char cLocFileNamestr[1024];
    char cIntFileNamestr[1024];

    // Convert from whatever version of CString is supplied
    // to an 8-bit char string (this is the part I am missing).
    cIntFileNamestr = ConvertCStochar(csFileName);

    sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt");
    m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, it will fill in garbage characters if the Unicode string you are converting contains characters that do not exist in the default code page.
Instead of using fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To create your file name, you would use swprintf_s().
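For example, a hedged sketch of the question's function kept wide end to end under a UNICODE build; the hypothetical ConvertCStochar disappears entirely:

void Dosomething(CString csFileName)
{
    wchar_t wLocFileNamestr[1024];

    // Build the name in UTF-16; no lossy narrow conversion anywhere.
    swprintf_s(wLocFileNamestr, L"%s_%s", csFileName.GetString(), L"pling.txt");

    m_KFile = _wfopen(wLocFileNamestr, L"wt");
}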
an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
If you need to upload data to a webpage or something similar, then use CW2A and CA2W for UTF-8 and UTF-16 conversion.
CStringW unicode = L"Россия";
MessageBoxW(0, unicode, L"Russian", 0); // should be okay

CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0, utf8, "format error", 0); // WinAPI doesn't take UTF-8

char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0, buf, "format error", 0); // same problem

// Send this buf to a webpage or other UTF-8 systems.
// This should be compatible with Notepad etc.; the text will appear correctly.
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));

// Convert UTF-8 back to UTF-16.
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0, unicode, L"okay", 0);

Set Unicode text on MFC form controls in Multi-Byte Char Set application

I have a Multi-Byte Char Set (MBCS) MFC Windows application. Now I need to display international characters on Windows controls. I can't use ANSI characters directly, because displaying them correctly requires the Windows locale to be set to the matching country, and I need the characters to display correctly under every Windows locale. For this purpose I must convert them to Unicode. I can display the required international characters in MessageBoxW, but how do I display them on Windows MFC controls using SetWindowText?
To show a Unicode string in MessageBoxW, I construct it in a wstring:
WORD g [] = {0x105,0x106,0x107,0x108,0x109,0x110,0x111,0x112,0x113,0x114,0x115,0x116,0x117,0x118,0x119,0x120};
wstring gg (reinterpret_cast<wchar_t*>(g),15);
MessageBoxW(NULL, gg.c_str() , gg.c_str() , MB_ICONEXCLAMATION | MB_OK);
Setting the MFC form control text:
class MyFrm : public CDialogEx
{
    virtual BOOL OnInitDialog();
};

...

BOOL MyFrm::OnInitDialog()
{
    GetDlgItem(IDC_EDIT_TICKET_NUMBER)->SetWindowText( ??? );
}
Is it possible to somehow convert the wstring gg to a CString and show the Unicode characters on a window control?
You could try casting your CDialogEx 'this' object to HWND and then calling the Win32 API explicitly to set the text using wide chars. Your code would look something like this:
BOOL MyFrm::OnInitDialog()
{
    SetDlgItemTextW((HWND)(*this), IDC_EDIT_TICKET_NUMBER, gg.c_str());
    return TRUE;
}
But as I mentioned earlier, Unicode has been supported since Windows XP, and using ANSI is really not a good idea unless you're targeting the very, very old OSes that preceded it. Using ANSI nowadays causes ALL the strings you pass to be converted to Unicode by the Win32 API first. So it is a better idea to switch your project entirely to UNICODE.
First, note that you can simply initialize a std::wstring directly with your Unicode hex character data, without any ugly and useless reinterpret_cast<wchar_t*>, etc.
Instead of this:
WORD g [] = {0x105,0x106,0x107,0x108,...,0x120};
wstring gg (reinterpret_cast<wchar_t*>(g),15);
just consider that:
wstring text = L"\x0105\x0106\x0108...\x0120";
The latter seems much cleaner to me.
Second, if you want to pass an instance of std::wstring to an MFC method that expects a const wchar_t* input string pointer, just use the wstring::c_str() method.
In addition, the best suggestion I can give you is to just port your app to Unicode.
ASCII/MBCS should be considered a programming model of the past for MFC; it brings lots of problems when you want to write "international" code.
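Putting the two answers together, a minimal sketch for the question's dialog (the string literal is sample data standing in for the question's gg; calling the W API explicitly works even in an MBCS build):

BOOL MyFrm::OnInitDialog()
{
    CDialogEx::OnInitDialog();

    std::wstring gg = L"\x0105\x0106\x0107\x0108";
    ::SetDlgItemTextW(m_hWnd, IDC_EDIT_TICKET_NUMBER, gg.c_str());

    return TRUE;
}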

Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file which contains a string table. I would like to use some special characters: ö ü ó ú ő ű á é, so I saved the string with UTF-8 encoding.
But when I call something like this in my cpp file:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result: the accented characters come out garbled.
How do I solve this problem?
As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag as the first parameter is the magic that enables this conversion. Example:
#include <windows.h>
#include <string>
#include <stdexcept>

std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
    // Determine the size required for the destination buffer.
    const int length = MultiByteToWideChar(CP_UTF8,
                                           0,              // no flags required
                                           pszUtf8String,
                                           -1,             // automatically determine length
                                           nullptr,
                                           0);

    // Allocate a buffer of the appropriate length.
    std::wstring utf16String(length, L'\0');

    // Call the function again to do the conversion.
    if (!MultiByteToWideChar(CP_UTF8,
                             0,
                             pszUtf8String,
                             -1,
                             &utf16String[0],
                             length))
    {
        // Uh-oh! Something went wrong.
        // Handle the failure condition, perhaps by throwing an exception.
        // Call the GetLastError() function for additional error information.
        throw std::runtime_error("The MultiByteToWideChar function failed");
    }

    // Because the input length was passed as -1, the returned count includes
    // the terminating null character; drop it so it is not embedded in the wstring.
    utf16String.pop_back();

    // Return the converted UTF-16 string.
    return utf16String;
}
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.
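Following those three rules, a hedged sketch of the question's LoadString call (the module name hu.dll and string ID 12 are taken from the question; LOAD_LIBRARY_AS_DATAFILE because only the resources are needed):

HMODULE hu = LoadLibraryExW(L"hu.dll", nullptr, LOAD_LIBRARY_AS_DATAFILE);
if (hu)
{
    wchar_t nn[MAX_PATH];

    // LoadStringW fills nn with the UTF-16 string resource.
    if (LoadStringW(hu, 12, nn, MAX_PATH))
        MessageBoxW(nullptr, nn, L"hu.dll string 12", MB_OK);

    FreeLibrary(hu);
}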