wcstombs producing unexpected output - c++

I am trying to convert a wchar_t* to char* using wcstombs on Linux (CentOS 7, to be specific). The current system encoding is UTF-8, and the return value of wcstombs confirms that all the characters are converted. However, when I print the resulting char* string, I see unexpected values. I am pasting the sample code below. TRACE1 is a custom macro to print values.
int length;
int size = wcstombs(NULL,wideString,0);
TRACE1("number of bytes needed for conversion :-------> %d" , size);
char str[size+1];
length = wcstombs(str, wideString, size+1);
TRACE1("characters converted :-------> %d" , length);
if(length == size)
{
str[size] = '\0';
}
After the conversion I was expecting the char* string to have characters in the ASCII character set, as the wideString has only ASCII characters. However, I see non-ASCII characters in the converted char* string. wideString is something that I receive as a return value from .NET code, and I am receiving it as a 'wchar_t' to keep it uniform across Linux and Windows. I am not sure if that can have an impact here?
I have tried setting the locale to the default locale and it didn't help. Any help is greatly appreciated. I am pasting a sample wide string and the output string below for reference.
Below is what the original wide character string looks like
eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6
And this is what the converted string looks like
ü°<92>§¥¥ý©<90><95>¡¥ý<8b><92>¦¥<8f>ý©<94><93><85><96>ý¨<92>¤<8d><8c>ý©<98>´<9d>¢ý<93><92>¦¥<8f>ü±<92><97>©<95>ý³<92><96>¥<8e>ü±<99>¶¹<89>ü¶<92><94><8d>¤ý<94><9c><86>µ<89>
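For debugging, a minimal diagnostic sketch along these lines (assuming wideString is the same pointer as in the code above) can show whether each wchar_t element really holds one Unicode code point; for the ASCII-only sample above, every value should print below 0x80.
#include <clocale>
#include <cstdio>
void dumpWide(const wchar_t* wideString)
{
    // wcstombs converts according to the current locale; without this call
    // the default "C" locale is active and non-ASCII conversions can fail.
    setlocale(LC_ALL, "");
    for (size_t i = 0; wideString[i] != L'\0'; ++i)
    {
        // Expect values below 0x80 for a purely ASCII wide string.
        printf("wideString[%zu] = 0x%08lX\n", i, (unsigned long)wideString[i]);
    }
}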

Related

convert MFC's CString to int for both ASCII and UNICODE

Converting CString to an int in ASCII mode is as simple as
CString s("123");
int n = atoi(s);
However, that doesn't work for projects in UNICODE mode, as CString becomes a wide-char string.
How do I write my code to cover both ASCII and UNICODE modes without extra if statements?
Turns out there's a _ttoi() available just for that purpose:
CString s( _T("123") );
int n = _ttoi(s);
This works for both modes with no extra effort.
If you need to convert hexadecimal (or other-base) numbers you can resort to a more generic strtol() variant:
CString s( _T("0xFA3") );
int n = _tcstol(s, nullptr, 16);
There's a special version of CString that uses multibyte characters even if your build is configured for wide characters: CStringA. It will also convert from wide characters automatically.
CString s(_T("123"));
CStringA sa = s;
int n = atoi(sa);
There's a corresponding CStringW that only uses wide characters.
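For completeness, the wide-character-only counterpart would look roughly like this (a sketch, not from the original answer): with CStringW you use _wtoi and wcstol instead of atoi and strtol.
CStringW ws(L"123");
int n = _wtoi(ws);                      // wide-character counterpart of atoi
CStringW hex(L"0xFA3");
long m = wcstol(hex, nullptr, 16);      // wide-character counterpart of strtol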

char Array to string conversion results in strange characters - sockets

I currently have the following code
char my_stream[800];
std::string my_string;
iResult = recv(clntSocket,my_stream,sizeof(my_stream),0);
my_string = std::string(my_stream);
Now when I attempt to convert the char array to a string, I get weird characters in the string. Any suggestions on what I might be doing wrong?
You're getting weird characters because your string's length is not equal to the number of bytes received.
You should initialize the string like so:
char* buffer = new char[512];
ssize_t bytesRead = recv(clntSocket,buffer,512,0);
std::string msgStr = std::string(buffer,bytesRead);
delete[] buffer;
The most common solution is to zero every byte of the buffer before reading anything.
char buffer[512] = { 0 };
If you're reading in a zero-terminated string from your socket, there's no need for a conversion, it's already a char string. If it's not zero-terminated already, you'll need some other kind of terminator because sockets are streams (assuming this is TCP). In other words, you don't need my_string = std::string(my_stream);
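A minimal sketch of that idea, assuming the peer ends each message with a '\n' terminator (the terminator and the readLine name are assumptions, not from the question):
#include <string>
#include <sys/socket.h>
std::string readLine(int clntSocket)
{
    std::string msg;
    char chunk[512];
    for (;;)
    {
        ssize_t n = recv(clntSocket, chunk, sizeof(chunk), 0);
        if (n <= 0)                         // error or connection closed
            break;
        msg.append(chunk, n);               // append only the bytes actually received
        if (msg.find('\n') != std::string::npos)
            break;                          // terminator seen: one full message read
    }
    return msg;
}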
Have you tried printing my_stream directly, without converting it to a string?
It may be a mismatch between the format of the data as sent and as received.
The data on the other side may be in another format, such as Unicode, while you are trying to print it as a single-byte array.
If only part of the string shows weird characters, it is most likely an error related to the null terminator at the end of my_stream being missing; in that case, increase the size of the my_stream array.

C++ string encoding UTF8 / unicode

I am trying to send the character "Т" (not a normal capital T; Unicode decimal value 1058) from C++ to VB.
However, with the method below, Message is returned to VB and appears as "Т", which is the above character encoded in ANSI.
#if defined(_MSC_VER) && _MSC_VER > 1310
# define utf8(str) ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
static char szBuf[1024];
WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
return szBuf;
}
#else
# define utf8(str) str
#endif
BSTR _stdcall chatTest()
{
BSTR Message;
CString temp("temp test");
temp+=utf8("\u0422");
int len = temp.GetLength();
Message = SysAllocStringByteLen ((LPCTSTR)temp, len+1 );
return Message;
}
If I just do temp+=("\u0422"); without the utf8 function, it sends the data as "?", and it actually is a question mark. (Sometimes Unicode characters show up as question marks in VB but still have the correct Unicode decimal value; that is not the case here: it really changes it to a question mark.)
In VB, if I output the String variable holding the data from Message (while it shows as "Т") to a text file, it appears as "Т".
So as far as I can tell it's in UTF-8 in C++, then somehow gets converted to ANSI in VB (or before it's sent?), and then when written to a file it's changed back to UTF-8?
I just need to keep the "Т" intact when sending from C++ to VB. I know VB strings can hold that character, because from another source within VB I am able to store it (it appears as a "?", but has the proper Unicode decimal value).
Any help is greatly appreciated.
Thanks
A BSTR is not UTF-8, it's UTF-16 which is what you get with the L"" prefix. Take out the UTF-8 conversion and use CStringW. And use LPCWSTR instead of LPCTSTR.
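A minimal sketch of that suggestion applied to the question's function might look like this (the exact rewrite is an assumption, not the answerer's code):
BSTR _stdcall chatTest()
{
    CStringW temp(L"temp test");
    temp += L"\u0422";                      // 'Т' stays a UTF-16 code unit, no UTF-8 detour
    // SysAllocStringLen takes a length in characters, not bytes.
    return SysAllocStringLen(temp, temp.GetLength());
}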

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to-executable> <path-to-file> and then call popen() on this command. Right now, both of these components are declared as char*. I need to read the output of the command, so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed, returning -1, after that.
Then I tried to write my own iconv implementation based on some sample code found on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and errno is set to EINVAL. I've verified that <size-of-inbuf> is set correctly. I've got no clue why this code is failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!
You'll need a UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>
char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
char result_buffer[MAXPATHLEN];
iconv_t converter = iconv_open("UTF-8", "wchar_t");
char* result = result_buffer;
char* input = (char*)wchar;
size_t output_available_size = sizeof result_buffer;
size_t input_available_size = utf32_bytes;
size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
if (result_code == -1)
{
perror("iconv");
return NULL;
}
iconv_close(converter);
return strdup(result_buffer);
}
int main()
{
wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
char* utf8 = utf8path(hello_world, sizeof hello_world);
printf("%s\n", utf8);
free(utf8);
return 0;
}
The utf8path function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
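For the pointer case, the call would look roughly like this (getPathFromSomewhere is a hypothetical placeholder for wherever the wide string comes from):
const wchar_t* ptr = getPathFromSomewhere();                     // hypothetical source
char* utf8 = utf8path(ptr, (wcslen(ptr) + 1) * sizeof(wchar_t)); // byte length incl. L'\0'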
Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the wide-character to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as the conversion is not all that complicated.
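As a rough sketch of what rolling your own might look like, assuming wchar_t holds UTF-32 code points (as noted above for Mac OS X); invalid code points are simply skipped here, which a production version should handle properly:
#include <string>
std::string toUtf8(const wchar_t* ws)
{
    std::string out;
    for (; *ws; ++ws)
    {
        unsigned long c = (unsigned long)*ws;
        if (c < 0x80) {                         // 1-byte sequence (ASCII)
            out += (char)c;
        } else if (c < 0x800) {                 // 2-byte sequence
            out += (char)(0xC0 | (c >> 6));
            out += (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {               // 3-byte sequence
            if (c >= 0xD800 && c <= 0xDFFF)
                continue;                       // surrogates are not valid code points
            out += (char)(0xE0 | (c >> 12));
            out += (char)(0x80 | ((c >> 6) & 0x3F));
            out += (char)(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {             // 4-byte sequence
            out += (char)(0xF0 | (c >> 18));
            out += (char)(0x80 | ((c >> 12) & 0x3F));
            out += (char)(0x80 | ((c >> 6) & 0x3F));
            out += (char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}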

WideCharToMultiByte problem

I have the lovely functions from my previous question, which work fine if I do this:
wstring temp;
wcin >> temp;
string whatever( toUTF8(getSomeWString()) );
// store whatever, copy, but do not use it as UTF8 (see below)
wcout << toUTF16(whatever) << endl;
The original form is reproduced, but the in-between form often contains extra characters. If I enter, for example, àçé as the input and add a cout << whatever statement, I'll get ┬à┬ç┬é as output.
Can I still use this string to compare to others procured from an ASCII source? Or, asked differently: if I were to output ┬à┬ç┬é through the UTF-8 cout on Linux, would it read àçé? Is the byte content of the string àçé, read on UTF-8 Linux by cin, exactly the same as what the Win32 API gives me?
Thanks!
PS: the reason I'm asking is because I need to use the string a lot to compare to other read values (comparing and concatenating...).
Let me start by saying that there appears to be simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio).
What you can do however for your tests is to output your UTF-8 text via the Win32 API fn WriteConsoleA:
if(!SetConsoleOutputCP(CP_UTF8)) { // 65001
cerr << "Failed to set console output mode!\n";
return 1;
}
HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD nNumberOfCharsWritten;
const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
if(!WriteConsoleA(consout, utf8, strlen(utf8), &nNumberOfCharsWritten, NULL)) {
DWORD const err = GetLastError();
cerr << "WriteConsole failed with << " << err << "!\n";
return 1;
}
This should output:
Umlaut AE = Ä / ue = ü
if you set your console (cmd.exe) to use the Lucida Console font.
As for your question (taken from your comment) whether a Win32-API-converted string is the same as a raw UTF-8 (Linux) string:
I will say yes: given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
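If you want to sanity-check that claim, a small sketch (the Ä byte values mirror the example above):
char buf[8] = {0};
// Ä is U+00C4; its UTF-8 encoding is 0xC3 0x84 on any platform.
WideCharToMultiByte(CP_UTF8, 0, L"\u00C4", -1, buf, sizeof(buf), NULL, NULL);
// buf now holds "\xC3\x84" followed by a terminating '\0'.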
When you convert the string to UTF-16, each character is 16 bits wide; you can't compare it to ASCII values because those aren't 16-bit values. You have to convert them before comparing, or write a specialized ASCII-comparison function.
I doubt the UTF-8 cout in Linux would produce the same correct output unless it were regular ASCII values, as the UTF-8 encoding is binary-compatible with ASCII for code points below 128, and I assume UTF-16 relates to UTF-8 in a similar fashion.
The good news is there are many converters out there written to convert these strings to different character sets.