MultiByteToWideChar or WideCharToMultiByte and txt files - c++

I'm trying to write a universal text editor which can open and display ANSI and Unicode in EditControl. Do I need to repeatedly call ReadFile() if I determine that the text is ANSI? Can't figure out how to perform this task. My attempt below does not work, it displays '?' characters in EditControl.
LARGE_INTEGER fSize;
GetFileSizeEx(hFile,&fSize);
int bufferLen = fSize.QuadPart/sizeof(TCHAR)+1;
TCHAR* buffer = new TCHAR[bufferLen];
buffer[0] = _T('\0');
DWORD wasRead = 0;
ReadFile(hFile,buffer,fSize.QuadPart,&wasRead,NULL);
buffer[wasRead/sizeof(TCHAR)] = _T('\0');
if(!IsTextUnicode(buffer,bufferLen,NULL))
{
CHAR* ansiBuffer = new CHAR[bufferLen];
ansiBuffer[0] = '\0';
WideCharToMultiByte(CP_ACP,0,buffer,bufferLen,ansiBuffer,bufferLen,NULL,NULL);
SetWindowTextA(edit,ansiBuffer);
delete[]ansiBuffer;
}
else
SetWindowText(edit,buffer);
CloseHandle(hFile);
delete[]buffer;

There are a few buffer length errors and oddities, but here's your big problem. You call WideCharToMultiByte incorrectly. That is meant to receive UTF-16 encoded text as input. But when IsTextUnicode returns false that means that the buffer is not UTF-16 encoded.
The following is basically what you need:
if(!IsTextUnicode(buffer,bufferLen*sizeof(TCHAR),NULL))
SetWindowTextA(edit,(char*)buffer);
Note that I've fixed the length parameter to IsTextUnicode.
For what it is worth, I think I'd read in to a buffer of char. That would remove the need for the sizeof(TCHAR). In fact I'd stop using TCHAR altogether. This program should be Unicode all the way - TCHAR is what you use when you compile for both NT and 9x variants of Windows. You aren't compiling for 9x anymore I imagine.
So I'd probably code it like this:
char* buffer = new char[filesize+2];//+2 for UTF-16 null terminator
DWORD wasRead = 0;
ReadFile(hFile, buffer, filesize, &wasRead, NULL);
//add error checking for ReadFile, including that wasRead == filesize
buffer[filesize] = '\0';
buffer[filesize+1] = '\0';
if (IsTextUnicode(buffer, filesize, NULL))
SetWindowTextW(edit, (wchar_t*)buffer);
else
SetWindowTextA(edit, buffer);
delete[] buffer;
Note also that this code makes no allowance for the possibility of receiving UTF-8 encoded text. If you want to handle that, you'd need to take your char buffer and send it through MultiByteToWideChar using CP_UTF8.
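One cheap way to decide which path a buffer should take is to look for a byte order mark before falling back to IsTextUnicode. The following is only a minimal sketch of that idea: the Bom enum and detectBom function are hypothetical names, not part of the Windows API, and many UTF-8 files carry no BOM at all, so a result of None does not prove the text is ANSI.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a file buffer for a byte order mark.
// Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and would be
// reported as Utf16LE here; this sketch does not try to tell them apart.
Bom detectBom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return Bom::Utf8;     // decode the rest with CP_UTF8
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return Bom::Utf16LE;  // already in the form the wide API expects
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return Bom::Utf16BE;  // needs a byte swap first
    }
    return Bom::None;         // no BOM: fall back to heuristics
}
```

A buffer reported as Utf8 would then go through MultiByteToWideChar with CP_UTF8, skipping the three BOM bytes.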

Related

Why does RAD Studio CreateBlobStream with CryptUnprotectData return extra characters?

I'm writing a recovery app that pulls passwords from Chrome. It has a GUI, so I've used their SQLite wrapper, which uses both SQLConnection and SQLQuery. Here's a snip of my code:
//Create our blob stream
TStream *Stream2 = SQLQuery1->CreateBlobStream(SQLQuery1->FieldByName("password_value"), bmRead);
//Get our blob size
int size = Stream2->Size;
//Create our buffer
char* pbDataInput = new char[size+1];
//Adding null terminator to buffer
memset(pbDataInput, 0x00, sizeof(char)*(size+1));
//Write to our buffer
Stream2->ReadBuffer(pbDataInput, size);
DWORD cbDataInput = size;
DataOut.pbData = pbDataInput;
DataOut.cbData = cbDataInput;
LPWSTR pDescrOut = NULL;
//Decrypt password
CryptUnprotectData( &DataOut,
&pDescrOut,
NULL,
NULL,
NULL,
0,
&DataVerify);
//Output password
UnicodeString password = (UnicodeString)(char*)DataVerify.pbData;
passwordgrid->Cells[2][i] = password;
The output data looks fine, except it behaves as if something went wrong with my null terminator. Here's what output looks like on every line:
I've Read
Windows doc for CryptUnprotectData:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa382377.aspx
Embarcadero documentation for CreateBlobStream:
http://docwiki.embarcadero.com/Libraries/en/Data.DB.TDataSet.CreateBlobStream
memset:
http://www.cplusplus.com/reference/cstring/memset/
Your reading and decrypting calls operate on raw bytes only. They know nothing about strings and don't care about them. The null terminator you are adding to pbDataInput is never used, so get rid of it:
//Get our blob size
int size = Stream2->Size;
//Create our buffer
char* pbDataInput = new char[size];
//Write to our buffer
Stream2->ReadBuffer(pbDataInput, size);
DWORD cbDataInput = size;
...
delete[] pbDataInput;
delete Stream2;
Now, when assigning pbData to password, you are casting pbData to char*, so the UnicodeString constructor interprets the data as a null-terminated ANSI string and will convert it to UTF-16 using the system default ANSI codepage, which is potentially a lossy conversion for non-ASCII characters. Is that what you really want?
If so, and if the decrypted data is not actually null-terminated, you have to specify the number of characters to the UnicodeString constructor:
UnicodeString password( (char*)DataVerify.pbData, DataVerify.cbData );
On the other hand, if the decrypted output is already in UTF-16, you need to cast pbData to wchar_t* instead:
UnicodeString password = (wchar_t*)DataVerify.pbData;
Or, if not null-terminated:
UnicodeString password( (wchar_t*)DataVerify.pbData, DataVerify.cbData / sizeof(wchar_t) );
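The difference between the counted and the null-terminated constructors can be illustrated with standard C++ strings. This is a sketch only: std::string and std::u16string stand in for UnicodeString, and narrowFromBlob and utf16FromBlob are hypothetical helper names. The point it shows is that a byte count is used directly for narrow data but must be divided by the code unit size for UTF-16 data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Narrow payload: the byte count is also the character count.
std::string narrowFromBlob(const unsigned char* data, std::size_t byteCount) {
    assert(data != nullptr || byteCount == 0);
    return std::string(reinterpret_cast<const char*>(data), byteCount);
}

// UTF-16 payload: the byte count must be divided by the code unit size.
// memcpy is used so the source buffer does not need to be 2-byte aligned.
std::u16string utf16FromBlob(const unsigned char* data, std::size_t byteCount) {
    assert(data != nullptr || byteCount == 0);
    std::u16string out(byteCount / sizeof(char16_t), u'\0');
    if (!out.empty()) {
        std::memcpy(&out[0], data, out.size() * sizeof(char16_t));
    }
    return out;
}
```

Neither helper looks for a terminator, so trailing bytes that happen to follow the payload in memory never end up in the result.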

Why does the obtained MachineGuid look like Korean instead of a GUID?

I created a simple function:
std::wstring GetRegKey(const std::string& location, const std::string& name){
const int valueLength = 10240;
auto platformFlag = KEY_WOW64_64KEY;
HKEY key;
TCHAR value[valueLength];
DWORD bufLen = valueLength*sizeof(TCHAR);
long ret;
ret = RegOpenKeyExA(HKEY_LOCAL_MACHINE, location.c_str(), 0, KEY_READ | platformFlag, &key);
if( ret != ERROR_SUCCESS ){
return std::wstring();
}
ret = RegQueryValueExA(key, name.c_str(), NULL, NULL, (LPBYTE) value, &bufLen);
RegCloseKey(key);
if ( (ret != ERROR_SUCCESS) || (bufLen > valueLength*sizeof(TCHAR)) ){
return std::wstring();
}
std::wstring stringValue(value, (size_t)bufLen - 1);
size_t i = stringValue.length();
while( i > 0 && stringValue[i-1] == '\0' ){
--i;
}
return stringValue;
}
And I call it like auto result = GetRegKey("SOFTWARE\\Microsoft\\Cryptography", "MachineGuid");
yet string looks like
㤴ㄷ㤵戰㌭㉣ⴱ㔴㍥㤭慣ⴹ㍥摢㘵〴㉡ㄵ\0009ca9-e3bd5640a251
not like RegEdit
4971590b-3c21-45e3-9ca9-e3bd5640a251
So I wonder what shall be done to get a correct representation of MachineGuid in C++?
RegQueryValueExA has been an ANSI wrapper around the Unicode version since Windows NT. When running on a Unicode version of Windows, it not only converts lpValueName to an LPCWSTR before calling the Unicode version, but it also converts the lpData retrieved from the registry from Unicode to an ANSI string before returning.
MSDN has the following to say:
If the data has the REG_SZ, REG_MULTI_SZ or REG_EXPAND_SZ type, and
the ANSI version of this function is used (either by explicitly
calling RegQueryValueExA or by not defining UNICODE before including
the Windows.h file), this function converts the stored Unicode string
to an ANSI string before copying it to the buffer pointed to by
lpData.
Your problem is that you are populating the lpData, which holds TCHARs (WCHAR on Unicode versions of Windows) with an ANSI string.
The garbled string that you see is a result of 2 ANSI chars being used to populate a single wchar_t. That explains the Asian characters. The portion that looks like the end of the GUID is because the print function blew past the terminating null since it was only one byte and began printing what is probably a portion of the buffer that was used by RegQueryValueExA before converting to ANSI.
To solve the problem, either stick entirely to Unicode, or to ANSI (if you are brave enough to continue using ANSI in the year 2014), or be very careful about your conversions. I would change GetRegKey to accept wstrings and use RegQueryValueExW instead, but that is a matter of preference and what sort of code you plan on using this in.
(Also, I would recommend you have someone review this code since there are a number of oddities in the error checking, and a hard coded buffer size.)
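The mechanism behind the garbled characters can be reproduced without the registry. The sketch below uses a hypothetical helper, reinterpretAsUtf16LE, that packs each pair of ANSI bytes into one little-endian 16-bit code unit, which is what happens when an ANSI string is stored in a wchar_t buffer on Windows. It assumes a little-endian layout and uses char16_t so that it behaves the same on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reads an ANSI byte string as if it were UTF-16LE: byte i becomes the low
// half and byte i+1 the high half of one code unit. A trailing odd byte is
// ignored.
std::u16string reinterpretAsUtf16LE(const std::string& ansi) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < ansi.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(ansi[i]);
        const unsigned hi = static_cast<unsigned char>(ansi[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    assert(out.size() == ansi.size() / 2);
    return out;
}
```

For the input "4971", the bytes 0x34 0x39 become the code unit 0x3934 and the bytes 0x37 0x31 become 0x3137. Both values lie in East Asian blocks of Unicode, which is consistent with the characters at the start of the garbled output.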

WM_COPYDATA won't deliver my string correctly

I tried to use WM_COPYDATA to send a string from one window to another. The message gets received perfectly by my receiving window, except that the string I send does not stay intact.
Here is my code in the sending application:
HWND wndsend = 0;
wndsend = FindWindowA(0, "Receiving window");
if(wndsend == 0)
{
printf("Couldn't find window.");
}
TCHAR* lpszString = (TCHAR*)"De string is ontvangen";
COPYDATASTRUCT cds;
cds.dwData = 1;
cds.cbData = sizeof(lpszString);
cds.lpData = (TCHAR*)lpszString;
SendMessage(wndsend, WM_COPYDATA, (WPARAM)hwnd, (LPARAM)(LPVOID)&cds);
And this is the code in the receiving application:
case WM_COPYDATA :
COPYDATASTRUCT* pcds;
pcds = (COPYDATASTRUCT*)lParam;
if (pcds->dwData == 1)
{
TCHAR *lpszString;
lpszString = (TCHAR *) (pcds->lpData);
MessageBox(0, lpszString, TEXT("clicked"), MB_OK | MB_ICONINFORMATION);
}
return 0;
Now what happens is that the message box that gets called outputs Chinese letters.
My guess is that I didn't convert it right, or that I don't actually send the string but just the pointer to it, which gives a totally different data in the receiver's window. I don't know how to fix it though.
sizeof(lpszString) is the size of the pointer, but you need the size in bytes of the buffer. You need to use:
sizeof(TCHAR)*(_tcsclen(lpszString)+1)
The code that reads the string should take care not to read off the end of the buffer by reading the value of cbData that is supplied to it.
Remember that sizeof evaluates at compile time. Keep that thought to the front of your mind when you use it and if ever you find yourself using sizeof with something that you know to be dynamic, take a step back.
As an extra, free, piece of advice I suggest that you stop using TCHAR and pick one character set. I would recommend Unicode. So, use wchar_t in place of TCHAR. You are already building a Unicode app.
Also, lpData is a pointer to the actual data, and cbData should be the size of that data in bytes, but you're actually setting it to the size of the pointer. Set it to the length of the string instead (and probably include the terminating 0 character too): strlen(lpszString)+1
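Both halves of this advice can be sketched in portable C++. The helper names payloadBytes and readPayload are hypothetical, and the sketch assumes the string is sent as wchar_t, as recommended above, and that the received buffer is suitably aligned for wchar_t. The first function computes the value that belongs in cbData; the second shows a receiver that never reads past the byte count it was given.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Sender side: number of bytes to put in cbData, terminator included.
std::size_t payloadBytes(const wchar_t* text) {
    assert(text != nullptr);
    return (std::wcslen(text) + 1) * sizeof(wchar_t);
}

// Receiver side: builds a string from at most byteCount bytes and stops
// early at a terminator, so a missing terminator cannot cause an overrun.
std::wstring readPayload(const void* data, std::size_t byteCount) {
    assert(data != nullptr || byteCount == 0);
    const std::size_t maxUnits = byteCount / sizeof(wchar_t);
    const wchar_t* units = static_cast<const wchar_t*>(data);
    std::size_t length = 0;
    while (length < maxUnits && units[length] != L'\0') {
        ++length;
    }
    return std::wstring(units, length);
}
```

With these, the sender would set cds.cbData to payloadBytes(text) and the receiver would call readPayload(pcds->lpData, pcds->cbData). By contrast, sizeof applied to the pointer always yields the pointer size, whatever the string contains.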

Visual Studio multibyte chars to single bytes

Is there a simple way to convert multibyte UTF8 data (from Google Contacts API via https://www.google.com/m8/feeds/) to single bytes? I know the extended ASCII set is non-standard but, for example, my program which will display the info in an MFC CListBox is quite happy to show 'E acute' as 0xE9. I only need it to cope with a few similar European symbols. I've discovered I can convert everything with MultiByteToWideChar() but don't want to have to change lots of functions to accept wide characters if possible.
Thanks.
If you need to convert char * from UTF8 to ANSI, try the following function:
// Change encoding from UTF-8 to ANSI (the system default code page).
// The caller owns the returned buffer and must release it with delete[].
char* change_encoding_from_UTF8_to_ANSI(const char* szU8)
{
// An explicit length is passed, so the returned counts exclude the terminator.
int u8Len = (int)strlen(szU8);
int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, NULL, 0);
wchar_t* wszString = new wchar_t[wcsLen + 1];
::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, wszString, wcsLen);
wszString[wcsLen] = L'\0';
// Characters missing from the ANSI code page are replaced by its default character (typically '?').
int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, NULL, 0, NULL, NULL);
char* szAnsi = new char[ansiLen + 1];
::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, szAnsi, ansiLen, NULL, NULL);
szAnsi[ansiLen] = '\0';
delete[] wszString;
return szAnsi;
}
UTF-8 encodes ASCII characters as the same single bytes, so if the text you receive contains only ASCII characters, AFAIK you can read it directly as ASCII. If it contains non-ASCII characters (any byte >= 0x80), there is no way to express them in ASCII.
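For the narrow case described in the question, a few Western European characters such as E acute at 0xE9, the conversion can also be done by hand. The following sketch uses a hypothetical helper, utf8ToLatin1, and targets ISO-8859-1 (Latin-1). That target is an assumption: the Windows ANSI code page 1252 agrees with Latin-1 for 0xA0 to 0xFF but differs in the 0x80 to 0x9F range. Anything outside Latin-1 is replaced with a fallback character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts UTF-8 to Latin-1 bytes. Code points above U+00FF and malformed
// sequences each produce one fallback character.
std::string utf8ToLatin1(const std::string& utf8, char fallback = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {              // ASCII passes through unchanged
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        const bool hasContinuation = i + 1 < utf8.size() &&
            (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80;
        if ((lead == 0xC2 || lead == 0xC3) && hasContinuation) {
            // Two-byte sequences with these lead bytes cover U+0080..U+00FF.
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            out.push_back(static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F)));
            i += 2;
            continue;
        }
        // Not representable: skip the lead byte and its continuation bytes.
        ++i;
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
            ++i;
        }
        out.push_back(fallback);
    }
    assert(out.size() <= utf8.size());
    return out;
}
```

The UTF-8 bytes C3 A9 come out as the single byte E9, while a character such as the euro sign (E2 82 AC) becomes the fallback.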

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to_executable> <path-to-file> and then call popen() on this command. Right now, both these two components are declared in a char*. I need to read the output of the command so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!
You'll need a UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, and not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>
char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
char result_buffer[MAXPATHLEN];
iconv_t converter = iconv_open("UTF-8", "wchar_t");
char* result = result_buffer;
char* input = (char*)wchar;
size_t output_available_size = sizeof result_buffer;
size_t input_available_size = utf32_bytes;
size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
if (result_code == (size_t)-1)
{
perror("iconv");
iconv_close(converter); // release the descriptor on the error path too
return NULL;
}
iconv_close(converter);
return strdup(result_buffer);
}
int main()
{
wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
char* utf8 = utf8path(hello_world, sizeof hello_world);
printf("%s\n", utf8);
free(utf8);
return 0;
}
The utf8path function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
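The same conversion can be wrapped for C++ callers so that the length is computed inside the function and the descriptor is closed on every path. This is a sketch under assumptions: wideToUtf8 is a hypothetical name, the encoding name "WCHAR_T" is assumed to be accepted by the local iconv (the example above uses the lowercase spelling, and the names a given implementation accepts can differ), and iconv is assumed to take char** arguments as it does on Linux and OS X.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Converts a null-terminated wchar_t string to UTF-8 using iconv.
std::string wideToUtf8(const wchar_t* wide) {
    assert(wide != nullptr);
    iconv_t converter = iconv_open("UTF-8", "WCHAR_T");
    if (converter == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }
    const std::size_t units = std::wcslen(wide);
    std::size_t inputLeft = units * sizeof(wchar_t);  // bytes, not characters
    std::string out(units * 4, '\0');  // at most 4 UTF-8 bytes per code unit
    char* input = reinterpret_cast<char*>(const_cast<wchar_t*>(wide));
    char* output = &out[0];
    std::size_t outputLeft = out.size();
    const std::size_t rc = iconv(converter, &input, &inputLeft, &output, &outputLeft);
    iconv_close(converter);            // closed whether or not iconv failed
    if (rc == (std::size_t)-1) {
        throw std::runtime_error("iconv failed");
    }
    out.resize(out.size() - outputLeft);  // drop the unused tail
    return out;
}
```

The returned string can then be appended to the command line that is passed to popen. On OS X the program may need to be linked with -liconv.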
Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the UTF-16 to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.
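As a rough illustration of rolling your own conversion, the sketch below encodes Unicode code points as UTF-8. The function name encodeUtf8 is hypothetical, and the input is taken as std::u32string on the assumption that wchar_t holds 32-bit code points, as noted above for OS X. A UTF-16 source would first need its surrogate pairs combined, which this sketch does not do, and it does not reject lone surrogate values either.

```cpp
#include <cassert>
#include <string>

// Encodes each code point as 1 to 4 UTF-8 bytes.
std::string encodeUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);  // beyond the Unicode range is a caller error
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A wchar_t path could be copied into a std::u32string element by element before calling it, provided wchar_t really is 32 bits wide on the target platform.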