OPOS BSTR* not converting properly - c++

So after a bunch of research I found that using WideCharToMultiByte worked great for sending data from the Control Object through OPOS to my custom SO. Well, we encountered a bug. In the DirectIO portion, the C# Control Object's signature is DirectIO(int command, ref int data, ref string object);
For the longest time we only needed to send simple commands through DirectIO. For instance, to turn on the LED we would set the data to the duration in milliseconds and the object to the color. When we needed to write data to a tag or a card, the text had to be parsed from a special XML-styled string into a byte array... well, now the need has come about to take a byte array, use ASCII encoding to put that array into string form, and have it written.
The problem is that when I convert this string in my Service Object, the conversion goes wrong: it seems to stop on a null, even though SysStringLen reports the length as 4. For example, the Control Object does this:
int page = 16;
byte[] data = new byte[] { 0x19, 0x00, 0x30, 0x00 };
string pData = System.Text.ASCIIEncoding.ASCII.GetString(data);
msr.DirectIO(902, ref page, ref pData);
The SO sees this
int len = (int)SysStringLen(*pString);
long dataData = *pData;
char* dataObject = new char[1+len];
WideCharToMultiByte(CP_ACP, 0, *pString, -1, dataObject, len, NULL, NULL);
ByteUtil::LogArray("dataObject", (BYTE*)dataObject, len);
yields the output of
dataObject(4)-19:00:00:00
Basically, as soon as the first null character is reached, the rest of the data is lost. Now, if I convert the number to and from a string it works out OK, because I have a ByteUtil function just for that occasion... but it doesn't seem right for me to have to do that... why can't I just have it as a BYTE array?

Very easy, just change this line:
WideCharToMultiByte(CP_ACP, 0, *pString, -1, dataObject, len, NULL, NULL);
into:
WideCharToMultiByte(CP_ACP, 0, *pString, len, dataObject, len, NULL, NULL);
If you set the fourth parameter to -1, WideCharToMultiByte treats the input string as a null-terminated string. BSTRs are null-terminated for compatibility reasons, but you should never treat them as null-terminated because they can contain null characters as part of the string.
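Since the BSTR carries its length in a prefix, the Service Object can also pull the raw bytes out without any conversion call at all. A minimal sketch, assuming the BSTR* parameter is named pString as in the question and the output buffer is already allocated (error handling omitted):

#include <windows.h>
#include <oleauto.h>

// Copy the low byte of each WCHAR in the BSTR into a byte buffer.
// This works here because the C# side widened each ASCII byte to one WCHAR.
void ExtractBytes(BSTR* pString, BYTE* out)
{
    UINT len = SysStringLen(*pString); // length in WCHARs, embedded nulls included
    for (UINT i = 0; i < len; ++i)
        out[i] = static_cast<BYTE>((*pString)[i] & 0xFF);
}

Either way, the key point is the same: always use the length prefix, never the null terminator.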

Related

Conversion from WCHAR to const unsigned char

I am a WiX application packager. I am quite new to C++ and I am stuck on the issue below.
In my code, I am trying to convert a WCHAR string to const unsigned char. I have tried quite a few solutions that I found on the Internet, but I am unable to do it.
WCHAR szabc[888] = L"Example";
const unsigned char* pText = (const unsigned char*)szabc;
For your reference, the value of szabc is hard-coded here, but ideally it is fetched as user input during installation. szabc needs to be converted to const unsigned char* since operator= doesn't seem to work for the conversion.
I am not getting any compilation error, but when I run this code, only the first character of szabc is assigned to pText; I want the whole value of szabc to be assigned to pText.
In the real scenario the value of pText is a user account password, and it will be passed to a method that encrypts it.
Since you neglected to mention your OS, I am assuming it is Windows. You need WideCharToMultiByte or the standard wcstombs functions.
Note that both will determine the target encoding using system settings, so results will vary across computers. If possible, convert to UTF-8 or tell your users to stay away from special characters.
operator= cannot assign a value to a variable of an unrelated type, which is why you cannot assign a WCHAR[] directly to an unsigned char*.
However, the real problem is with how the pointed data is being interpreted. You have a 16-bit Unicode string, and you are trying to pass it to a method that clearly wants a null-terminated 8-bit string instead.
On Windows, WCHAR is 2 bytes, and so the 2nd byte in your Unicode string is 0x00, eg:
WCHAR szabc[] = {L'E', L'x', L'a', L'm', L'p', L'l', L'e', L'\0'};
Has the same memory layout as this:
BYTE szabc[] = {'E', 0x00, 'x', 0x00, 'a', 0x00, 'm', 0x00, 'p', 0x00, 'l', 0x00, 'e', 0x00, '\0', 0x00};
This is why the method appears to see only 1 "character". It stops reading when it encounters the 1st 0x00 byte.
Thus, a simple pointer type-cast will not suffice. You will need to either:
use an 8-bit string to begin with, eg:
CHAR szabc[888] = "Example";
unsigned char* pText = (unsigned char*)szabc;
// use pText as needed...
convert the Unicode data at runtime, using WideCharToMultiByte() or equivalent, eg:
WCHAR szabc[888] = L"Example";
int len = WideCharToMultiByte(CP_ACP, 0, szabc, -1, NULL, 0, NULL, NULL);
CHAR* szConverted = new char[len];
WideCharToMultiByte(CP_ACP, 0, szabc, -1, szConverted, len, NULL, NULL);
unsigned char* pText = (unsigned char*)szConverted;
// use pText as needed...
delete[] szConverted;
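If the manual new/delete bookkeeping feels error-prone, the same conversion can be wrapped in a std::string so the buffer cleans itself up automatically. A minimal sketch under the same CP_ACP assumption (ToAnsi is a hypothetical helper name; error handling reduced to a length check):

#include <windows.h>
#include <string>

// Convert a null-terminated wide string to the system ANSI code page.
std::string ToAnsi(const WCHAR* wide)
{
    // First call asks for the required size in bytes, terminator included.
    int len = WideCharToMultiByte(CP_ACP, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string();
    std::string out(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide, -1, &out[0], len, NULL, NULL);
    out.resize(len - 1); // drop the terminating null the API wrote
    return out;
}

// usage:
WCHAR szabc[888] = L"Example";
std::string ansi = ToAnsi(szabc);
const unsigned char* pText = reinterpret_cast<const unsigned char*>(ansi.c_str());
// pText stays valid for as long as ansi is in scope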

WideCharToMultiByte - required size and bytes written are different for Shift-JIS codepage

I've got a Unicode string containing four Japanese characters, and I'm using WideCharToMultiByte to convert it to a multi-byte string with the Shift-JIS code page, 932. In order to get the size of the required buffer, I first call WideCharToMultiByte with the cbMultiByte parameter set to 0. This returns 9 as expected, but when I actually call WideCharToMultiByte again to do the conversion, it reports the number of bytes written as 13. An example is below; I'm currently hard-coding my buffer size to 100:
BSTR value = SysAllocString(L"日経先物");
char *buffer = new char[100];
int sizeRequired = WideCharToMultiByte(932, 0, value, -1, NULL, 0, NULL, NULL);
// sizeRequired is 9 as expected
int bytesWritten = WideCharToMultiByte(932, 0, value, sizeRequired, buffer, 100, NULL, NULL);
// bytesWritten is 13
buffer[8] contains the string terminator \0 as expected, but buffer[9] through buffer[12] each contain the byte 63.
So if I set the size of my buffer to sizeRequired, it's too small and the second call to WideCharToMultiByte fails. Does anyone know why four extra bytes are written, each with a value of 63?
You are passing the wrong arguments to WideCharToMultiByte in your second call (the required size of the destination as the length of the source). You need to change
int bytesWritten = WideCharToMultiByte(932, 0, value, sizeRequired, buffer, 100, NULL, NULL);
to
int bytesWritten = WideCharToMultiByte(932, 0, value, -1, buffer, sizeRequired, NULL, NULL);
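Incidentally, the four extra bytes of value 63 are almost certainly '?' characters (ASCII 63): passing sizeRequired (9) as the source length makes the function read nine WCHARs, i.e. four past the terminator, and the garbage it finds there gets replaced with the default character. The full query-allocate-convert pattern might look like this sketch (error handling omitted):

#include <windows.h>
#include <oleauto.h>

BSTR value = SysAllocString(L"日経先物");

// First call: source length -1 (null-terminated), no output buffer.
// The return value is the required size in bytes, terminator included.
int sizeRequired = WideCharToMultiByte(932, 0, value, -1, NULL, 0, NULL, NULL);

char* buffer = new char[sizeRequired];

// Second call: same source length, buffer sized from the first call.
int bytesWritten = WideCharToMultiByte(932, 0, value, -1, buffer, sizeRequired, NULL, NULL);
// bytesWritten == sizeRequired == 9

delete[] buffer;
SysFreeString(value);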

Number of bytes of CString in C++

I have a Unicode string stored in a CString and I need to know the number of bytes this string takes in UTF-8 encoding. I know CString has a method GetLength(), but that returns the number of characters, not bytes.
I tried (beside other things) converting to char array, but I get (logically, I guess) only array of wchar_t, so this doesn't solve my problem.
To be clear about my goal: for the input "aaa" I want 3 as output (since "a" takes one byte in UTF-8), but for the input "āaa" I'd like to see 4 (since ā is a two-byte character).
I think this must be quite a common request, but even after 1.5 hours of searching and experimenting, I couldn't find the correct solution.
I have very little experience with Windows programming, so maybe I left out some crucial information. If you feel like that, please let me know, I'll add any information you request.
As your CString contains a series of wchar_t, you can just use WideCharToMultiByte with the output code page set to CP_UTF8. The function returns the number of bytes written to the output buffer, i.e. the length of the UTF-8 encoded string:
LPCWSTR instr = myCString; // your CString; it converts implicitly to LPCWSTR
char outstr[MAX_OUTSTR_SIZE];
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, outstr, MAX_OUTSTR_SIZE, NULL, NULL);
If you don't need the output string, you can simply set the output buffer size to 0
cbMultiByte
Size, in bytes, of the buffer indicated by lpMultiByteStr. If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.
In that case the function will return the number of bytes in UTF-8 without really outputting anything
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, NULL, 0, NULL, NULL);
If your CString is really CStringA, i.e. _UNICODE is not defined, then you need to use MultiByteToWideChar to convert the string to UTF-16 and then convert from UTF-16 to UTF-8 with WideCharToMultiByte. See How do I convert an ANSI string directly to UTF-8? But new code should never be compiled without Unicode support anyway.
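Note that with a source length of -1 the returned count includes the terminating null byte, so it would report 4 for "aaa". To get exactly the counts the question asks for (3 for "aaa", 4 for "āaa"), pass GetLength() as the source length so the terminator is excluded. A minimal sketch, assuming a Unicode build where CString is CStringW (Utf8ByteCount is a hypothetical helper):

#include <atlstr.h> // CString
#include <windows.h>

int Utf8ByteCount(const CString& s)
{
    // Passing the exact character count instead of -1 excludes the
    // null terminator from the returned byte count.
    return WideCharToMultiByte(CP_UTF8, 0, s, s.GetLength(), NULL, 0, NULL, NULL);
}

// Utf8ByteCount(CString(L"aaa")) == 3
// Utf8ByteCount(CString(L"āaa")) == 4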

LPWSTR to wstring c++

I would like to read UTF-8 text from a .dll string table.
something like this
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
and after that I would like to convert the LPWSTR nnW to std::wstring nnWstring.
I tried in this way:
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
const int length = MultiByteToWideChar(CP_UTF8,
                                       0,           // no flags required
                                       (LPCSTR)nnW,
                                       -1,          // automatically determine length
                                       NULL,
                                       0);
std::wstring nnWstring(length, L'\0');
if (!MultiByteToWideChar(CP_UTF8,
                         0,
                         (LPCSTR)nnW,
                         -1,
                         &nnWstring[0],
                         length))
MessageBoxW(NULL, (LPCWSTR)nnWstring.c_str(), L"wstring", MB_OK | MB_ICONERROR);
After that, the MessageBoxW only shows the first letter.
No conversion or copying needed.
std::wstring nnWString(MAX_PATH, 0);
nnWString.resize(LoadStringW(hMod, id, &nnWString[0], nnWString.size()));
Note: Your original code causes undefined behavior, because it writes using an uninitialized pointer. Surely not what you wanted.
Here's another variation:
http://msmvps.com/blogs/gdicanio/archive/2010/01/05/stl-strings-loading-from-resources.aspx
I would like to read UTF-8 text from a .dll string table. Something like this
Generally, string tables in Windows are UTF-16. You're trying to put UTF-8 data into one. The UTF-8 data is being treated like "extended" ASCII, so each byte is being expanded to two bytes with zero bytes between them.
You should probably put UTF-16 data in the string table directly.
If you must store UTF-8 data in the resources, you can put it into an RCDATA resource and use the lower-level resource functions to get the data out. Then you'll have to convert from UTF-8 to UTF-16 to store it in a wstring.
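A rough sketch of the RCDATA approach, assuming the UTF-8 text is stored under an integer resource id (error handling omitted):

#include <windows.h>
#include <string>

std::wstring LoadUtf8Resource(HMODULE hMod, int id)
{
    // Locate and load the raw RCDATA bytes from the module.
    HRSRC res = FindResourceW(hMod, MAKEINTRESOURCEW(id), (LPCWSTR)RT_RCDATA);
    HGLOBAL handle = LoadResource(hMod, res);
    const char* bytes = static_cast<const char*>(LockResource(handle));
    DWORD size = SizeofResource(hMod, res);

    // Convert the UTF-8 bytes to UTF-16 for use as a wstring.
    int len = MultiByteToWideChar(CP_UTF8, 0, bytes, (int)size, NULL, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, bytes, (int)size, &out[0], len);
    return out;
}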

c++ string to utf8 valid string using utf8proc

I have a std::string output. Using utf8proc, I would like to transform it into a valid UTF-8 string.
http://www.public-software-group.org/utf8proc-documentation
typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!
So first, how do I add an extra byte at the end? Then how do I convert from std::string to int32_t *buffer?
This does not work:
std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " "; //add an extra byte??
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());
You very likely don't actually want utf8proc_reencode() - that function takes a valid UTF-32 buffer and turns it into a valid UTF-8 buffer, but since you say you don't know what encoding your data is in, you can't use that function.
So, first you need to figure out what encoding your data is actually in. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8 with utf8::is_valid(g.begin(), g.end()). If that's true, you're done!
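For instance, a minimal validity check with utfcpp might look like this (assuming the single-header utf8.h is on the include path; output() is the function from the question):

#include <string>
#include <utf8.h> // from http://utfcpp.sourceforge.net/

std::string g = output();
if (utf8::is_valid(g.begin(), g.end()))
{
    // already valid UTF-8 - nothing to do
}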
If false, things get complicated...but ICU ( http://icu-project.org/ ) can help you; see http://userguide.icu-project.org/conversion/detection
Once you somewhat reliably know what encoding your data is in, ICU can help again with getting it to UTF-8. For example, assuming your source data g is in ISO-8859-1:
UErrorCode err = U_ZERO_ERROR; // check this after every call...
// CONVERT FROM ISO-8859-1 TO UChar
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2); // *2 is usually more than enough
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err);
converted.resize(conv_len);
ucnv_close(conv_from);
// CONVERT FROM UChar TO UTF-8
g.resize(converted.size()*4);
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], converted.size(), &err);
g.resize(u8_len);
ucnv_close(conv_u8);
after which your g is now holding UTF-8 data.