Converting contents of a byte array to wchar_t* - c++

I seem to be having an issue converting a byte array (containing the text from a word document) to a LPTSTR (wchar_t *) object. Every time the code executes, I am getting a bunch of unwanted Unicode characters returned.
I figure it is because I am not making the proper calls somewhere, or not using the variables properly, but not quite sure how to approach this. Hopefully someone here can guide me in the right direction.
The first thing that happens is that we call into C# code to open up Microsoft Word and convert the text in the document into a byte array.
byte document __gc[];
document = word->ConvertToArray(filename);
The contents of document are as follows:
{84, 101, 115, 116, 32, 68, 111, 99, 117, 109, 101, 110, 116, 13, 10}
Which ends up being the following string: "Test Document".
Our next step is to allocate the memory to store the byte array in an LPTSTR variable:
byte __pin * value;
value = &document[0];
LPTSTR image;
image = (LPTSTR)malloc( document->Length + 1 );
Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:
췍췍췍췍췍췍췍췍﷽﷽����˿於潁
And then we do a memcpy to transfer over all of the data
memcpy(image,value,document->Length);
Which just causes more unwanted Unicode characters to appear:
敔瑳䐠捯浵湥൴촊﷽﷽����˿於潁
I figure the issue that we are having is either related to how we are storing the values in the byte array, or possibly when we are copying the data from the byte array to the LPTSTR variable. Any help with explaining what I'm doing wrong, or anything to point me in the right direction will be greatly appreciated.

First you should learn something about text data and how it's represented. A reference that will get you started there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
byte is just a typedef or something for char or unsigned char. So the byte array is using some char encoding for the string. You need to actually convert from that encoding, whatever it is, into UTF-16 for Windows' wchar_t. Here's the typical method recommended for doing such conversions on Windows:
int output_size = MultiByteToWideChar(CP_ACP,0,value,-1,NULL,0);
assert(0<output_size);
wchar_t *converted_buf = new wchar_t[output_size];
int size = MultiByteToWideChar(CP_ACP,0,value,-1,converted_buf,output_size);
assert(output_size==size);
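We call MultiByteToWideChar() twice: once to find out how large a buffer is needed to hold the result of the conversion, and a second time, passing in the buffer we allocated, to do the actual conversion. Passing -1 as the input length tells the function that the input is null-terminated, so value must point at a zero-terminated string. The byte array shown in the question has no trailing zero byte, so in that case pass document->Length instead, allocate one extra wchar_t, and append the terminating L'\0' yourself.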
CP_ACP specifies the source encoding, and you'll need to check the API documentation to figure out what that value really should be. CP_ACP stands for 'codepage: Ansi codepage', which is Microsoft's way of saying 'the encoding set for "non-Unicode" programs.' The API may be using something else, like CP_UTF8 (we can hope) or 1252 or something.
See the rest of the MultiByteToWideChar documentation to figure out the other arguments.
Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:
When you call malloc() the memory given to you is uninitialized and just contains garbage. The values you see before initializing it don't matter and you simply shouldn't use that data. The only data that matters is what you fill the buffer with. The MultiByteToWideChar() code above will also automatically null terminate the string so you won't see garbage in unused buffer space (and the method we use of allocating the buffer will not leave any extra space).
The above code is not actually very good C++ style. It's just typical usage of the C-style API provided by Win32. The way I prefer to do conversions (if I'm forced to) is more like:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert; // converter object saved somewhere
std::wstring output = convert.from_bytes(value);
(Assuming the char encoding being used is UTF-8. You'll have to use a different codecvt facet for any other encoding.)
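To see where characters such as 敔 come from, here is a small portable sketch. The function names are illustrative only and not part of any API. It assumes little-endian byte order, as on Windows, and plain 7-bit ASCII input; for any other encoding a real conversion such as MultiByteToWideChar() is still needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Roughly what the memcpy in the question does on a little-endian machine:
// every PAIR of bytes ends up being read as ONE 16-bit code unit.
std::vector<std::uint16_t> reinterpret_as_utf16(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}

// What is actually wanted for ASCII input: each byte becomes its own
// wide character. Assumption: every byte is below 0x80.
std::wstring widen_ascii(const std::vector<std::uint8_t>& bytes) {
    std::wstring wide;
    wide.reserve(bytes.size());
    for (std::uint8_t b : bytes) {
        assert(b < 0x80 && "only 7-bit ASCII maps 1:1 to wide characters");
        wide.push_back(static_cast<wchar_t>(b));
    }
    return wide;
}
```

For the first four bytes of the document (84, 101, 115, 116), the pairing yields the code units 0x6554 and 0x7473, which are the first two characters of the garbage shown in the question, whereas widening yields "Test".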

Related

Have a PCWSTR and need it to be a WCHAR[]

I am re-writing a C++ method from some code I downloaded. The method originally took a PCWSTR as a parameter and then prompted the user to enter a file name. I modified the method to take two parameters (both PCWSTR) and not to prompt the user. I am already generating the list of files somewhere else. I am attempting to call my new (modified) method with both parameters from my method that iterates the list of files.
The original method prompted the user for input using a StringCbGetsW call, like this:
HRESULT tst=S_OK; //these are at the top of the method
WCHAR fname[85] = {0}; //these are at the top of the method
tst = StringCbGetsW(fname,sizeof(fname));
The wchar fname gets passed to another iteration method further down. When I look at that method, it says it's a LPCWSTR type; I'm assuming it can take the WCHAR instead.
But what it can't do is take the PCWSTR that the method got handed. My ultimate goal is not to prompt the user for the file name, and instead to use the file name that was iterated earlier in another method.
tl;dr. I have a PCWSTR and it needs to get converted to a WCHAR. I don't know what a WCHAR [] is or how to do anything with it. Including to try to do a printf to see what it is.
PS...I know there are easier ways to move and copy around files, there is a reason I'm attempting to make this work using a program.
First, let's clarify some Windows-specific types.
WCHAR is a typedef for wchar_t.
On Windows with Microsoft Visual C++, it's a 16-bit character type (that can be used for Unicode UTF-16 strings).
PCWSTR and LPCWSTR are two different names for the same thing: they are basically typedefs for const wchar_t*.
The initial L in LPCWSTR is some legacy prefix that, read with the following P, stands for "long pointer". I've never programmed Windows in the 16-bit era (I started with Windows 95 and Win32), but my understanding is that in 16-bit Windows there were something like near pointers and far, or long pointers. Now we have just one type of pointers, so the L prefix can be omitted.
The P stands for "pointer".
The C stands for "constant".
The W stands for WCHAR/wchar_t, and last but not least, the STR part stands for "string".
So, decoding this kind of "Hungarian Notation", PCWSTR means const wchar_t*.
Basically, it's a pointer to a read-only NUL-terminated wchar_t Unicode UTF-16 string.
Is this information enough for you to solve your problem?
If you have a wchar_t string buffer, and a function that expects a PCWSTR, you can just pass the name of the buffer (corresponding to the address of its first character) to the function:
WCHAR buffer[100];
DoSomething(buffer, ...); // DoSomething(PCWSTR ....)
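Both directions can be sketched in portable C++. Here do_something, call_with_buffer, and copy_to_buffer are hypothetical names made up for illustration; on Windows the copy step would more typically be done with a function such as StringCchCopy().

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical stand-in for an API that takes a PCWSTR (const wchar_t*).
std::size_t do_something(const wchar_t* text) {
    assert(text != nullptr);
    return std::wcslen(text);
}

// Buffer -> PCWSTR: the array name decays to a pointer to its first element.
std::size_t call_with_buffer() {
    wchar_t buffer[100] = L"report.txt";
    return do_something(buffer);
}

// PCWSTR -> buffer: copy the read-only string into a writable array.
// cchDst is the capacity of dst in wchar_ts, including the terminator.
bool copy_to_buffer(wchar_t* dst, std::size_t cchDst, const wchar_t* src) {
    if (dst == nullptr || src == nullptr || cchDst == 0) return false;
    const std::size_t len = std::wcslen(src);
    if (len >= cchDst) return false;      // would not fit with the terminator
    std::wmemcpy(dst, src, len + 1);      // +1 copies the terminating L'\0'
    return true;
}
```

With a declaration like WCHAR fname[85], a call such as copy_to_buffer(fname, 85, someName) fills the array from a read-only string, and fname can then be passed anywhere a PCWSTR or LPCWSTR is expected.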
Sometimes - typically for output string parameters - you may also want to specify the size (i.e. "capacity") of the destination string buffer.
If this size is expressed using a count in characters (in this case, in wchar_ts), then the usual Win32 Hungarian Notation is cch ("count of characters"); else, if you want the size expressed in bytes, then the usual prefix is cb ("count of bytes").
So, if you have a function like StringCchCopy(), then from the Cch part you know the size is expressed in characters (wchar_ts).
Note that you can use _countof() to get the size of a buffer in wchar_ts.
e.g. in the above code snippet, _countof(buffer) == 100, since buffer is made of 100 wchar_ts; instead, sizeof(buffer) == 200, since each wchar_t is 2 bytes == 16 bits in size, so the total buffer size in bytes is 100 [wchar_t] * 2 [bytes/wchar_t] = 200 [bytes].
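The cch/cb distinction can be checked with a small portable sketch. The assumptions: char16_t stands in for the 16-bit WCHAR of Windows (wchar_t is wider on some other platforms), and count_of is a hand-written stand-in for the MSVC-specific _countof().

```cpp
#include <cassert>
#include <cstddef>

// Portable stand-in for _countof: number of elements in a built-in array.
template <typename T, std::size_t N>
constexpr std::size_t count_of(const T (&)[N]) noexcept {
    return N;
}

struct BufferSize {
    std::size_t cch;  // count of characters (elements)
    std::size_t cb;   // count of bytes
};

BufferSize measure_buffer() {
    char16_t buffer[100] = {};
    BufferSize size{count_of(buffer), sizeof(buffer)};
    // The two measures always differ by the size of one element.
    assert(size.cb == size.cch * sizeof(char16_t));
    return size;
}
```

measure_buffer() reports cch as 100 and cb as 200, matching the arithmetic above. Passing cb where a function expects cch tells it the buffer is twice as large as it really is.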

How to read a utf-8 string from an xml using rapidxml?

My question is the same as this unanswered question:
How to read Unicode XML values with rapidxml
But the content of my XML is encoded in UTF-8. I am a newbie to MS Visual Studio, C++.
My question is, How do we read an UTF-8 string into a wchar_t type string ?
Say, I define a structure like this,
typedef struct{
vector<int> stroke_labels;
int stroke_count;
wchar_t* uni_val;
}WORD_DETAIL;
and when I read the value from the XML I use:
WORD_DETAIL this_detail;
this_detail.uni_val=curr_word->first_node("labelDesc")->first_node("annotationDetails")->first_node("codeSequence")->value();
But the utf-8 strings that are being stored are not as expected. They are corrupted characters.
My questions are:
How can I use rapidxml to read Unicode/Utf-8 values ?
Are there any more simple xml parsers that do the same thing ?
Any example code will be deeply appreciated.
In section 2.1 here it is mentioned
"Note that RapidXml performs no decoding - strings returned by name() and value() functions will contain text encoded using the same encoding as source file."
If the encoding of my XML is UTF-8 , what is the best way to get the return value of ->value() function ?
Thanks in advance.
Remember that RapidXML is an 'in-situ' parser: It parses the XML and modifies the content by adding null terminators in the correct places (and other things).
So the value() function is really just returning a char * pointer into your original data. If that's UTF-8, then RapidXML returns a pointer to a UTF-8 character string. In other words, you're already doing what you asked for in the question title.
But, in the code snippet you posted you want to store a wchar_t* in a struct. First off, I recommend you don't do that at all, because of the memory ownership issues. Remember, you're meant to be using C++, not C. And if you really want to store a raw pointer, why not the UTF-8 one you already have? http://www.utf8everywhere.org/
But, because it's Windows, there's a (remote) chance you'll need to pass a wide char array to an API function. If so, you will need to convert UTF-8 to wide chars, using the OS function MultiByteToWideChar:
// Get the UTF-8
char *str = xml->first_node("codeSequence")->value();
// work out the size
int size = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
// allocate a vector for that size
std::vector<wchar_t> wide(size);
// do the conversion
MultiByteToWideChar(CP_UTF8, 0, str, -1, &wide[0], size);
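For readers without the Windows API at hand, what that conversion does can be sketched portably. This is a hand-written decoder, not what MultiByteToWideChar() does internally; utf8_to_utf16 is a hypothetical helper name, and it assumes the input really is UTF-8. Returning a std::u16string also sidesteps the ownership problem of keeping a raw pointer in the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 and re-encode it as UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t kMinimum[] = {0x0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogate code points and out-of-range values.
        if (cp < kMinimum[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    assert(i == in.size());
    return out;
}
```

With RapidXML, the char* returned by value() could be passed straight into such a function, and the struct member would then be a std::u16string (or std::wstring on Windows) instead of a wchar_t*.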

Converting jbyteArray to a character array, and then printing to console

I am writing a JNI program where my .cpp file gets a jbyteArray and I want to be able to print the jbyteArray with printf. For that to happen, I believe I have to convert the jbyteArray to a character array.
For background knowledge, the java side of my JNI converts a String to a byteArray, and then that byteArray is passed in as an argument to my JNI function.
What I've done so far prints out the String correctly, but it is followed by junk characters, and I do not know how to get rid of these/if I am doing something wrong.
Here is what the String is:
dsa
and what prints to console:
dsa,�
The junk characters change depending on what the String is.
Here is the part of the code that is relevant:
.java file:
public class tcr extends javax.swing.JFrame{
static{
System.loadLibrary("tcr");
}
public native int print(byte file1[]);
.....
String filex1 = data1TextField.getText();//gets a filepath in the form of a String from a GUI jtextfield.
byte file1[]= filex1.getBytes();//convert file path from string to byte array
tcr t = new tcr();
t.print(file1);
}
.cpp code:
JNIEXPORT jint JNICALL Java_tcr_print(JNIEnv *env, jobject thisobj, jbyteArray file1){
jboolean isCopy;
jbyte* a = env->GetByteArrayElements(file1,&isCopy);
char* b;
b = (char*)a;
printf("%s\n",b);
}
Any help would be appreciated.
Look what you are doing:
jbyte* a = env->GetByteArrayElements(file1,&isCopy);
a now points to a memory address where the byte contents of the string are stored. Let's assume that the file contains the string "Hello world". In UTF-8 encoding, that would be:
48 65 6c 6c 6f 20 77 6f 72 6c 64
char* b = (char*)a;
b now points to that memory region. It's a char pointer, so you probably want to use it as a C string. However, that won't work. C strings are defined as some bytes, ending with a zero byte. Now look up there and you'll see that there is no zero byte at the end of this string.
printf("%s\n",b);
Here it is. You are passing the char pointer to printf as %s which tells printf that it's a C string. However, it isn't a C string but printf still tries to print all characters until it reaches a zero byte. So what you see after dsa are actually bytes from your memory after the end of the byte array until there is (by coincidence) a zero byte. You can fix this by copying the bytes to a buffer that is one byte longer than the byte array and then setting the last element to zero.
UPDATE:
You can create the bigger buffer and append the null byte like this:
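jsize textLength = env->GetArrayLength(file1); // ask the JVM for the length; strlen() would read past the unterminated array
char* b = (char*)malloc(textLength + 1);       // one extra byte for the terminator
memcpy(b, a, textLength);
b[textLength] = '\0';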
Now b is a valid null-terminated C string. Also, don't forget the call to ReleaseByteArrayElements. You can do that right after the memcpy call.
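The same idea can be sketched without the JNI headers. Here std::int8_t stands in for jbyte and bytes_to_string is a hypothetical helper; in real JNI code the length would come from env->GetArrayLength(file1) rather than from the data itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Copy a length-delimited byte array into a std::string, which owns its
// storage and guarantees a terminating '\0' behind c_str().
std::string bytes_to_string(const std::int8_t* data, std::size_t length) {
    if (length == 0) return std::string();
    assert(data != nullptr);
    return std::string(reinterpret_cast<const char*>(data), length);
}
```

Printing then becomes printf("%s\n", bytes_to_string(a, length).c_str()), and no manual malloc()/free() pair is needed.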
A jbyteArray is actually a very good way to pass a Java String through JNI. It allows you to easily convert the string into the character set and encoding needed by the libraries and files/devices you are using on the C++ side.
Be sure you understand "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Java String uses the Unicode character set and UTF-16 encoding (with a platform-dependent byte order).
String.getBytes() converts to the "platform's default charset". So, it is making an assumption about the character set and encoding you need, and what to do about characters that are not in the target character set. You can use other Java String.getBytes overloads or the Charset methods if you want to control these things explicitly.
In deciding which character set and encoding to use, consider that Unicode has been used for a couple decades as the primary string type in Java, .NET, VB, ...; in compiler source files for Java, ...; generally in the WWW. Of course, you might be limited by the things you want to interoperate with.
Now, it seems the problem you are facing is either that the target character set is missing characters that your Java String has and a substitute is being used, or the console you are using isn't displaying them properly.
The console (or any app with a UI), obviously, has to pick a typeface with which to render the characters. Typefaces generally don't support the million codepoints available in Unicode. You may be able to change the configuration of your console (or use another). For example, in Windows, you can use cmd.exe or ps (Windows PowerShell). You can change the font in Cmd.exe windows and use chcp to change the character set.
UPDATE:
As #main-- points out, if you use a function that expects a terminator appended to the string then you have to provide it, usually by copying the array since the JVM retains ownership of the array. This is the actual cause of the behavior in this case. But all of the above is relevant, too.

wchar_t* to char* conversion problems

I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume that wchar string is "New Text File.txt"
In the Visual Studio debugger, hovering over the variable shows "N" and some unknown Chinese letters, though in the Watch window the string is represented correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:
if (*pwcs > 255) /* validate high byte */
{
errno = EILSEQ;
return (size_t)-1; /* error */
}
It's not thrown up as exception.
What can be the problem?
In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
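The same two-call pattern can be written so that the buffers clean up after themselves. This is a sketch, not the only way to do it: narrow_file_name is a hypothetical helper, the conversion still depends on the current C locale, and the error check on the first call covers the (size_t)-1 return that the snippet above does not test for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// name is NOT null-terminated; lengthInBytes is its size in bytes,
// like FILE_NOTIFY_INFORMATION::FileNameLength.
std::string narrow_file_name(const wchar_t* name, std::size_t lengthInBytes) {
    assert(name != nullptr);
    // std::wstring makes a null-terminated copy of exactly the valid part.
    const std::wstring wide(name, lengthInBytes / sizeof(wchar_t));

    // First call: a null destination asks only for the required size.
    const std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("file name not representable in the current locale");

    // Second call: do the conversion, leaving room for the terminator.
    std::string narrow(needed + 1, '\0');
    const std::size_t written = std::wcstombs(&narrow[0], wide.c_str(), narrow.size());
    if (written == static_cast<std::size_t>(-1))
        throw std::runtime_error("conversion failed");
    narrow.resize(written);  // drop the extra terminator byte
    return narrow;
}
```

A call would look like narrow_file_name(fileInfo.FileName, fileInfo.FileNameLength); the result is destroyed automatically when it goes out of scope.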
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
DWORD NextEntryOffset;
DWORD Action;
DWORD FileNameLength;
WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
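The size arithmetic can be checked with a portable sketch. NotifyRecord below is a hypothetical mirror of FILE_NOTIFY_INFORMATION built from fixed-width types (char16_t standing in for WCHAR), and make_record and read_name are illustrative helpers, not Win32 functions. It assumes the usual 4-byte alignment for 32-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct NotifyRecord {                 // mirrors the layout quoted above
    std::uint32_t NextEntryOffset;
    std::uint32_t Action;
    std::uint32_t FileNameLength;     // in bytes, not characters
    char16_t      FileName[1];        // really variable length
};

// How many characters a plain struct assignment carries along.
std::size_t chars_copied_by_assignment() {
    return (sizeof(NotifyRecord) - offsetof(NotifyRecord, FileName)) / sizeof(char16_t);
}

// Build a raw record the way the OS lays it out: header, then the name.
std::vector<unsigned char> make_record(const std::u16string& name) {
    const std::size_t nameBytes = name.size() * sizeof(char16_t);
    std::vector<unsigned char> raw(offsetof(NotifyRecord, FileName) + nameBytes);
    NotifyRecord header{};
    header.FileNameLength = static_cast<std::uint32_t>(nameBytes);
    std::memcpy(raw.data(), &header, offsetof(NotifyRecord, FileName));
    std::memcpy(raw.data() + offsetof(NotifyRecord, FileName), name.data(), nameBytes);
    return raw;
}

// Read the whole name by honouring FileNameLength instead of sizeof.
std::u16string read_name(const unsigned char* raw) {
    std::uint32_t nameBytes = 0;
    std::memcpy(&nameBytes, raw + offsetof(NotifyRecord, FileNameLength), sizeof nameBytes);
    assert(nameBytes % sizeof(char16_t) == 0);
    std::u16string name(nameBytes / sizeof(char16_t), u'\0');
    std::memcpy(&name[0], raw + offsetof(NotifyRecord, FileName), nameBytes);
    return name;
}
```

Under these assumptions sizeof(NotifyRecord) is 16 and FileName starts at offset 12, so an assignment carries exactly two characters, whereas read_name() recovers the full name from the raw buffer.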
My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);
The error you get says it all: it found a character that it cannot convert to a multibyte character (because that character has no multibyte representation). From the documentation:
If wcstombs encounters a wide character it cannot convert to a
multibyte character, it returns –1 cast to type size_t and sets errno
to EILSEQ
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.

MS VC++ Convert a byte array to a BSTR?

I have a string that starts out in a .Net application, is encrypted and stored in AD. It's then picked up by a native C++ app and decrypted to produce an array of bytes
e.g "ABCDEF" becomes 00,41,00,42,00,43,00,44,00,45 once it has been decrypted at the C++ end.
I need to take this byte array and convert it to the BSTR "ABCDEF" so that I can use it elsewhere, and I can't find a way to accomplish this last step.
Can anybody help?
If you really have an array of arbitrary bytes, use SysAllocStringByteLen. But it looks like, despite being in a byte array, your data is really a UTF-16-encoded Unicode string, so in that case, you're probably better off using SysAllocStringLen instead. Pass the byte-array pointer to the function (type-cast to OLECHAR*), and the characters will be copied into the new string for you, along with an additional null character at the end.
The "decrypted string" is just a Unicode string - latin characters contain the first byte equal to null when represented in Unicode. So you don't need any real conversion, just cook a BSTR out of that buffer.
Knowing the number of Unicode characters - it will be half the length of the buffer - call SysAllocStringLen() to allocate a long enough null-terminated uninitialized BSTR. Then copy your array onto the allocated string with memcpy(). Alternatively you can call SysAllocStringLen() and pass it the byte buffer so that it does the copy for you and skip the memcpy(). Don't forget to call SysFreeString() when you no longer need the BSTR.
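The byte-to-character bookkeeping can be sketched portably. Here utf16_from_bytes is a hypothetical helper and std::u16string stands in for the BSTR. One assumption worth checking against the real data: Windows expects UTF-16 in little-endian order (41 00 for "A"), so if the array really starts with 00 41 as written in the question, the bytes of each pair need swapping before the buffer is handed to SysAllocStringLen().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assemble UTF-16 code units from raw bytes; the character count is
// half the byte count.
std::u16string utf16_from_bytes(const std::vector<std::uint8_t>& bytes, bool bigEndian) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must have an even byte count");
    std::u16string text;
    text.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const std::uint8_t hi = bigEndian ? bytes[i] : bytes[i + 1];
        const std::uint8_t lo = bigEndian ? bytes[i + 1] : bytes[i];
        text.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    assert(text.size() == bytes.size() / 2);
    return text;
}
```

The length passed to SysAllocStringLen() is the same bytes.size() / 2 figure, because that function counts characters rather than bytes.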