MS VC++ Convert a byte array to a BSTR? - c++

I have a string that starts out in a .NET application, is encrypted, and stored in AD. It's then picked up by a native C++ app and decrypted to produce an array of bytes.
e.g. "ABCDEF" becomes 00,41,00,42,00,43,00,44,00,45,00,46 once it has been decrypted at the C++ end.
I need to take this byte array and convert it to the BSTR "ABCDEF" so that I can use it elsewhere, and I can't find a way to accomplish this last step.
Can anybody help?

If you really have an array of arbitrary bytes, use SysAllocStringByteLen. But it looks like, despite being in a byte array, your data is really a UTF-16-encoded Unicode string, so in that case, you're probably better off using SysAllocStringLen instead. Pass the byte-array pointer to the function (type-cast to OLECHAR*), and the characters will be copied into the new string for you, along with an additional null character at the end.

The "decrypted string" is just a Unicode string - latin characters contain the first byte equal to null when represented in Unicode. So you don't need any real conversion, just cook a BSTR out of that buffer.
Knowing the number of Unicode characters - it will be half the byte length of the buffer - call SysAllocStringLen() with a NULL first argument to allocate a long-enough, null-terminated, uninitialized BSTR. Then copy your array into the allocated string with memcpy(). Alternatively, pass the byte buffer itself to SysAllocStringLen() so that it does the copy for you and you can skip the memcpy(). Don't forget to call SysFreeString() when you no longer need the BSTR.
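A minimal sketch of the second variant, assuming the decrypted bytes sit in a buffer named decryptedBytes holding byteCount bytes of UTF-16 data (both names are placeholders):
UINT charCount = byteCount / sizeof(OLECHAR);   // two bytes per UTF-16 code unit
BSTR bstr = SysAllocStringLen(reinterpret_cast<const OLECHAR*>(decryptedBytes), charCount);
if (bstr != NULL)
{
    // ... use the BSTR elsewhere ...
    SysFreeString(bstr);                        // release it when finished
}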

Related

Erase unwanted characters

We have created a char array with a fixed length. Now, we write a word or a sentence inside that array. However, the length of this word or sentence is shorter than the length of the char array, so when we print the message with the printf function, a number of garbage characters are also printed. We would like to erase all these characters, even though the length of the written message varies.
Thank you!
C strings are terminated by a NUL byte ('\0'). If you don't have this terminator then printf doesn't know that your string has ended. The solution is to put a \0 after your word in the array.
Note: learn to use std::string which manages this for you.
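For example, a minimal sketch (buf and the word written into it are placeholders):
char buf[32];
size_t n = 5;                 // number of characters actually written
memcpy(buf, "hello", n);      // fills buf[0..4] but adds no terminator
buf[n] = '\0';                // terminate explicitly so printf knows where to stop
printf("%s\n", buf);          // prints "hello" with no trailing garbage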
Have you considered not using a fixed array size if that is not what you want? You could just use a char* instead and assign its size dynamically.
If you want a fixed size for some reason, the only solution I can come up with is to track in a separate variable how long the word is and then only print the first n characters. There is no way to determine which chars in the array are valid and which are not, as far as I know.
Other than that, if your buffer is supposed to be just a byte array and not a (null terminated) string, then you can use fwrite instead of fprintf to dump the contents.
But in general, I agree with others that it could be better to use std::string.
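A minimal sketch of the fwrite approach (buf and len are placeholder names; the length is tracked separately):
char buf[3] = { 'h', 'i', '!' };   // 3 bytes, no terminator
size_t len = 3;                    // number of valid bytes, tracked separately
fwrite(buf, 1, len, stdout);       // writes exactly len bytes; no '\0' needed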

What are the problems of a zero-terminated string that length-prefixed strings overcome?

What are the problems of a zero-terminated string that length-prefixed strings overcome?
I was reading the book Write Great Code vol. 1 and I had that question in mind.
One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:
char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");
On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.
With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.
There are pros and cons to each way of representing strings, and whether they cause problems for you depends on what you are doing with strings and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
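A minimal sketch of that workaround: remember where the string currently ends, so each append starts there instead of rescanning from the front (the piece list is just an example).
char buf[1024] = "first";
size_t len = strlen(buf);               // find the end once, then remember it

const char *parts[] = { "second", "third", "fourth" };
for (size_t i = 0; i < 3; ++i) {
    size_t n = strlen(parts[i]);        // length of the piece being appended
    memcpy(buf + len, parts[i], n + 1); // copy the piece plus its '\0'
    len += n;                           // advance the remembered end; no rescan of buf
}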
One problem is that you cannot store null characters (value zero) in a zero-terminated string. This makes it impossible to store some character encodings as well as encrypted data.
Length-prefixed strings do not suffer that limitation.
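For instance, a length-carrying string such as std::string can hold an embedded zero byte that a C string cannot (a minimal illustration):
std::string s("ab\0cd", 5);                        // the length is stored, so the embedded zero survives
printf("%zu %zu\n", s.size(), strlen(s.c_str()));  // prints "5 2": strlen stops at the embedded zero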
First a clarification: C++ strings (i.e. std::string) weren't required to store a zero terminator internally until C++11. They have always provided access to a zero-terminated C string, though.
C-style strings end with a 0 character for historical reasons.
The problems you're referring to are mainly security issues: zero-terminated strings need to have their zero terminator. If it is missing (for whatever reason), the string's length becomes unreliable, and that can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be; DEP helps mitigate these issues, but it's off-topic here).
It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.
Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you're always having to look for the NULL character. In other words, if you know beforehand that you have a 129-character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes instead of character by character (see the sketch after this list).
Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.
Compiler Development Costs: Big costs are associated with optimizing compilers for null-terminated strings; the work would have been easier with an address-and-length format.
Hardware Development Costs: Hardware development costs are also large for string-specific instructions associated with null-terminated strings.
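As a rough illustration of the performance point above (dst, src, and len are placeholder names): when the length is already known, a copy can move the whole block at once, whereas a terminator-driven copy has to inspect every byte.
// length known up front: one bulk move (the library can copy in wide chunks internally)
memcpy(dst, src, len);

// length unknown: every byte has to be checked for '\0' as it is copied
while ((*dst++ = *src++) != '\0')
    ;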
A few more bonus features that can be implemented with length-prefixed strings:
It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.
One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:
An N-byte string
(Optional) An N-byte buffer holding a zero-length string
An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if B is 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's shorter than that, the byte following the string will be unused and could be set to NUL.
Using such an approach, one would need to initialize string buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.
One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.
One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".
I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
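For reference, the simplest form of the idea (ignoring all the variable-width-prefix refinements above) can be sketched as a length header followed by the character data; LPString and lps_make are made-up names for illustration, not any particular library's layout.
struct LPString {
    size_t length;                    // number of valid bytes that follow
    char   data[1];                   // allocated with room for 'length' bytes (struct hack)
};

struct LPString *lps_make(const char *bytes, size_t n) {
    struct LPString *s = (struct LPString *)malloc(sizeof *s + n);
    if (s == NULL) return NULL;
    s->length = n;                    // the length travels with the string
    memcpy(s->data, bytes, n);        // embedded zero bytes are fine
    return s;
}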
What are the problems of a zero-terminated string that length-prefixed strings overcome?
None whatsoever.
It's just eye candy.
Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable:
lpstring = "foobar"; // saves '6' somewhere "inside" lpstring
ztstring = "foobar";
ztlength = 6; // saves '6' in a helper variable
Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0' byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.

how to make a not null-terminated c string?

I am wondering: given char *cs = .....; what will happen with strlen() and printf("%s", cs) if cs points to a memory block which is huge but has no '\0' in it?
I wrote these lines:
char s2[3] = {'a','a','a'};
printf("str is %s,length is %d",s2,strlen(s2));
I get the result "aaa", "3", but I think this is only because a '\0' (a zero byte) happens to reside at location s2+3.
How do you make a non-null-terminated C string? strlen and the other C string functions rely heavily on the '\0' byte; what if there is no '\0'? I just want to understand this rule more deeply.
PS: my curiosity was aroused by studying the following post on SO.
How to convert a const char * to std::string
and these words in that post:
"This is actually trickier than it looks, because you can't call strlen unless the string is actually nul terminated."
If it's not null-terminated, then it's not a C string, and you can't use functions like strlen - they will march off the end of the array, causing undefined behaviour. You'll need to keep track of the length some other way.
You can still print a non-terminated character array with printf, as long as you give the length:
printf("str is %.3s",s2);
printf("str is %.*s",s2_length,s2);
or, if you have access to the array itself, not a pointer:
printf("str is %.*s", (int)(sizeof s2), s2);
You've also tagged the question C++: in that language, you usually want to avoid all this error-prone malarkey and use std::string instead.
A "C string" is, by definition, null-terminated. The name comes from the C convention of having null-terminated strings. If you want something else, it's not a C string.
So if you have a string that is not null-terminated, you cannot use the C string manipulation routines on it. You can't use strlen, strcpy or strcat. Basically, any function that takes a char* but no separate length is not usable.
Then what can you do? If you have a string that is not null-terminated, you will have the length separately. (If you don't, you're screwed. You need some way to find the length, either by a terminator or by storing it separately.) What you can do is allocate a buffer of the appropriate size, copy the string over, and append a null. Or you can write your own set of string manipulation functions that work with pointer and length. In C++ you can use std::string's constructor that takes a char* and a length; that one doesn't need the terminator.
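For example, given the array from the question, the length-taking constructor copies exactly the bytes you specify and needs no terminator (a minimal sketch):
char s2[3] = {'a','a','a'};                                      // not null-terminated
std::string str(s2, 3);                                          // copies exactly 3 chars; no '\0' required
printf("str is %s, length is %zu\n", str.c_str(), str.size());   // c_str() supplies the terminator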
Your supposition is correct: your strlen is returning the correct value out of sheer luck, because there happens to be a zero on the stack right after your improperly terminated string. It probably helps that the string is 3 bytes, and the compiler is likely aligning stuff on the stack to 4-byte boundaries.
You cannot depend on this. C strings need NUL characters (zeroes) at the end to work correctly. C string handling is messy, and error-prone; there are libraries and APIs that help make it less so… but it's still easy to screw up. :)
In this particular case, your string could be initialized as one of these:
A: char s2[4] = { 'a','a','a', 0 }; // good if string MUST be 3 chars long
B: char *s2 = "aaa"; // if you don't need to modify the string after creation
C: char s2[]="aaa"; // if you DO need to modify the string afterwards
Also note that declarations B and C are 'safer' in the sense that if someone comes along later and changes the string declaration in a way that alters the length, B and C are still correct automatically, whereas A depends on the programmer remembering to change the array size and keeping the explicit null terminator at the end.
What happens is that strlen keeps going, reading memory values until it eventually gets to a null. It then assumes that is the terminator and returns a length that could be massively large. If you're using strlen in an environment that expects C strings, you could then copy this huge buffer of data into another one that is just not big enough - causing buffer overrun problems - or, at best, you could copy a large amount of garbage data into your buffer.
Copying a non-null-terminated C string into a std::string will do this. If you then decide that you know this string is only 3 characters long and discard the rest, you will still have a massively long std::string that contains the first 3 good characters and then a load of wastage. That's inefficient.
The moral is: if you're using the CRT functions to operate on C strings, they must be null-terminated. It's no different from any other API; you must follow the rules that the API sets down for correct usage.
Of course, there is no reason you cannot use the CRT functions if you always use the length-limited versions (e.g. strncpy), but you will have to limit yourself to just those, always, and manually keep track of the correct lengths.
Convention states that a char array with a terminating \0 is a null terminated string. This means that all str*() functions expect to find a null-terminator at the end of the char-array. But that's it, it's convention only.
By convention also strings should contain printable characters.
If you create an array like you did char arr[3] = {'a', 'a', 'a'}; you have created a char array. Since it is not terminated by a \0 it is not called a string in C, although its contents can be printed to stdout.
The C standard does not define the term string until the section 7 - Library functions. The definition in C11 7.1.1p1 reads:
A string is a contiguous sequence of characters terminated by and including the first null character.
(emphasis mine)
If the definition of string is a sequence of characters terminated by a null character, a sequence of non-null characters not terminated by a null is not a string, period.
What you have done leads to undefined behavior: strlen reads past the end of the array, into memory that is not yours.
Change it to
char s2[] = {'a','a','a','\0'};

What is the internal structure of an object of the MFC CString class?

I need to strncpy() (effectively) from an MFC CString object to a C string variable. It's well known that strncpy() sometimes fails (depending on the source length and the length specified in the call) to terminate the destination C string correctly. To avoid that evil, I'm thinking of storing a NUL char inside the CString source object and then strcpy()ing or memmove()ing that guy.
Is this a reasonable way to go about it? If so, what must I manipulate inside the CString object? If not, then what's an alternative that will guarantee a properly-terminated destination C string?
strncpy() only "fails" to null-terminate the destination string when the source string is longer than the length limit you specify. You can ensure that the destination is null-terminated by setting its last character to null yourself. For example:
#define DEST_STR_LEN 10
char dest_str[DEST_STR_LEN + 1]; // +1 for the null
strncpy(dest_str, src_str, DEST_STR_LEN);
dest_str[DEST_STR_LEN] = '\0';
If src_str is DEST_STR_LEN characters or longer, dest_str will be a properly-terminated string of DEST_STR_LEN characters. If src_str is shorter than that, strncpy() will put a null terminator somewhere within dest_str, so the null at the very end is redundant but harmless.
CSimpleStringT::GetString gives a pointer to a null-terminated string. Use this as the source for strncpy. As this is C++, you should only use C-style strings when interfacing with legacy APIs. Use std::string instead.
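A minimal sketch of that, assuming an ANSI (non-_UNICODE) build and a CString named src (a placeholder name):
const char *p = src.GetString();       // null-terminated pointer to the CString's data
char dest[64];
strncpy(dest, p, sizeof(dest) - 1);    // copy at most 63 characters
dest[sizeof(dest) - 1] = '\0';         // guarantee termination no matter how long src is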
One alternative would be to zero the destination buffer first and then cast or memcpy from the CString.
I hope they haven't changed since I used them: that was many years ago :)
They used an interesting 'trick' to handle the refcount and the very fast, efficient automatic conversion to char*: i.e. the pointer you get is an LPCSTR, but some bytes in front of it are reserved to hold the implementation state.
So the object can be passed straight to the older Windows APIs (as an LPCSTR, with no overhead). I found the idea interesting at the time!
Of course the key is the custom allocation: it simply offsets the pointer when allocating/freeing.
I remember there was a way to request the buffer in order to (for instance) modify the data: GetBuffer(0), followed by ReleaseBuffer().
HTH
If you are not compiling with _UNICODE enabled, then you can get a const char * from a CString very easily. Just cast it to an LPCTSTR:
CString myString("stuff");
const char *byteString = (LPCTSTR)myString;
This is guaranteed to be NULL-terminated.
If you have built with _UNICODE, then CString is a UTF-16 encoded string. You can't really do anything directly with that.
If you do need to copy the data from the CString, this very easy, even using C-style code. Just make sure that you allocate sufficient memory and are copying the right length:
CString myString("stuff");
int len = myString.GetLength();             // CString's length method is GetLength()
char *outString = (char*)malloc(len + 1);   // +1 for the null terminator
strncpy(outString, (LPCTSTR)myString, len);
outString[len] = '\0';                      // strncpy alone won't terminate it here
CString ends with NULL so as long as your text is correct (no NULL characters inside) then copying should be safe. You can write:
char szStr[256];
strncpy(szStr, (LPCSTR) String, 3);
szStr[3] = '\0'; // because no null character is implicitly appended to the end of the destination here
If you store a NUL somewhere inside the CString object you will probably cause yourself more problems; CString stores its length internally.
Another alternative would involve support from the CPU or compiler, which would be a much better approach: simply make sure that when copying memory in a "safe" mode, a zero is appended at the end after every atomic operation, so that if the whole loop fails partway, the destination string is still terminated, without the need to zero it fully before making the copy.
There could also be support for fast zeroing - just mark the start and end of the region to be zeroed and have it cleared in RAM instantly; this would make things a lot easier.

wchar_t* to char* conversion problems

I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume the wide string is "New Text File.txt".
In the Visual Studio debugger, hovering over the variable shows "N" and some unknown Chinese letters, though in the Watch window the string is represented correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
I get an internal error in wcstombs.c, in the function _wcstombs_l_helper(), at this block:
if (*pwcs > 255)  /* validate high byte */
{
    errno = EILSEQ;
    return (size_t)-1;    /* error */
}
It's not thrown as an exception.
What can be the problem?
In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
    DWORD NextEntryOffset;
    DWORD Action;
    DWORD FileNameLength;
    WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable-length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);
The error you get says it all: wcstombs found a wide character that it cannot convert to a multibyte character (because it has no representation in the multibyte encoding). From the documentation:
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns -1 cast to type size_t and sets errno to EILSEQ.
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.