wchar_t* to char* conversion problems - c++

I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume that wchar string is "New Text File.txt"
In the Visual Studio debugger, hovering over the variable shows "N" followed by some unknown Chinese characters, though the Watch window displays the string correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:
if (*pwcs > 255) /* validate high byte */
{
errno = EILSEQ;
return (size_t)-1; /* error */
}
It's not thrown up as exception.
What can be the problem?

In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
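The bytes-versus-characters point can be sketched in a portable way. This is only an illustration: from_counted is a hypothetical helper name, and the data pointer stands in for fileInfo.FileName.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a std::wstring from a wide buffer that is
// NOT null-terminated and whose length is given in BYTES, the way
// FileNameLength is.
std::wstring from_counted(const wchar_t* data, std::size_t lengthInBytes)
{
    // Convert the byte count to a character count before using it.
    const std::size_t lengthInChars = lengthInBytes / sizeof(wchar_t);

    // This constructor copies exactly lengthInChars characters and never
    // looks for a terminator in the source buffer.
    return std::wstring(data, lengthInChars);
}
```

The returned string owns a properly terminated copy, so its c_str() can be handed to functions such as wcstombs() that expect a null-terminated input.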
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
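If you would rather not manage the buffers by hand, the same two-call idiom can be wrapped in a function that returns a std::string. This is a minimal sketch, not the only way to do it; wide_to_narrow is a hypothetical name, and the conversion uses whatever C locale is currently active.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a std::wstring to a multibyte std::string
// using the current C locale. std::wstring::c_str() is always
// null-terminated, which is what wcstombs() needs.
std::string wide_to_narrow(const std::wstring& wide)
{
    // First call: ask how many bytes the result needs (terminator excluded).
    const std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("wcstombs: character cannot be converted");

    // Second call: convert into a buffer with room for the terminator.
    std::vector<char> buffer(needed + 1);
    const std::size_t written =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        throw std::runtime_error("wcstombs: conversion failed");

    return std::string(buffer.data(), written);
}
```

Both buffers are released automatically, so no delete[] calls are needed. Conversion stops at the first embedded null character, just as with the raw calls.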
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
DWORD NextEntryOffset;
DWORD Action;
DWORD FileNameLength;
WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
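To make the layout concrete, here is a portable sketch. NotifyRecord is a hypothetical stand-in for FILE_NOTIFY_INFORMATION (with char16_t playing the role of the 2-byte WCHAR), and read_file_name is an illustrative helper, not a WinAPI function. It reads the name straight out of the raw buffer instead of out of a copied struct.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for FILE_NOTIFY_INFORMATION.
struct NotifyRecord {
    std::uint32_t NextEntryOffset;
    std::uint32_t Action;
    std::uint32_t FileNameLength;  // length of the name in BYTES
    char16_t      FileName[1];     // really FileNameLength bytes long
};

// Illustrative helper: extracts the name from the raw record bytes.
std::u16string read_file_name(const unsigned char* record)
{
    std::uint32_t lengthInBytes = 0;
    std::memcpy(&lengthInBytes,
                record + offsetof(NotifyRecord, FileNameLength),
                sizeof lengthInBytes);

    // Size the string in characters, then copy only whole characters.
    std::u16string name(lengthInBytes / sizeof(char16_t), u'\0');
    std::memcpy(&name[0],
                record + offsetof(NotifyRecord, FileName),
                name.size() * sizeof(char16_t));
    return name;
}
```

On typical platforms sizeof(NotifyRecord) is 16: 12 bytes of header, 2 bytes for the single FileName element, and 2 bytes of padding. A plain struct assignment therefore carries over at most two characters of the name, whereas reading from the original buffer gets all of them.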

My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);

The error you get says it all: it found a character that it cannot convert to MB (because it has no representation in MB). Source:
If wcstombs encounters a wide character it cannot convert to a
multibyte character, it returns –1 cast to type size_t and sets errno
to EILSEQ
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.

Related

Reading / Writing Control Characters in binary file

I'm currently processing a binary file using C++...
At some point I read a byte in and the char * read is "\x3" which seems to be a control character.
But when I go to write it back out using:
const char *control = "\x3";
fout.write(control, sizeof(control));
And then i read the binary file back in the read value is "\x11C".
How does one write the control character array back out to file the correct way?
Your code is writing 4 or 8 bytes to the binary file instead of the 1 you seem to be expecting. control is a pointer, so sizeof(control) is the size of the pointer itself (typically 4 bytes in a 32-bit build and 8 bytes in a 64-bit build), not the length of the data it points to.
The best way to fix this is to declare control as a single character, which is what you seem to intend:
char control = '\x3';
fout.write(&control, sizeof(control));
The other way, if you actually need to write multiple characters, is like this:
const std::string control = "\x3";
fout.write(control.data(), control.size());
Either method will correctly output the number of characters you expect.
Another method to write string literals, is by declaring them as an array:
static char const data[] = "Hello World!\n";
fout.write(data, sizeof(data) - 1U);
The - 1U is so that the terminating NUL is not written. Remove as you wish.
Since the data array is declared with no capacity, the compiler determines the length based on the content.
The sizeof can be used since the size of a character is 1 (by definition).
A nice advantage of this method is that the size is known at compile time. No searching for the length is required.
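The pointer-versus-array difference can be checked at compile time. The following is a small sketch; write_control_byte is a hypothetical function, and a std::ostringstream stands in for the asker's fout so the result can be inspected.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: writes the single control byte 0x03 to a stream
// and returns what was written.
std::string write_control_byte()
{
    const char* const asPointer = "\x03";  // sizeof gives the pointer size
    static const char asArray[] = "\x03";  // sizeof gives the array size

    static_assert(sizeof(asPointer) == sizeof(void*),
                  "sizeof(pointer) is the size of the pointer itself");
    static_assert(sizeof(asArray) == 2,
                  "one data byte plus the terminating NUL");

    std::ostringstream out(std::ios::out | std::ios::binary);
    out.write(asArray, sizeof(asArray) - 1);  // - 1 skips the NUL
    return out.str();
}
```

Only the array form lets sizeof see the real data length, which is why the original sizeof(control) on a pointer wrote too many bytes.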

Required to convert a String to UTF8 string

Problem Statement:
I am required to convert a generated string to UTF8 string, this generated string has extended ascii characters and I am on Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress so I can only provide small code samples
and complete solution can be posted only once ready.
Why I require UTF8 (I have extended ascii characters to be stored in a string which has to be UTF8).
How I am proceeding:
Convert generated string to wchar_t string.
Please look at the below sample code
int main(){
char CharString[] = "Prova";
iconv_t cd;
wchar_t WcharString[255];
size_t size= mbstowcs(WcharString, CharString, strlen(CharString));
wprintf(L"%ls\n", WcharString);
wprintf(L"%s\n", WcharString);
printf("\n%zu\n",size);
}
One question here:
Output is
Prova?????
s
Why the size is not printed here ?
Why the second printf prints only one character.
If I print size before both printed string then only 5 is printed and both strings are missing from console.
Moving on to Second Part:
Now that I will have a wchar_t string I want to convert it to UTF8 string
For this I was surfing through and found iconv will help here.
Question here
These are the methods I found in manual
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert back wchar_t array to char array to before feeding to iconv ?
Please provide suggestions on the above issues.
Extended ascii I am talking about please see letters i in the marked snapshot below
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. That means that the second byte of the wchar_t (counting from the end, since you have a little-endian architecture) is a 0, so if you treat the string as a string of characters, it will be terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
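For the first of the three points above, a hedged sketch of a conversion that cannot come back unterminated: size the output from the input instead of passing strlen() as the output limit. narrow_to_wide is a hypothetical helper name; calling mbstowcs() with a null destination to get the required length is behaviour documented by POSIX and by the Microsoft CRT.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a null-terminated multibyte string to a
// std::wstring using the current C locale.
std::wstring narrow_to_wide(const char* text)
{
    // Ask how many wide characters are needed (terminator excluded).
    const std::size_t needed = std::mbstowcs(nullptr, text, 0);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("mbstowcs: invalid multibyte sequence");

    // The limit passed here is the OUTPUT capacity in wchar_t units,
    // including room for the terminator.
    std::vector<wchar_t> buffer(needed + 1);
    std::mbstowcs(buffer.data(), text, buffer.size());
    return std::wstring(buffer.data(), needed);
}
```

The result can then be printed with wprintf(L"%ls\n", result.c_str()), keeping every later call on that stream a wide call as well.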
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself. That is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t, just use uint32_t for your characters.
You will learn much if you implement yourself and your program will gain speed from not using mb or iconv functions.
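As a rough illustration of what implementing it yourself involves, here is a sketch of a UTF-8 encoder. Both function names are hypothetical, and latin1_to_utf8 assumes the "extended ascii" input is ISO-8859-1, where each byte value equals its Unicode code point; that assumption does not hold for other ISO-8859-x encodings or for Windows code pages.

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one code point to 'out'. Returns false
// for surrogates (D800-DFFF) and for values above 10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// ASSUMPTION: the input is ISO-8859-1, so byte value == code point.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char byte : latin1) append_utf8(byte, utf8);
    return utf8;
}
```

Plain ASCII bytes pass through unchanged, and every byte from 0x80 upward becomes a two-byte sequence.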

Have a PCWSTR and need it to be a WCHAR[]

I am re-writing a C++ method from some code I downloaded. The method originally took a PCWSTR as a parameter and then prompted the user to enter a file name. I modified the method to take two parameters (both PCWSTR) and not to prompt the user. I am already generating the list of files somewhere else. I am attempting to call my new (modified) method with both parameters from my method that iterates the list of files.
The original method prompted the user for input using a StringCBGetsW command. Like this...
HRESULT tst=S_OK; //these are at the top of the method
WCHAR fname[85] = {0}; //these are at the top of the method
tst = StringCbGetsW(fname,sizeof(fname));
The wchar fname gets passed to another iteration method further down. When I look at that method, it says it's a LPCWSTR type; I'm assuming it can take the WCHAR instead.
But what it can't do is take the PCWSTR that the method got handed. My ultimate goal is to try not prompt the user for the file name and to take instead the filename that was iterated earlier in another method.
tl;dr. I have a PCWSTR and it needs to get converted to a WCHAR. I don't know what a WCHAR [] is or how to do anything with it. Including to try to do a printf to see what it is.
PS...I know there are easier ways to move and copy around files, there is a reason I'm attempting to make this work using a program.
First, let's try to make some clarity on some Windows specific types.
WCHAR is a typedef for wchar_t.
On Windows with Microsoft Visual C++, it's a 16-bit character type (that can be used for Unicode UTF-16 strings).
PCWSTR and LPCWSTR are two different names for the same thing: they are basically typedefs for const wchar_t*.
The initial L in LPCWSTR is some legacy prefix that, read with the following P, stands for "long pointer". I've never programmed Windows in the 16-bit era (I started with Windows 95 and Win32), but my understanding is that in 16-bit Windows there were something like near pointers and far, or long pointers. Now we have just one type of pointers, so the L prefix can be omitted.
The P stands for "pointer".
The C stands for "constant".
The W stands for WCHAR/wchar_t, and last but not least, the STR part stands for "string".
So, decoding this kind of "Hungarian Notation", PCWSTR means const wchar_t*.
Basically, it's a pointer to a read-only NUL-terminated wchar_t Unicode UTF-16 string.
Is this information enough for you to solve your problem?
If you have a wchar_t string buffer, and a function that expects a PCWSTR, you can just pass the name of the buffer (corresponding to the address of its first character) to the function:
WCHAR buffer[100];
DoSomething(buffer, ...); // DoSomething(PCWSTR ....)
Sometimes - typically for output string parameters - you may also want to specify the size (i.e. "capacity") of the destination string buffer.
If this size is expressed using a count in characters (in this case, in wchar_ts), then the usual Win32 Hungarian Notation is cch ("count of characters"); else, if you want the size expressed in bytes, then the usual prefix is cb ("count of bytes").
So, if you have a function like StringCchCopy(), then from the Cch part you know the size is expressed in characters (wchar_ts).
Note that you can use _countof() to get the size of a buffer in wchar_ts.
e.g. in the above code snippet, _countof(buffer) == 100, since buffer is made by 100 wchar_ts; instead, sizeof(buffer) == 200, since each wchar_t is 2 bytes == 16 bits in size, so the total buffer size in bytes is 100 [wchar_t] * 2 [bytes/wchar_t] = 200 [bytes].
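Going the other way (you hold a PCWSTR, that is a const wchar_t*, and want its contents in a WCHAR array) means copying the characters into the buffer. Below is a portable sketch using plain wchar_t; count_of and copy_to_buffer are hypothetical names, with count_of playing the role of _countof(). On Windows, StringCchCopyW(dest, _countof(dest), src) from strsafe.h would be the usual way to do the same copy.

```cpp
#include <cstddef>
#include <cwchar>

// Number of elements in a fixed-size array (what _countof() reports).
template <typename T, std::size_t N>
constexpr std::size_t count_of(T (&)[N]) { return N; }

// Copies a null-terminated wide string into a fixed-size buffer,
// truncating if necessary and always terminating the result.
// Returns true only if the whole string fit.
template <std::size_t N>
bool copy_to_buffer(wchar_t (&dest)[N], const wchar_t* src)
{
    const std::size_t length = std::wcslen(src);
    const std::size_t toCopy = length < N - 1 ? length : N - 1;
    std::wmemcpy(dest, src, toCopy);
    dest[toCopy] = L'\0';
    return toCopy == length;
}
```

With WCHAR fname[85] = {0}; as in the original method, copy_to_buffer(fname, fileName) fills fname from the PCWSTR parameter (here called fileName for illustration), after which fname can be passed on exactly as it was after the StringCbGetsW call.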

why std::wofstream do not print all wstring into file?

I have a std::wstring whose size is 139,580,199 characters.
For debugging I printed it into file with this code:
std::wofstream f(L"C:\\some file.txt");
f << buffer;
f.close();
After that noticed that the end of string is missing. The created file size is 109,592,584 bytes (and the "size on disk" is 109,596,672 bytes).
Also checked if buffer contains null chars, did this:
size_t pos = buffer.find(L'\0');
Expecting result to be std::wstring::npos but it is 18446744073709551615, but my string doesn't have null char at the end so probably it's ok.
Can somebody explain, why I have not all string printed into file?
A lot depends on the locale, but typically, files on disk will
not use the same encoding form (or even the same encoding) as
that used by wchar_t; the filebuf which does the actual
reading and writing translates the encodings according to its
imbued locale. And there is only a vague relationship between
the length of a string in different encodings or encoding form.
(And the size the system sees doesn't correspond directly to the
number of bytes you can read from the file.)
To see if everything was written, check the status of f
after the close, i.e.:
f.close();
if ( !f ) {
// Something went wrong...
}
One thing that can go wrong is that the external encoding
doesn't have a representation for one of the characters. If
you're in the "C" locale, this could occur for any character
outside of the basic execution character set.
If there is no error above, there's no reason off hand to assume
that not all of the string has been written. What happens if
you try to read it in another program? Do you get the same
number of characters or not?
For the rest, nul characters are characters like any others in
a std::wstring; there's nothing special about them, including
when they are output to a stream. And 18446744073709551615
looks very much like the value I would expect for
std::wstring::npos on a 64 bit machine.
EDIT:
Following up on Mat Petersson's comment: it's actually highly
unlikely that the file ends up with less bytes than there are
code points in the std::wstring. (std::wstring::size()
returns the number of code points.) I was thinking in terms of
bytes, not in terms of what std::wstring::size() returns. So
the most likely explanation is that you have some characters in
your string which aren't representable in the target encoding
(which probably only supports characters with code points
32-126, plus a few control characters, by default).
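The status check described above can be folded into a small function. This is a sketch under the assumption that the stream keeps its default locale; write_wide_file is a hypothetical name.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes 'text' to 'path' and reports whether the
// stream was still in a good state after closing.
bool write_wide_file(const char* path, const std::wstring& text)
{
    std::wofstream f(path);
    f << text;
    f.close();         // flushes; a failed flush or close sets failbit
    return !f.fail();  // false if opening, converting or writing failed
}
```

A false return is how an unrepresentable character would show up: the stream stops writing at that point and stays in a failed state, which matches a file that is shorter than the string.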

MS VC++ Convert a byte array to a BSTR?

I have a string that starts out in a .Net application, is encrypted and stored in AD. It's then picked up by a native C++ app and decrypted to produce an array of bytes
e.g. "ABCDEF" becomes 00,41,00,42,00,43,00,44,00,45 once it has been decrypted at the C++ end.
I need to take this byte array and convert it to the BSTR "ABCDEF" so that I can use it elsewhere and I can't find a way to accomplish this last step.
Can anybody help?
If you really have an array of arbitrary bytes, use SysAllocStringByteLen. But it looks like, despite being in a byte array, your data is really a UTF-16-encoded Unicode string, so in that case, you're probably better off using SysAllocStringLen instead. Pass the byte-array pointer to the function (type-cast to OLECHAR*), and the characters will be copied into the new string for you, along with an additional null character at the end.
The "decrypted string" is just a Unicode string - latin characters contain the first byte equal to null when represented in Unicode. So you don't need any real conversion, just cook a BSTR out of that buffer.
Knowing the number of Unicode characters - it will be half the length of the buffer - call SysAllocStringLen() to allocate a long enough null-terminated uninitialized BSTR. Then copy your array onto the allocated string with memcpy(). Alternatively you can call SysAllocStringLen() and pass it the byte buffer so that it does the copy for you and skip the memcpy(). Don't forget to call SysFreeString() when you no longer need the BSTR.
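One detail worth checking before handing the buffer to SysAllocStringLen() is the byte order. If the bytes really arrive in the order listed in the question (00 before 41), each pair has its high byte first, whereas Windows expects the low byte first. The sketch below is portable and does not call any COM function; decode_utf16_bytes is a hypothetical helper, and whether the flag should be true for your data is an assumption you would need to verify.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turns a raw byte array holding UTF-16 code units
// into a std::u16string. The character count is half the byte count;
// a trailing odd byte is ignored.
std::u16string decode_utf16_bytes(const unsigned char* bytes,
                                  std::size_t byteCount,
                                  bool highByteFirst)
{
    std::u16string text;
    text.reserve(byteCount / 2);
    for (std::size_t i = 0; i + 1 < byteCount; i += 2) {
        const unsigned hi = highByteFirst ? bytes[i] : bytes[i + 1];
        const unsigned lo = highByteFirst ? bytes[i + 1] : bytes[i];
        text.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return text;
}
```

The resulting buffer and its size() give the pointer and character count that SysAllocStringLen() takes.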