Reading / Writing Control Characters in binary file - c++

I'm currently processing a binary file using C++...
At some point I read a byte in and the char * read is "\x3" which seems to be a control character.
But when I go to write it back out using:
const char *control = "\x3";
fout.write(control, sizeof(control));
When I then read the binary file back in, the value read is "\x11C".
How does one write the control character array back out to file the correct way?

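Your code is writing 4 or 8 bytes to the binary file instead of the 1 you seem to be expecting. control is a plain pointer, so sizeof(control) returns the size of the pointer itself (typically 4 bytes on a 32-bit build and 8 on a 64-bit build) without considering the data it points to. Since the literal "\x3" occupies only 2 bytes (the control character plus the terminating NUL), write() also ends up reading past the end of the literal.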
The best way to fix this is to declare control as a single character, which is what you seem to intend:
char control = '\x3';
fout.write(&control, sizeof(control));
The other way, if you actually need to write multiple characters, is like this:
const std::string control = "\x3";
fout.write(control.data(), control.size());
Either method will correctly output the number of characters you expect.
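To make the size difference concrete, here is a minimal sketch. The helper names (write_single_char, write_string, size_passed_by_original_code) are made up for illustration, and a std::ostringstream stands in for the question's fout so that nothing touches the disk.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: writes one char the way the first fix does and
// returns the bytes that ended up in the stream.
std::string write_single_char(char control) {
    std::ostringstream fout(std::ios::out | std::ios::binary);
    fout.write(&control, sizeof(control));  // sizeof(char) is 1 by definition
    return fout.str();
}

// Hypothetical helper: writes a std::string the way the second fix does.
std::string write_string(const std::string& control) {
    std::ostringstream fout(std::ios::out | std::ios::binary);
    fout.write(control.data(), static_cast<std::streamsize>(control.size()));
    return fout.str();
}

// The size the original code passed to write(): the size of the pointer
// itself (commonly 4 or 8), not the length of the data it points to.
std::size_t size_passed_by_original_code() {
    const char* control = "\x3";
    return sizeof(control);
}
```

Both helpers produce exactly one byte for "\x3", while size_passed_by_original_code() reports whatever a pointer occupies on the target platform.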

Another way to write string literals is to declare them as an array:
static char const data[] = "Hello World!\n";
fout.write(data, sizeof(data) - 1U);
The - 1U is so that the terminating NUL is not written. Remove as you wish.
Since the data array is declared without an explicit size, the compiler determines its length from the initializer.
The sizeof can be used since the size of a character is 1 (by definition).
A nice advantage of this method is that the size is known at compile time. No searching for the length is required.
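The same idea can be wrapped in a small template so the - 1U does not have to be repeated at every call site. This is only a sketch: write_literal and literal_bytes are hypothetical names, and the array size N is deduced by the compiler from the literal's type.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: N is deduced from the array type at compile time,
// so NUL bytes embedded inside the literal are counted as data.
template <std::size_t N>
void write_literal(std::ostream& out, const char (&data)[N]) {
    out.write(data, static_cast<std::streamsize>(N - 1));  // skip the terminating NUL
}

// Convenience wrapper for demonstration: returns the bytes that were written.
template <std::size_t N>
std::string literal_bytes(const char (&data)[N]) {
    std::ostringstream out;
    write_literal(out, data);
    return out.str();
}
```

For example, literal_bytes("a\0b") yields three bytes, because the length comes from the array type rather than from a search for the first NUL.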

Related

How to hard code binary data to string

I want to test serialized data conversion in my application, currently the object is stored in file and read the binary file and reloading the object.
In my unit test case I want to test this operation. As the file operations are costly I want to hard code the binary file content in the code itself.
How can I do this?
Currently I am trying like this,
std::string FileContent = "\00\00\00\00\00.........";
and it is not working.
You're right that a string can contain '\0', but here you're still initializing it from a const char*, which, by definition, stops at the first '\0'. I'd recommend using uint8_t[] or even uint32_t[] (that is, without going through std::string), even though the second might carry up to 3 bytes of overhead (it is more compact in source, though). That's how X bitmaps are usually stored, for example.
Another possibility is base64 encoding, which is printable but needs (a relatively quick) decoding.
If you really want to put the const char[] to a std::string, first convert the pointer to const char*, then use the two-iterator constructor of std::string. While it's true that std::string can hold '\0', it's somewhat an antipattern to store binary in a string, thus I'm not giving the exact code, just the hint.
The following should do what you need, although it is probably not recommended, as most people wouldn't expect a std::string to contain null bytes.
std::string FileContent { "\x00\x00\x00\x00\x00", 5 };
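For reference, the byte-array approach and the two-iterator constructor mentioned above could look roughly like this. The byte values in kFileContent are made-up sample data, not the asker's real file, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Made-up sample bytes standing in for the serialized object.
static const std::uint8_t kFileContent[] = {0x00, 0x00, 0x00, 0x2A, 0x00, 0xFF};

// Preferred: keep binary data in a byte container.
std::vector<std::uint8_t> file_content_as_bytes() {
    return std::vector<std::uint8_t>(std::begin(kFileContent), std::end(kFileContent));
}

// If a std::string is unavoidable, the two-iterator constructor copies
// every byte, embedded zeros included.
std::string file_content_as_string() {
    return std::string(std::begin(kFileContent), std::end(kFileContent));
}

// For contrast: initializing from a plain literal stops at the first '\0',
// so this string ends up empty.
std::string truncated_content() {
    return std::string("\x00\x00\x00\x2A");
}
```

A unit test can then feed file_content_as_bytes() straight into the deserialization code without any file I/O.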

Returning string from function having multiple NULL '\0' in C++

I am compressing a string, and the compressed string sometimes has a NULL character inside it before the terminating NULL. I want to return the string up to the terminating NULL, but the compressor function returns the string only up to the first NULL. I asked a question about this for C before, but I now need the solution in C++ as well, and next in C#. Please help me. Thanks.
#include <cstdio>

const char* compressor(const char* str)
{
    const char* compressed_string;
    // After some calculation
    compressed_string = "bk\0dk"; // embedded NULL; another NULL is appended automatically at the end
    return compressed_string;
}

int main()
{
    const char* str;
    str = compressor("Muhammad Ashikuzzaman");
    printf("Compressed Value = %s", str);
}
The output is : Compressed Value = bk;
None of the other characters from the compressor function appear. Is there any way to show the whole string?
The fundamental problem that you have is that compression algorithms operate on binary data rather than text. If you compress something, then expect some of the compressed bytes to be zero. Thus the compressed data cannot be stored in a null-terminated string.
You need to change your mindset to work with binary data.
To compress do the following:
Convert from text to binary using some well-defined encoding. For instance, UTF-8. This will yield an array of unsigned char.
Compress the unsigned char, which will again yield an array of unsigned char, but now compressed.
To decompress you just reverse these steps.
Since you are writing C++ code, you would be well advised to use standard containers such as std::string, std::wstring, and std::vector<T>.
The exact same principles apply in all languages. When you come to code this in C#, you need to convert from text to binary. Use Encoding.GetBytes() to do that. That yields a byte array, byte[]. Compress that to another byte array. And so on.
But you really must first overcome this desire to attempt to store binary data in text data types.
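As a rough illustration of that mindset, the sketch below keeps everything in std::vector<unsigned char>, so the length travels with the data and zero bytes are harmless. The run-length scheme is a toy stand-in chosen only to keep the example short; it is not the asker's algorithm, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Text to binary. Assumes the std::string already holds UTF-8 (or ASCII) bytes.
std::vector<unsigned char> to_bytes(const std::string& text) {
    return std::vector<unsigned char>(text.begin(), text.end());
}

// Toy run-length encoder: emits (count, value) pairs, count in 1..255.
std::vector<unsigned char> rle_compress(const std::vector<unsigned char>& in) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

// Reverses rle_compress by expanding each (count, value) pair.
std::vector<unsigned char> rle_decompress(const std::vector<unsigned char>& in) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        out.insert(out.end(), static_cast<std::size_t>(in[i]), in[i + 1]);
    }
    return out;
}
```

The compressed vector may well contain zero bytes, but since its size() is known, nothing gets cut off when it is returned, stored, or written to a file.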

Trouble creating char * object to send to c++/c shared object library

First off I am new to ctypes and did search for an answer to my question. Definitely will appreciate any insight from here.
I have a byte string supplied to me by another tool. It contains what appears to be hex and other values. I'm creating the c_char_p object as follows:
mybytestring = b'something with a lot of hex \x00\x00\xc7\x87\x9bb and other alphanumeric and non-word characters' # Length of this is very long let's say 480
mycharp = c_char_p(mybytestring)
I also create a c_char_Array as follows:
mybuff = create_string_buffer(mybytestring)
The problem is when I send either mycharp or mybuff to a c++ library .so function, the string gets cut off at the NULL terminator (first occurrence of '\x00')
I'm loading the c++ library and calling the function as follows:
lib_handle = cdll.LoadLibrary("mylib.so")
lib_handle.myfunction(mycharp)
lib_handle.myfunction(mybuff)
The c++ function expects a char *
Does someone know how to be able to send the whole string with NULL terminators ('\x00') included?
Thanks
Add your original data to a vector<char> vec, and send vec.data()
But the actual problem is
The c++ function expects a char *.
You will need to change this (for example, to accept a second argument giving the length of the buffer, or to accept a vector<char>) if you want it to accept an array of char that includes nulls.
Alternatively, you can figure out what you actually want the C++ function to do and preprocess the char array yourself, adding a null terminator to each new array before sending it to the C++ function.
For example, you may decide that the "input" array is actually a set of C strings: you will then need a simple parse to split it, and send the pieces to the C++ function one after another in a loop.
Or maybe you decide that the input could be a string in UTF-16 rather than UTF-8. Then you need to convert it to UTF-8 as well as you can and send that to the C++ function.
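On the C++ side, the pointer-plus-length variant might look like the sketch below. count_zero_bytes is a hypothetical replacement for myfunction; its body just counts zero bytes to show that the whole buffer is visible to the callee. The extern "C" linkage is an assumption that keeps the exported symbol name unmangled, which is what a loader such as ctypes looks up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical replacement for myfunction: the caller passes the length
// explicitly, so embedded '\0' bytes are treated as ordinary data.
extern "C" std::size_t count_zero_bytes(const char* data, std::size_t length) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == '\0') ++zeros;
    }
    return zeros;
}
```

The caller would then pass the buffer together with its length (for the question's data, the length of mybytestring) instead of relying on a terminator.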

wchar_t* to char* conversion problems

I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume that wchar string is "New Text File.txt"
In the Visual Studio debugger, hovering over the variable shows "N" and some unknown Chinese letters, though in the watch window the string is represented correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:
if (*pwcs > 255) /* validate high byte */
{
errno = EILSEQ;
return (size_t)-1; /* error */
}
It's not thrown as an exception.
What can be the problem?
In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
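If you would rather not manage the two buffers by hand, the same two-call pattern can be sketched with std::wstring and std::string, which clean up after themselves. narrow_counted is a hypothetical helper; it takes the length in bytes, as FileNameLength does, and it relies on wcstombs() returning the required size when the destination is NULL, which MSDN documents as quoted above and POSIX also specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts a wide string that is NOT null-terminated,
// given its length in bytes, to a multibyte std::string.
std::string narrow_counted(const wchar_t* data, std::size_t length_in_bytes) {
    // Copying into std::wstring gives us a null-terminated c_str().
    std::wstring wide(data, length_in_bytes / sizeof(wchar_t));

    // First call: ask how many bytes the converted string needs.
    std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return std::string();  // a character could not be converted (EILSEQ)
    }

    // Second call: do the conversion; std::string keeps its own terminator.
    std::string out(needed, '\0');
    std::wcstombs(&out[0], wide.c_str(), needed);
    return out;
}
```

With the question's data, the call would be along the lines of narrow_counted(fileInfo.FileName, fileInfo.FileNameLength), and no delete[] is needed afterwards.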
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
DWORD NextEntryOffset;
DWORD Action;
DWORD FileNameLength;
WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
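The arithmetic behind "two characters" can be checked with a portable stand-in for the structure. This is a sketch under assumptions: NotifyRecord is a hypothetical mirror of the declaration above, with std::uint32_t playing the role of DWORD and std::uint16_t the role of the 2-byte Windows WCHAR, and the sizes quoted hold on typical ABIs where the struct is aligned to 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable mirror of FILE_NOTIFY_INFORMATION.
struct NotifyRecord {
    std::uint32_t NextEntryOffset;
    std::uint32_t Action;
    std::uint32_t FileNameLength;
    std::uint16_t FileName[1];
};

// Number of FileName-sized elements that fit between the start of FileName
// and the end of the struct, tail padding included. A byte-wise copy of
// sizeof(NotifyRecord) bytes carries exactly this many characters along.
std::size_t name_elements_covered_by_copy() {
    return (sizeof(NotifyRecord) - offsetof(NotifyRecord, FileName)) / sizeof(std::uint16_t);
}
```

On such platforms the three 4-byte fields take 12 bytes, FileName[1] takes 2, and 2 bytes of tail padding round the struct up to 16, so the copy covers room for two characters.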
My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);
The error you get says it all: it found a character that it cannot convert to a multibyte character (because it has no representation in the multibyte character set). Source:
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t and sets errno to EILSEQ
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
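A sketch closer to the getstring() the question asks for might look like this. read_bytes is a hypothetical name; it works on any std::istream, and it uses gcount() so that a short read near the end of the file returns only the bytes that were actually read.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads up to `count` bytes from the current position.
// The bytes are copied verbatim; no encoding conversion takes place.
std::string read_bytes(std::istream& in, std::size_t count) {
    std::string buffer(count, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(count));
    // gcount() reports how many bytes the last read() actually delivered.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Each call continues where the previous one stopped, so no explicit seek is needed between two consecutive 10-byte reads. Note that a read that hits the end of the file sets eofbit and failbit on the stream.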
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X chars of a multibyte char set (such as UTF-8), you'll either have to read a (single byte) char at a time (e.g. using istream::get()), or read X chars speculatively (e.g. using istream::getline()), and test the MBCS signals yourself, or use a third-party library to do it.
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
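For UTF-8 specifically, "testing the signals yourself" can be fairly small, because the lead byte of each character says how many continuation bytes follow. The sketch below is an illustration only: read_utf8_chars is a hypothetical name, the input is assumed to be well-formed UTF-8, and "character" here means a code point, not a user-perceived grapheme.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads up to `count` UTF-8 code points, one byte at a
// time, and returns their bytes unchanged. Assumes well-formed input.
std::string read_utf8_chars(std::istream& in, std::size_t count) {
    const int eof = std::char_traits<char>::eof();
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        int lead = in.get();
        if (lead == eof) break;
        out.push_back(static_cast<char>(lead));

        // The high bits of the lead byte encode the sequence length.
        int extra = 0;
        if ((lead & 0xE0) == 0xC0) extra = 1;       // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence

        for (int k = 0; k < extra; ++k) {
            int next = in.get();
            if (next == eof) return out;  // truncated sequence at end of input
            out.push_back(static_cast<char>(next));
        }
    }
    return out;
}
```

A production version would also validate the continuation bytes; a library such as UTF-CPP handles those details.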
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.
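Wrapped up as a function, that could look like the following sketch. read_with_sgetn is a hypothetical name. One difference from istream::read() worth knowing: sgetn() works directly on the buffer, so a short read does not set eofbit or failbit on the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: reads up to `count` chars straight from the stream's
// buffer and returns however many were available.
std::string read_with_sgetn(std::istream& in, std::size_t count) {
    std::string buffer(count, '\0');
    std::streamsize got =
        in.rdbuf()->sgetn(&buffer[0], static_cast<std::streamsize>(count));
    buffer.resize(static_cast<std::size_t>(got));
    return buffer;
}
```

Because the stream state is left untouched, the caller has to compare the returned length with the requested one to detect the end of the data.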