C++: Parsing data bytes into a struct

I am new here and I have a question.
I have a struct whose overall size is, let's say, 8 bytes. Here is the struct:
struct Header
{
    int ID;         // 4 bytes
    char Title[4];  // 4 bytes too
};                  // so it's 8 bytes, right?
and I have a file that is 8 bytes too...
I just want to ask: how do I parse the data in that file into that struct?
I have tried this:
Header* ParseHeader(char* filename)
{
    char* buffer = new char[8];
    fstream fs(filename);
    if (fs.is_open() != true)
        throw new exception("Couldn't Open file for Parsing Header.");
    fs.read(buffer, 8);
    if (!fs)
    {
        delete[] buffer;
        throw new exception("Couldn't Read header OJN file.\nHeader data was corrupted");
    }
    Header* header = (Header*)((void*)buffer);
    delete[] buffer;
    fs.close();
    return header;
}
but it fails and returns invalid data, different from what I expected (I can assure you this is not the file's fault; the file is structured correctly).
Can someone help me?
Thanks

Seems like you do everything fine until this point:
Header* header = (Header*)((void*)buffer);
delete[] buffer;
fs.close();
Notice that you delete the buffer after the cast, meaning that header points to freed memory -> junk. You need to either not delete the buffer, or copy the data if you want to keep using it.
Also, to be quite honest, I don't understand how your code compiles: your function states it returns a Header, while you return a Header*.

You are deleting the data that is being returned. Therefore header is no longer accessible.
I think you meant the line to be:
Header header = *(Header*)((void*)buffer);
This will actually copy the header.
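Putting it together, here is a minimal sketch of a corrected function. It assumes the struct really is 8 bytes with no padding, and it replaces throw new exception(...) (an MSVC extension; the standard std::exception has no string constructor, and throwing pointers is best avoided) with std::runtime_error:
#include <cstring>
#include <fstream>
#include <stdexcept>

Header ParseHeader(const char* filename)
{
    std::ifstream fs(filename, std::ios::binary);
    if (!fs.is_open())
        throw std::runtime_error("Couldn't open file for parsing header.");

    char buffer[sizeof(Header)];
    fs.read(buffer, sizeof(buffer));
    if (!fs)
        throw std::runtime_error("Couldn't read header: data was truncated.");

    Header header;
    std::memcpy(&header, buffer, sizeof(header)); // copy before the buffer goes away
    return header;                                // return by value
}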

The fact that your 8-byte file correctly maps to your struct Header is mere luck as far as C++ is concerned. The structure could have internal padding that makes it bigger than 8 bytes, and the data endianness could differ between your file and your CPU.
I realize your code works with your particular compiler version, on your operating system and on your CPU, but you should not get into the habit of coding like that, otherwise you'll probably get into big trouble as soon as you change any of those parameters (or maybe even just some compiler flags). In other words, what you are doing is extremely bad practice. In C++ you don't even have the guarantee that an int is actually 4 bytes.
The Right Way™ to load such binary data from a file is to load each field individually and ensure proper endianness conversion depending on the CPU you're using (e.g. through the hton*/ntoh* functions or similar). Using fixed-size types like int32_t also helps.
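For illustration, a minimal sketch of that approach. Header is redeclared here with a fixed-size int32_t, and le32_to_host is a hypothetical helper that assumes the file stores ID as a little-endian 32-bit integer:
#include <cstdint>
#include <cstring>
#include <istream>

struct Header
{
    std::int32_t ID;
    char Title[4];
};

// Assemble the value byte by byte so the host's endianness doesn't matter.
static std::int32_t le32_to_host(const unsigned char b[4])
{
    std::uint32_t v = b[0]
                    | (std::uint32_t(b[1]) << 8)
                    | (std::uint32_t(b[2]) << 16)
                    | (std::uint32_t(b[3]) << 24);
    return static_cast<std::int32_t>(v);
}

bool ReadHeader(std::istream& in, Header& out)
{
    unsigned char raw[8];
    if (!in.read(reinterpret_cast<char*>(raw), sizeof(raw)))
        return false;
    out.ID = le32_to_host(raw);
    std::memcpy(out.Title, raw + 4, sizeof(out.Title));
    return true;
}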

Just define your structure on a 1-byte boundary:
#pragma pack(1)
struct Header
{
    int ID;         // 4 bytes
    char Title[4];  // 4 bytes too
};
#pragma pack()
The first pack statement instructs the compiler to align structure members on 1-byte boundaries, so sizeof(Header) will be 8 bytes. The second pack statement restores the default setting. You may need the push/pop forms (#pragma pack(push, 1) and #pragma pack(pop)) to save and restore the previous packing, but that isn't required here.
Secondly, and more importantly, you should not use hard-coded values like 8. Always use sizeof to read or write a structure. Also, this statement is not needed at all:
char* buffer = new char[8];
...
Just declare a Header variable itself, and read into it:
Header header;
...
fs.read(reinterpret_cast<char*>(&header), sizeof(Header));
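Putting the pieces together, a minimal sketch (assuming the packed, 8-byte on-disk layout and matching host byte order):
#include <fstream>
#include <stdexcept>

Header ReadHeaderFromFile(const char* filename)
{
    std::ifstream fs(filename, std::ios::binary);
    Header header{};
    fs.read(reinterpret_cast<char*>(&header), sizeof(header));
    if (!fs)
        throw std::runtime_error("Couldn't read header.");
    return header;
}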

Related

Rewrite file with 0's. What am I doing wrong?

I want to rewrite a file with 0's, but it only writes a few bytes.
My code:
int fileSize = boost::filesystem::file_size(filePath);
int zeros[fileSize] = { 0 };
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
rewriteFile << zeros;
Also... Is this enough to shred the file? What should I do next to make the file unrecoverable?
EDIT: OK, I rewrote my code to this. Is this okay now?
int fileSize = boost::filesystem::file_size(filePath);
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
for(int i = 0; i < fileSize; i++) {
    rewriteFile << 0;
}
There are several problems with your code.
int zeros[fileSize] = { 0 };
You are creating an array that is sizeof(int) * fileSize bytes in size. For what you are attempting, you need an array that is fileSize bytes in size instead. So you need to use a 1-byte data type, like (unsigned) char or uint8_t.
But, more importantly, since the value of fileSize is not known until runtime, this type of array is known as a "Variable Length Array" (VLA), which is a non-standard feature in C++. Use std::vector instead if you need a dynamically allocated array.
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
The trunc flag truncates the size of an existing file to 0. What that entails is to update the file's metadata to reset its tracked byte size, and to mark all of the file's used disk sectors as available for reuse. The actual file bytes stored in those sectors are not wiped out until overwritten as sectors get reused over time. But any bytes you subsequently write to the truncated file are not guaranteed to (and likely will not) overwrite the old bytes on disk. So, do not truncate the file at all.
rewriteFile << zeros;
ofstream does not have an operator<< that takes an int[], or even an int*, as input. But it does have an operator<< that takes a void* as input (to output the value of the memory address being pointed at). An array decays into a pointer to the first element, and void* accepts any pointer. This is why only a few bytes are being written. You need to use ofstream::write() instead to write the array to file, and be sure to open the file with the binary flag.
Try this instead:
int fileSize = boost::filesystem::file_size(filePath);
std::vector<char> zeros(fileSize, 0);
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
rewriteFile.write(zeros.data()/*&zeros[0]*/, fileSize);
That being said, you don't need a dynamically allocated array at all, let alone one that is allocated to the full size of the file. That is just a waste of heap memory, especially for large files. You can do this instead:
int fileSize = boost::filesystem::file_size(filePath);
const char zeros[1024] = {0}; // adjust size as desired...
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
int loops = fileSize / sizeof(zeros);
for(int i = 0; i < loops; ++i) {
    rewriteFile.write(zeros, sizeof(zeros));
}
rewriteFile.write(zeros, fileSize % sizeof(zeros));
Alternatively, if you open a memory-mapped view of the file (MapViewOfFile() on Windows, mmap() on Linux, etc) then you can simply use std::copy() or std::memset() to zero out the bytes of the entire file directly on disk without using an array at all.
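For instance, a minimal sketch with boost::iostreams' memory-mapped file (assuming the file already exists and is writable; the same caveats apply as in the next answer about whether in-place writes actually reach the old sectors):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <string>

namespace bio = boost::iostreams;

void zeroFileInPlace(const std::string& path)
{
    bio::mapped_file file(path); // opened read/write by default
    std::memset(file.data(), 0, file.size());
}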
Also... Is this enough to shred the file?
Not really, no. At the physical hardware layer, overwriting the file just one time with zeros can still leave behind remnant signals in the disk sectors, which can be recovered with sufficient tools. You should overwrite the file multiple times, with varying types of random data, not just zeros. That will more thoroughly scramble the signals in the sectors.
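A minimal multi-pass sketch along those lines (shred() is a hypothetical helper, and as the next answer stresses, modern filesystems and SSDs may not overwrite in place at all):
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <random>
#include <string>
#include <vector>

void shred(const std::string& path, std::size_t fileSize, int passes = 3)
{
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<char> block(4096);

    for (int pass = 0; pass < passes; ++pass) {
        // Open without trunc so we attempt to overwrite in place.
        std::ofstream out(path, std::ios::binary | std::ios::in);
        for (std::size_t written = 0; written < fileSize; written += block.size()) {
            for (char& c : block)
                c = static_cast<char>(dist(rng));
            std::size_t n = std::min(block.size(), fileSize - written);
            out.write(block.data(), static_cast<std::streamsize>(n));
        }
    }
}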
I cannot stress strongly enough the importance of the comments that overwriting a file's contents does not guarantee that any of the original data is overwritten. ALL OTHER ANSWERS TO THIS QUESTION ARE THEREFORE IRRELEVANT ON ANY RECENT OPERATING SYSTEM.
Modern filing systems are extents based, meaning that files are stored as a linked list of allocated chunks. When updating a chunk, it may be faster for the filing system to write a whole new chunk and simply adjust the linked list, so that's what they do. Indeed, copy-on-write filing systems always write a copy of any modified chunk and update their B-tree of currently valid extents.
Furthermore, even if your filing system doesn't do this, your hard drive may use the exact same technique for performance, and any SSD almost certainly uses this technique due to how flash memory works. So overwriting data to "erase" it is meaningless on modern systems. It can't be done. The only safe way to keep old data hidden is full disk encryption. With anything else you are deceiving yourself and your users.
Just for fun, overwriting with random data:
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <random>

namespace bio = boost::iostreams;

int main() {
    bio::mapped_file dst("main.cpp");
    std::mt19937 rng { std::random_device{} () };
    // char is not a valid type parameter for uniform_int_distribution,
    // so draw ints and cast down to char
    std::uniform_int_distribution<int> dist(0, 255);
    std::generate_n(dst.data(), dst.size(), [&] { return static_cast<char>(dist(rng)); });
}
Note that it scrambles its own source file after compilation :)

What's the proper way to read/populate data buffers using message structs in c++?

I have a program that uses char[] buffers to send/receive messages. Until now, this is how it has been handled:
#pragma pack(1)
struct messageType
{
    uint8_t data0:4;
    uint8_t data1:4;
    uint8_t data2;
    //etc...
};
#pragma pack()
void MyClass::processMessage(char* buf)
{
    // I already know buf is big enough to hold a messageType
    messageType* msg = reinterpret_cast<messageType*>(buf);
    // populate class member variables
    m_data0 = msg->data0;
    m_data1 = msg->data1;
    m_data2 = msg->data2;
    //etc
}
Now, from what I've gathered reading around, this is technically undefined behavior due to strict aliasing, and memcpy should be used instead? What I don't quite understand is: what potential issues does copying buf byte for byte into a messageType msgNotPtr, then reading from msgNotPtr, actually avoid?
Regarding sending, instead of doing this:
void MyClass::sendMessage()
{
    char buf[max_tx_size];
    messageType* msg = reinterpret_cast<messageType*>(buf);
    msg->data0 = m_data0;
    //etc...
    send(buf);
}
I've read that I should be using placement new instead, a la:
messageType* msg = new(buf) messageType;
If I do it this way, do I need to add additional cleanup (such as manually invoking the destructor), given that the struct messageType only contains POD types?
Edit: Now that I think about it, is sendMessage still undefined? Do I need to also swap out the last command with something like send(reinterpret_cast<char*>(msg)) to make sure the compiler does not optimize out the call?
In my opinion, the proper method to load a class instance from a uint8_t buffer is to load each member separately.
The class should know the positions of its members within the buffer, and the location and size of any padding or reserved areas.
One of the issues with mapping structures to buffers is that the compiler can add space between members. If you pack the structure to eliminate the padding, you slow down your program every time you access a member.
So, bite the performance on input and output by placing the members where you want them in the buffer (or extracting the members according to a specification). The rest of the program can access the members the way the compiler aligned them.
This includes bit fields.
Also, by loading the members individually, the endianness of the fields in the buffer can be accounted for much more easily, as in the sketch below.
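For the messageType above, a minimal sketch of that approach. Which nibble holds data0 is a protocol decision and is assumed here; the original bit-field layout was implementation-defined anyway:
#include <cstdint>

void MyClass::processMessage(const char* buf)
{
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(buf);
    m_data0 = p[0] & 0x0F;        // assumed: data0 in the low nibble
    m_data1 = (p[0] >> 4) & 0x0F; // assumed: data1 in the high nibble
    m_data2 = p[1];
    // ...remaining fields extracted at their documented offsets
}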

Parsing binary data from file

Thank you in advance for your help!
I am in the process of learning C++. My first project is to write a parser for a binary-file format we use at my lab. I was able to get a parser working fairly easily in Matlab using "fread", and it looks like that may work for what I am trying to do in C++. But from what I've read, it seems that using an ifstream is the recommended way.
My question is two-fold. First, what, exactly, are the advantages of using ifstream over fread?
Second, how can I use ifstream to solve my problem? Here's what I'm trying to do. I have a binary file containing a structured set of ints, floats, and 64-bit ints. There are 8 data fields all told, and I'd like to read each into its own array.
The structure of the data is as follows, in repeated 288-byte blocks:
Bytes 0-3: int
Bytes 4-7: int
Bytes 8-11: float
Bytes 12-15: float
Bytes 16-19: float
Bytes 20-23: float
Bytes 24-31: int64
Bytes 32-287: 64x float
I am able to read the file into memory as a char * array, with the fstream read command:
char * buffer;
ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes
So, from what I understand, I now have a pointer to an array called "buffer". If I were to call buffer[0], I should get a 1-byte memory address, right? (Instead, I'm getting a seg fault.)
What I now need to do really ought to be very simple. After executing the above ifstream code, I should have a fairly long buffer populated with a number of 1's and 0's. I just want to be able to read this stuff from memory, 32-bits at a time, casting as integers or floats depending on which 4-byte block I'm currently working on.
For example, if the binary file contained N 288-byte blocks of data, each array I extract should have N members each. (With the exception of the last array, which will have 64N members.)
Since I have the binary data in memory, I basically just want to read from buffer, one 32-bit number at a time, and place the resulting value in the appropriate array.
Lastly - can I access multiple array positions at a time, a la Matlab? (e.g. array(3:5) -> [1,2,1] for array = [3,4,1,2,1])
Firstly, the advantage of using iostreams, and in particular file streams, relates to resource management. Automatic file stream variables will be closed and cleaned up when they go out of scope, rather than having to manually clean them up with fclose. This is important if other code in the same scope can throw exceptions.
Secondly, one possible way to address this type of problem is to simply define the stream insertion and extraction operators in an appropriate manner. In this case, because you have a composite type, you need to help the compiler by telling it not to add padding bytes inside the type. The following code should work on gcc and microsoft compilers.
#include <cstdint>
#include <istream>
#include <ostream>

#pragma pack(1)
struct MyData
{
    int i0;
    int i1;
    float f0;
    float f1;
    float f2;
    float f3;
    uint64_t ui0;
    float f4[64];
};
#pragma pack()

std::istream& operator>>( std::istream& is, MyData& data ) {
    is.read( reinterpret_cast<char*>(&data), sizeof(data) );
    return is;
}

std::ostream& operator<<( std::ostream& os, const MyData& data ) {
    os.write( reinterpret_cast<const char*>(&data), sizeof(data) );
    return os;
}
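Hypothetical usage, reading every 288-byte block from the file:
#include <fstream>
#include <vector>

std::vector<MyData> readAll(const char* filename)
{
    std::ifstream in(filename, std::ios::binary);
    std::vector<MyData> blocks;
    MyData d;
    while (in >> d)
        blocks.push_back(d);
    return blocks;
}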
char * buffer;
ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes
you need to allocate a buffer first, before you read into it:
buffer = new char[filesize];
datafile.read (buffer, filesize);
As to the advantages of ifstream: it is a matter of abstraction. You can abstract the contents of your file in a more convenient way. You then do not have to work with raw buffers; instead you can model the structure using classes and hide the details of how it is stored in the file, by overloading the << operator for instance.
You might perhaps look for serialization libraries for C++. Perhaps s11n might be useful.
This question shows how you can convert data from a buffer to a certain type. In general, you should prefer using a std::vector<char> as your buffer. This would then look like this:
#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream input("your_file.dat", std::ios::binary);
    std::vector<char> buffer;
    std::copy(std::istreambuf_iterator<char>(input),
              std::istreambuf_iterator<char>(),
              std::back_inserter(buffer));
}
This code will read the entire file into your buffer. The next thing you'd want to do is to write your data into valarrays (for the selection you want). valarray is constant in size, so you have to be able to calculate the required size of your array up-front. This should do it for your format:
std::valarray<int> array1(buffer.size()/288); // each entry takes up 288 bytes
Then you'd use a normal for-loop to insert the elements into your arrays:
for(std::size_t i = 0; i < buffer.size()/288; i++) {
    array1[i] = *(reinterpret_cast<int *>(&buffer[i*288]));     // first position
    array2[i] = *(reinterpret_cast<int *>(&buffer[i*288 + 4])); // second position
}
Note that this assumes int is exactly 4 bytes, which C++ does not guarantee; on a platform where that doesn't hold, the offsets and casts above will be wrong. This question explains a bit about C++ and sizes of types.
The selection you describe there can be achieved using valarray.
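For example, a small sketch of Matlab-style slicing with std::slice (Matlab's array(3:5) is 1-based and inclusive, so it corresponds to a slice starting at 0-based index 2 with length 3):
#include <valarray>

int main() {
    std::valarray<int> array = {3, 4, 1, 2, 1};
    std::valarray<int> sub = array[std::slice(2, 3, 1)]; // start 2, length 3, stride 1 -> {1, 2, 1}
}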

Struct Padding

I am trying to read chunks of data from a file directly into a struct but the padding is causing too much data to be read and the data to be misaligned.
Do I have to manually read each part into the struct or is there an easier way to do this?
My code:
The structs
typedef unsigned char byte;

struct Header
{
    char ID[10];
    int version;
};

struct Vertex // cannot rearrange the order of the members
{
    byte flags;
    float vertex[3];
    char bone;
    byte referenceCount;
};
How I am reading in the data:
std::ifstream in(path.c_str(), std::ifstream::in | std::ifstream::binary);
Header header;
in.read((char*)&header.ID, sizeof(header.ID));
header.ID[9] = '\0';
in.read((char*)&header.version, sizeof(header.version));
std::cout << header.ID << " " << header.version << "\n";
in.read((char*)&NumVertices, sizeof(NumVertices));
std::cout << NumVertices << "\n";
std::vector<Vertex> Vertices(NumVertices);
for(std::vector<Vertex>::iterator it = Vertices.begin(); it != Vertices.end(); ++it)
{
    Vertex& v = (*it);
    in.read((char*)&v.flags, sizeof(v.flags));
    in.read((char*)&v.vertex, sizeof(v.vertex));
    in.read((char*)&v.bone, sizeof(v.bone));
    in.read((char*)&v.referenceCount, sizeof(v.referenceCount));
}
I tried doing: in.read((char*)&Vertices[0], sizeof(Vertices[0]) * NumVertices); but this produces incorrect results because of what I believe to be the padding.
Also: at the moment I am using C-style casts, what would be the correct C++ cast to use in this scenario or is a C-style cast okay?
If you're writing the entire structure out in binary, you don't need to read it as if you had stored each variable separately. You would just read in the size of the structure from file into the struct you have defined.
Header header;
in.read((char*)&header, sizeof(Header));
If you're always running on the same architecture or the same machine, you won't need to worry about endian issues, as you'll be writing the values out the same way your application needs to read them in. If you are creating the file on one architecture and expect it to be portable/usable on another, then you will need to swap bytes accordingly. The way I have done this in the past is to create a swap method of my own (for example, Swap.h).
Swap.h - This is the header you use within your code
void swap(unsigned char *x, int size);
------------------
SwapIntel.cpp - This is what you would compile and link against when building for Intel
void swap(unsigned char *x, int size)
{
    return; // Do nothing, assuming this is the format the file was written in for Intel (little-endian)
}
------------------
SwapSolaris.cpp - This is what you would compile and link against when building for Solaris
void swap(unsigned char *x, int size)
{
    // Byte swapping code here to switch from little-endian to big-endian, as the file was written on Intel
    // and this file will be the implementation used within the Solaris build of your product
    return;
}
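For reference, a minimal sketch of what the byte-reversal body might look like on the Solaris side (an illustration, not the actual project code):
void swap(unsigned char *x, int size)
{
    // Reverse the 'size' bytes of the value in place.
    for (int i = 0, j = size - 1; i < j; ++i, --j)
    {
        unsigned char tmp = x[i];
        x[i] = x[j];
        x[j] = tmp;
    }
}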
No, you don't have to read each field separately. This is called alignment/packing. See http://en.wikipedia.org/wiki/Data_structure_alignment
In this case, the C-style cast does the same thing as a reinterpret_cast, and you use it correctly. You may use the C++-specific syntax, but it is a lot more typing.
You can change padding by explicitly asking your compiler to align structs on 1 byte instead of 4 or whatever its default is. Depending on environment, this can be done in many different ways, sometimes file by file ('compilation unit') or even struct by struct (with pragmas and such) or only on the whole project.
Note that header.ID[10] = '\0'; would write past the end of the array: header.ID[9] is the last element of the array.
If you are using a Microsoft compiler then explore the align pragma. There are also the alignment include files:
#include <pshpack1.h>
// your code here
#include <poppack.h>
GNU gcc has a different system that allows you to add alignment/padding to the structure definition.
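For instance, a sketch of the attribute-based equivalent applied to the question's Header (the attribute is GCC/Clang-specific):
struct Header
{
    char ID[10];
    int version;
} __attribute__((packed));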
If you are reading and writing this file yourself, try Google Protobuf library. It will handle all byteorder, alignment, padding and language interop issues.
http://code.google.com/p/protobuf/

boost memorybuffer and char array

I'm currently unpacking one of Blizzard's .mpq files for reading.
For accessing the unpacked char buffer, I'm using a boost::interprocess::stream::memorybuffer.
Because .mpq files have a chunked structure always beginning with a version header (usually 12 bytes, see http://wiki.devklog.net/index.php?title=The_MoPaQ_Archive_Format#2.2_Archive_Header), the char* array representation seems to truncate at the first \0, even though the file size (about 1.6 MB) remains constant and (probably) fully allocated.
The result is a streambuffer with an effective length of 4 ('REVM', and byte no. 5 is \0). When attempting to read further, an exception is thrown. Here is an example:
// (somewhere in the code)
{
MPQFile curAdt(FilePath);
size_t size = curAdt.getSize(); // roughly 1.6 mb
bufferstream memorybuf((char*)curAdt.getBuffer(), curAdt.getSize());
// bufferstream.m_buf.m_buffer is now 'REVM\0' (Debugger says so),
// but internal length field still at 1.6 mb
}
//////////////////////////////////////////////////////////////////////////////
// wrapper around a file of the mpq_archive of libmpq
MPQFile::MPQFile(const char* filename) // I apologize for my inconsistent naming convention :P
{
    for(ArchiveSet::iterator i=gOpenArchives.begin(); i!=gOpenArchives.end();++i)
    {
        // gOpenArchives points to MPQArchive, wrapper around the mpq_archive, has mpq_archive * mpq_a as member
        mpq_archive &mpq_a = (*i)->mpq_a;
        // if file exists in that archive, tested via hash table in file, not important here, scroll down if you want
        mpq_hash hash = (*i)->GetHashEntry(filename);
        uint32 blockindex = hash.blockindex;
        if ((blockindex == 0xFFFFFFFF) || (blockindex == 0)) {
            continue; // file not found
        }
        uint32 fileno = blockindex;
        // Found!
        size = libmpq_file_info(&mpq_a, LIBMPQ_FILE_UNCOMPRESSED_SIZE, fileno);
        // HACK: in patch.mpq some files don't want to open and give 1 for filesize
        if (size<=1) {
            eof = true;
            buffer = 0;
            return;
        }
        buffer = new char[size]; // note: size is 1.6 mb at this time
        // Now here comes the tricky part... if I step over the libmpq_file_getdata
        // function, I'll get my truncated char array, which I absolutely don't want^^
        libmpq_file_getdata(&mpq_a, hash, fileno, (unsigned char*)buffer);
        return;
    }
}
Maybe someone could help me. I'm really new to STL and Boost programming and inexperienced in C++ in general :P I hope to get a helpful answer (please don't suggest rewriting libmpq and the underlying zlib architecture^^).
The MPQFile class and the underlying decompression methods are actually taken from a working project, so the mistake is either somewhere in the use of the buffer with the streambuffer class or something internal to char array arithmetic I haven't a clue of.
By the way, what is the difference between using signed and unsigned chars as data buffers? Does it have anything to do with my problem? (You might notice that in the code, char* and unsigned char* are used interchangeably as function arguments.)
If you need more infos, feel free to ask :)
How are you determining that your char* array is being 'truncated' as you call it? If you're printing it or viewing it in a debugger it will look truncated because it will be treated like a string, which is terminated by \0. The data in 'buffer' however (assuming libmpq_file_getdata() does what it's supposed to do) will contain the whole file or data chunk or whatever.
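One way to verify this is to dump the first bytes in hex instead of treating the buffer as a C string; a quick hypothetical helper:
#include <cstddef>
#include <cstdio>

void dumpPrefix(const unsigned char* buf, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", buf[i]);
    std::printf("\n");
}
For example, dumpPrefix(reinterpret_cast<const unsigned char*>(buffer), 16); should show the full header bytes, not just 'REVM'.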
Sorry, I messed up the terms a bit (it's not memorybuffer actually; streambuffer is what's meant, as in the code).
Yeah, you were right... I had a mistake in my exception handling. Right after that first bit of code comes this:
// check if the file has been open
//if (!mpf.is_open())
pair<char*, size_t> temp = memorybuf.buffer();
if(temp.first)
    throw AdtException(ADT_PARSEERR_EFILE); // Can't open the file
Notice the missing ! before temp.first. I was surprised by the exception thrown, looked at the streambuffer's internal buffer, and was confused by its length (C# background :P).
Sorry for that, it's working as expected now.