Writing aligned data in binary file - c++

I am creating a file with some data objects inside. Data objects have different sizes and look something like this (very simplified):
struct Data{
    uint64_t size;
    char blob[MAX_SIZE];
    // ... methods here:
};
At some later step, the file will be mmap()ed into memory,
so I want every data object to start at a memory address aligned to 8 bytes, where the uint64_t size will be stored (let's ignore endianness).
The code looks more or less like this (alignment currently hardcoded to 8 bytes):
size_t calcAlign(size_t const value, size_t const align_size){
    return align_size - value % align_size;
}

template<class ITERATOR>
void process(std::ofstream &file_data, ITERATOR begin, ITERATOR end){
    for(auto it = begin; it != end; ++it){
        const auto &data = *it;

        size_t bytesWriten = data.writeToFile(file_data);

        size_t const alignToBeAdded = calcAlign(bytesWriten, 8);

        if (alignToBeAdded != 8){
            uint64_t const placeholder = 0;
            file_data.write( (const char *) & placeholder, (std::streamsize) alignToBeAdded);
        }
    }
}
Is this the best way to achieve alignment inside a file?

You don't need to rely on writeToFile to return the size; you can use ofstream::tellp:
const auto beginPos = file_data.tellp();
// write stuff to file
const auto alignSize = (file_data.tellp() - beginPos) % 8;
if (alignSize)
    file_data.write("\0\0\0\0\0\0\0\0", 8 - alignSize);
EDIT post OP comment:
Tested on a minimal example and it works.
#include <iostream>
#include <fstream>

int main(){
    using namespace std;
    ofstream file_data;
    file_data.open("tempfile.dat", ios::out | ios::binary);

    const auto beginPos = file_data.tellp();
    file_data.write("something", 9);

    const auto alignSize = (file_data.tellp() - beginPos) % 8;
    if (alignSize)
        file_data.write("\0\0\0\0\0\0\0\0", 8 - alignSize);

    file_data.close();
    return 0;
}

You can optimize the process by manipulating the input buffer instead of the file handling. Modify your Data struct so the code that fills the buffer takes care of the alignment.
struct Data{
    uint64_t size;
    char blob[MAX_SIZE];
    // ... other methods here

    // Ensure buffer alignment
    static_assert(MAX_SIZE % 8 == 0, "blob size must be a multiple of 8 bytes to avoid buffer overflow when padding.");

    uint64_t Fill(const char* data, uint64_t dataLength) {
        // Validations...
        memcpy(this->blob, data, dataLength);
        this->size = dataLength;
        const auto paddingLen = calcAlign(dataLength, 8) % 8;
        if (paddingLen > 0) {
            memset(this->blob + dataLength, 0, paddingLen);
        }
        // Return the aligned size
        return dataLength + paddingLen;
    }
};
Now when you pass the data to the "process" function, simply use the size returned from Fill, which ensures 8-byte alignment.
This way you still take care of the alignment manually, but you don't have to write to the file twice.
Note: this code assumes you use Data as the input buffer as well. Apply the same principles if your code uses another object to hold the buffer before it is written to the file.
If you can use POSIX, see also pwrite


How to write custom binary file handler in c++ with serialisation of custom objects?

I have some structures I want to serialise and deserialise to be able to pass them from program to program (as a save file), and to be manipulated by other programs (making minor changes, etc.).
I've read through:
Document that describes isocpp serialisation explanation
SO questions that show how to read blocks
SO question how to reading and writing binary files
Benchmarking different file handlers speed and reliance
Serialisation "intro"
But I haven't found anywhere how to get from having some class or struct to a serialised structure that you can then read, write, and manipulate, whether singular (one structure per file) or in sequence (multiple lists of multiple structure types per file).
How to write custom binary file handler in c++ with serialisation of custom objects ?
Before we start
Many new users aren't familiar with the different data types in C++ and often use plain int, char, etc. in their code. To do serialisation successfully, one needs to think thoroughly about their data types. Therefore, these are your first steps if you have an int lying around somewhere.
Know your data
What is maximum value that variable should hold?
Can it be negative?
Limit your data
Implement decisions from above
Limit amount of objects your file can hold.
Know your data
If you have an struct or a class with some data as:
struct cat {
    int weight = 0; // In kg (or pounds)
    int length = 0; // In cm (or feet)
    std::string voice = "meow.mp3";

    cat() {}
    cat(int weight, int length): weight(weight), length(length) {}
};
Can your cat really weigh around 255 kg (the maximum value of a 1-byte unsigned integer)? Can it be as long as 255 cm (2.55 m)? Does the voice of your cat change with every cat object?
Objects that don't change should be declared static, and you should limit your member sizes to best fit your needs. In these examples, the answer to each of the questions above is no.
So our cat struct now looks like this:
struct cat {
    uint8_t weight = 0; // Non-negative 8 bit (1 byte) integer (i.e. unsigned char)
    uint8_t length = 0; // Same for length
    static std::string voice;

    cat() {}
    cat(uint8_t w, uint8_t l): weight(w), length(l) {}
};

std::string cat::voice = "meow.mp3";
Files are written byte by byte (often as character sets), and your data can vary, so you need to presume or limit the maximum value your data can hold.
But not every project (or structure) is the same, so let's talk about the difference between your in-code data and binary structured data. When thinking about serialisation, you need to ask: "what is the bare minimum of data this structure needs to be unique?".
For our cat object, it can represent anything besides:
tigers: max 390 kg, and 340 cm
lions : max 315 kg, and 365 cm
Anything else is eligible. So you can vary your "meow.mp3" depending on the size and weight, and the most important data that makes a cat unique is its length and weight. That is the data we need to save to our file.
Limit your data
The largest zoo in the world has 5000 animals across 700 species, which means that on average each species in the zoo has a population of around 10. This means that for our cat species we can store counts within a single byte and not fear going over it.
So it is safe to assume that our zoo project should hold up to 200 elements per species. This leaves us with two different byte-sized data members, so the serialised data for our struct is at most two bytes.
Approach to serialisation
Constructing our cat block
For starters, this is a great way to begin: it approaches custom serialisation from the right foundation. Now all that is left is to define a structured binary format. For that we need a way to recognise whether our two bytes are part of a cat or of some other structure, which can be done either with a same-type collection (every two bytes are cats) or with an identifier.
If we have a single file (or part of a file) that holds all cats, we just need the start offset in the file and the size of a cat record, and then we read every two bytes from the start offset to get all the cats.
An identifier is a way to tell, from the starting character, whether the object is a cat or something else. This is commonly done with the TLV (Type-Length-Value) format, where the type would be Cat, the length would be two bytes, and the value would be those two bytes.
As you can see, the first option contains fewer bytes and is therefore more compact, but with the second option we have the ability to store multiple animals in our file and make a zoo. How you structure your binary files depends a lot on your project. Since a single file holding the whole zoo is the most logical to work with, I will implement the second option.
The most important thing about the "identifier" approach is to first make it logical for us, and then make it logical for our machine. I come from a world where reading from left to right is the norm, so it is logical that the first thing I want to read about a cat is its type, then its length, then its value.
char type = 'C'; // C shorten for Cat, 0x43
uint8_t length = 2; // It holds 2 bytes, 0x02
uint8_t c_length = '?'; // cats length
uint8_t c_weight = '?'; // cats weight
And to represent it as a chunk (block):
+00 4B 43-02-LL-WW ('C\x02LW')
Where this means:
+00: offset form the start, 0 means it is start of the file
4B: size of our data block, 4 bytes.
43-02-LL-WW: actual value of cat
43: hexadecimal representation of character 'C'
02: hexadecimal representation of length of this type (2)
LL: length of this cat of 1 byte value
WW: weight of this cat of 1 byte value
But since it is easier for me to read data from left to right, my data should be written as little endian, while the machine writing the file may use either byte order.
Endianness and why it matters
The main issue here is the endianness of our machine: because of it, our struct/class needs a defined base byte order. The way we wrote it assumes little endian, but machines can be of either endianness, and you can find out which yours has here.
For users experienced with bit fields, I would strongly suggest using them for this. But for unfamiliar users:
#include <iostream> // Just for std::ostream, std::cout, and std::endl
#include <cstdint>  // Fixed-width integer types

bool is_big() {
    union {
        uint16_t w;
        uint8_t p[2];
    } p;
    p.w = 0x0001;
    return p.p[0] == 0x0; // big endian machines store the most significant byte first
}

union chunk {
    uint32_t space;
    uint8_t parts[4];
};

chunk make_chunk(uint32_t VAL) {
    union chunk ret;
    ret.space = VAL;
    return ret;
}

std::ostream& operator<<(std::ostream& os, const union chunk &c) {
    if (is_big()) {
        return os << c.parts[3] << c.parts[2] << c.parts[1] << c.parts[0];
    } else {
        return os << c.parts[0] << c.parts[1] << c.parts[2] << c.parts[3];
    }
}

void read_as_binary(union chunk &t, uint32_t VAL) {
    t.space = VAL;
    if (is_big()) {
        t.space = (t.parts[3] << 24) | (t.parts[2] << 16) | (t.parts[1] << 8) | t.parts[0];
    }
}

void write_as_binary(union chunk t, uint32_t &VAL) {
    if (is_big()) {
        t.space = (t.parts[3] << 24) | (t.parts[2] << 16) | (t.parts[1] << 8) | t.parts[0];
    }
    VAL = t.space;
}
So now we have a chunk that will print out characters in an order we can recognise at first glance. Next we need casting functionality between uint32_t and our cat, since our chunk size is 4 bytes, i.e. a uint32_t.
struct cat {
    uint8_t weight = 0; // Non-negative 8 bit (1 byte) integer (i.e. unsigned char)
    uint8_t length = 0; // The same for length
    static std::string voice;

    cat() {}
    cat(uint8_t w, uint8_t l): weight(w), length(l) {}
    cat(union chunk cat_chunk) {
        if ((cat_chunk.space & 0x43020000) == 0x43020000) {
            this->length = cat_chunk.space & 0xff; // Bit shifts circumvent the endianness issue
            this->weight = (cat_chunk.space >> 8) & 0xff;
        } else {
            // Some error handling
            this->weight = 0;
            this->length = 0;
        }
    }
    operator uint32_t() {
        return 0x43020000 | (this->weight << 8) | this->length;
    }
};

std::string cat::voice = "meow.mp3";
Zoo file structure
So now our cat object is ready to be cast back and forth between chunk and cat. Next we need to structure a whole file with header, footer, data, and checksums. Let's say we are building an application for zoo facilities to keep track of how many animals they have. The data of our zoo is which animals they have and how many; the footer of our zoo can be omitted (or can hold the timestamp of when the file was created); and in the header we store instructions on how to read our file, versioning, and corruption checks.
For more information on how I structured these files, see the sources here and this shameless plug.
// File structure: all little endian
------------
HEADER:
+00 4B 89-5A-4F-4F ('\211ZOO') Our magic number for the zoo file
+04 4B XX-XX-XX-XX ('????') Whole file checksum
+08 4B 47-0D-1A-0A ('G\r\032\n') // CRLF <-> LF conversion check and END OF FILE 032
+12 4B YY-YY-00-ZZ ('??\0?') Versioning and usage
+16 4B AA-BB-BB-BB ('X???') Start offset + data length
------------
DATA:
Animals: // For each animal type (block identifier)
+20+?? 4B ??-XX-XX-LL ('????') : ? animal type identifier, X start offset from header, Y animals in struct objects
+24+??+4 4B XX-XX-XX-XX ('????') : Checksum for animal type
For checksums, you can use a simple one (manually add each byte) or, among others, CRC-32. The choice is yours, and it depends on the size of your files and data. So now we have the data for our file. Of course, I must warn you:
Having only one structure or class that requires serialisation means that, in general, this type of serialisation isn't needed: you can just cast the whole object to an integer of a suitable size and then to a binary character sequence, and later read that character sequence back into an integer and back into the object. The real value of serialisation is that we can store multiple kinds of data and find our way in that binary mess.
But since a zoo can hold more data than just which animals we have, and chunks can vary in size, we need an interface or abstract class for file handling.
#include <fstream> // File input output ...
#include <vector> // Collection for writing data
#include <sys/types.h> // Gets the types for struct stat
#include <sys/stat.h> // Struct stat
#include <string> // String manipulations
struct handle {
    // Members
protected: // Inherited as protected
    std::string extn = "o";
    bool acces = false;
    struct stat buffer;
    std::string filename = "";
    std::vector<chunk> data;
public: // Inherited as public
    std::string name = "genesis";
    std::string path = "";

    // Methods
protected:
    void remake_name() {
        this->filename = this->path;
        if (this->filename != "") {
            this->filename.append("//");
        }
        this->filename.append(this->name);
        this->filename.append(".");
        this->filename.append(this->extn);
    }
    void recheck() {
        this->acces = (stat(this->filename.c_str(), &this->buffer) == 0);
    }
    // To be overridden later on [override]
    virtual bool check_header() { return true; }
    virtual bool check_footer() { return true; }
    virtual bool load_header() { return true; }
    virtual bool load_footer() { return true; }
public:
    handle()
        : extn("o"),
          acces(false),
          filename(""),
          data(0),
          name("genesis"),
          path("") {}
    void operator()(const char *name, const char *ext, const char *path) {
        this->path = std::string(path);
        this->name = std::string(name);
        this->extn = std::string(ext);
        this->remake_name();
        this->recheck();
    }
    void set_prefix(const char *prefix) {
        std::string prn(prefix);
        prn.append(this->name);
        this->name = prn;
        this->remake_name();
    }
    void set_suffix(const char *suffix) {
        this->name.append(suffix);
        this->remake_name();
    }
    int write() {
        this->remake_name();
        this->recheck();
        if (!this->load_header()) { return 0; }
        if (!this->load_footer()) { return 0; }
        std::fstream file(this->filename.c_str(), std::ios::out | std::ios::binary);
        uint32_t temp = 0;
        for (size_t i = 0; i < this->data.size(); i++) {
            write_as_binary(this->data[i], temp);
            file.write((char *)(&temp), sizeof(temp));
        }
        if (!this->check_header()) { file.close(); return 0; }
        if (!this->check_footer()) { file.close(); return 0; }
        file.close();
        return 1;
    }
    int read() {
        this->remake_name();
        this->recheck();
        if (!this->acces) { return 0; }
        std::fstream file(this->filename.c_str(), std::ios::in | std::ios::binary);
        uint32_t temp = 0;
        chunk ctemp;
        size_t fsize = this->buffer.st_size / 4;
        for (size_t i = 0; i < fsize; i++) {
            file.read((char *)(&temp), sizeof(temp));
            read_as_binary(ctemp, temp);
            this->data.push_back(ctemp);
        }
        if (!this->check_header()) {
            file.close();
            this->data.clear();
            return 0;
        }
        if (!this->check_footer()) {
            file.close();
            this->data.clear();
            return 0;
        }
        file.close();
        return 1;
    }

    // Friends
    friend std::ostream& operator<<(std::ostream& os, const handle& hand);
    friend handle& operator<<(handle& hand, chunk& c);
    friend handle& operator>>(handle& hand, chunk& c);
    friend struct zoo_file;
};
std::ostream& operator<<(std::ostream& os, const handle& hand) {
    for (size_t i = 0; i < hand.data.size(); i++) {
        os << "\t" << hand.data[i] << "\n";
    }
    return os;
}

handle& operator<<(handle& hand, chunk& c) {
    hand.data.push_back(c);
    return hand;
}

handle& operator>>(handle& hand, chunk& c) {
    c = hand.data[hand.data.size() - 1];
    hand.data.pop_back();
    return hand;
}
From this we can derive our zoo object and, later on, whatever else we need. The file handle is just a file template containing a data block (handle.data); headers and footers are implemented later by overriding.
Since headers describe whole files, the checking and loading can carry whatever added functionality your specific case needs. If you have two different object types to add to a file, instead of changing the headers/footers, insert one type at the start of the data and push_back the other at the end via the overloaded operator<</operator>>.
For multiple objects that have no relationship between each other, you can add more private members in the inheriting class for storing the current position of individual segments, keeping things neat and organised for file writing and reading.
struct zoo_file: public handle {
    zoo_file() { this->extn = "zoo"; }
    void operator()(const char *name, const char *path) {
        this->path = std::string(path);
        this->name = std::string(name);
        this->remake_name();
        this->recheck();
    }
protected:
    virtual bool check_header() {
        chunk temp = this->data[0];
        uint32_t checksums = 0;
        // Magic number
        if (temp.space != 0x895A4F4F) {
            this->data.clear();
            return false;
        } else {
            this->data.erase(this->data.begin());
        }
        // Checksum
        temp = this->data[0];
        checksums = temp.space;
        this->data.erase(this->data.begin());
        // Valid load number
        temp = this->data[0];
        if (temp.space != 0x470D1A0A) {
            this->data.clear();
            return false;
        } else {
            this->data.erase(this->data.begin());
        }
        // Version + flag
        temp = this->data[0];
        if ((temp.space & 0x01000000) != 0x01000000) { // If not version 1.0
            this->data.clear();
            return false;
        } else {
            this->data.erase(this->data.begin());
        }
        temp = this->data[0];
        int opt_size = (temp.space >> 24);
        if (opt_size != 20) {
            this->data.clear();
            return false;
        }
        opt_size = temp.space & 0xffffff;
        return (opt_size == (int)this->data.size());
    }
    virtual bool load_header() {
        chunk magic, checksum, vload, ver_flag, off_data;
        magic.space = 0x895A4F4F;
        checksum.space = 0;
        vload.space = 0x470D1A0A;
        ver_flag.space = 0x01000001; // 1.0, usage 1 (normal)
        off_data.space = (20 << 24) | ((this->data.size() - 1) - 4);
        for (size_t i = 0; i < this->data.size(); i++) {
            checksum.space += this->data[i].parts[0];
            checksum.space += this->data[i].parts[1];
            checksum.space += this->data[i].parts[2];
            checksum.space += this->data[i].parts[3];
        }
        this->data.insert(this->data.begin(), off_data);
        this->data.insert(this->data.begin(), ver_flag);
        this->data.insert(this->data.begin(), vload);
        this->data.insert(this->data.begin(), checksum);
        this->data.insert(this->data.begin(), magic);
        return true;
    }
    friend zoo_file& operator<<(zoo_file& zf, cat &sc);
    friend zoo_file& operator>>(zoo_file& zf, cat &sc);
    friend zoo_file& operator<<(zoo_file& zf, elephant &se);
    friend zoo_file& operator>>(zoo_file& zf, elephant &se);
};
zoo_file& operator<<(zoo_file& zf, cat &sc) {
    union chunk temp;
    temp.space = (uint32_t)sc;
    zf.data.push_back(temp);
    return zf;
}

zoo_file& operator>>(zoo_file& zf, cat &sc) {
    size_t pos = zf.data.size() - 1;
    union chunk temp;
    while (1) {
        if ((zf.data[pos].space & 0x43020000) != 0x43020000) {
            pos--;
        } else {
            temp = zf.data[pos];
            break;
        }
        if (pos == 0) { break; }
    }
    zf.data.erase(zf.data.begin() + pos);
    sc = cat(temp);
    return zf;
}
// same for elephants, koyotes, giraffes .... whatever you need
Please don't just copy this code. The handle object is meant as a template, so how you structure your data block is up to you. If you have a different structure and just copy the code, of course it won't work.
And now we can have a zoo with only cats. Building a file is as easy as:
// All necessary includes

// Writing the zoo file
zoo_file my_zoo;
// push_back some cats into the std::vector
my_zoo("superb_zoo", "");
my_zoo.write();

// Reading the zoo file
zoo_file my_zoo;
my_zoo("superb_zoo", "");
my_zoo.read();

C++ Struct to Byte* throwing error

I have attached my code below. I do not see what I am doing wrong. I have a struct that I am trying to serialize into a byte array, and I have written some simple code to test it. It all appears to work during runtime when I print out the values of the objects, but once I hit return 0 it throws the error:
Run-Time Check Failure #2 - Stack around the variable 'command' was corrupted.
I do not see the issue. I appreciate all help.
namespace CommIO
{
    enum Direction { READ, WRITE };

    struct CommCommand
    {
        int command;
        Direction dir;
        int rwSize;
        BYTE* wData;

        CommCommand(BYTE* bytes)
        {
            int offset = 0;
            int intsize = sizeof(int);
            command = 0;
            dir = READ;
            rwSize = 0;

            memcpy(&command, bytes + offset, intsize);
            offset += intsize;
            memcpy(&dir, bytes + offset, intsize);
            offset += intsize;
            memcpy(&rwSize, bytes + offset, intsize);
            offset += intsize;

            wData = new BYTE[rwSize];
            if (dir == WRITE)
            {
                memcpy(&wData, bytes + offset, rwSize);
            }
        }
        CommCommand() {}
    };
}

int main()
{
    CommIO::CommCommand command;
    command.command = 0x6AEA6BEB;
    command.dir = CommIO::WRITE;
    command.rwSize = 128;
    command.wData = new BYTE[command.rwSize];
    for (int i = 0; i < command.rwSize; i++)
    {
        command.wData[i] = i;
    }
    command.print();

    CommIO::CommCommand command2(reinterpret_cast<BYTE*>(&command));
    command2.print();

    cin.get();
    return 0;
}
The following points mentioned in comments are most likely the causes of your problem.
You seem to be assuming that the size of Direction is the same as the size of an int. That may indeed be the case, but C++ does not guarantee it.
You also seem to be assuming that the members of CommIO::CommCommand will be laid out in memory without any padding between, which again may happen to be the case, but is not guaranteed.
There are a couple of ways to fix that.
Make sure that you fill up the BYTE array in the calling function with matching objects, or
Simply cast the BYTE* to CommCommand* and access the members directly.
For (1), you can use:
int command = 0x6AEA6BEB;
int dir = CommIO::WRITE;
int rwSize = 128;
int totalSize = rwSize + 3 * sizeof(int);
BYTE* data = new BYTE[totalSize];

int offset = 0;
memcpy(data + offset, &command, sizeof(int));
offset += sizeof(int);
memcpy(data + offset, &dir, sizeof(int));
offset += sizeof(int);
memcpy(data + offset, &rwSize, sizeof(int));
offset += sizeof(int);
for (int i = 0; i < rwSize; i++)
{
    data[i + offset] = i;
}

CommIO::CommCommand command2(data);
For (2), you can use:
CommCommand(BYTE* bytes)
{
    CommCommand* in = reinterpret_cast<CommCommand*>(bytes);
    command = in->command;
    dir = in->dir;
    rwSize = in->rwSize;
    wData = new BYTE[rwSize];
    if (dir == WRITE)
    {
        memcpy(wData, in->wData, rwSize);
    }
}
The other error is that you are using
memcpy(&wData, bytes + offset, rwSize);
That is incorrect since you are treating the address of the variable as though it can hold the data. It cannot.
You need to use:
memcpy(wData, bytes + offset, rwSize);
The memory for your struct may be laid out with padding between members; this can be rectified by adding #pragma pack(push, 1) before the struct and #pragma pack(pop) after it.
For your struct to byte conversion, I would use something simple as:
template<typename T, typename IteratorForBytes>
void ConvertToBytes(const T& t, IteratorForBytes bytes, std::size_t pos = 0)
{
    std::advance(bytes, pos);
    const std::size_t length = sizeof(t);
    const uint8_t* temp = reinterpret_cast<const uint8_t*>(&t);
    for (std::size_t i = 0; i < length; ++i)
    {
        (*bytes) = (*temp);
        ++temp;
        ++bytes;
    }
}
Where T is the struct (in your case, your CommCommand struct) and bytes would be the array.
CommIO::CommCommand command;
command.wData = new BYTE[command.rwSize];
ConvertToBytes(command, command.wData);
The resulting array would contain the expected bytes. You could also pass the offset as an extra parameter if you want to start filling your byte array from a particular location.
The main problem is here:
memcpy(&wData, bytes + offset, rwSize);
Member wData is a BYTE *, and you seem to mean to copy bytes into the space to which it points. Instead, you are copying data into the memory where the pointer value itself is stored. Therefore, if you copy more bytes than the size of the pointer then you will overrun its bounds and produce undefined behavior. In any case, you are trashing the original pointer value. You probably want this, instead:
memcpy(wData, bytes + offset, rwSize);
Additionally, although the rest of the deserialization code may be right for your actual serialization format, it is not safe to assume that it is right for the byte sequence you present to it in your test program via
CommIO::CommCommand command2(reinterpret_cast<BYTE*>(&command));
As detailed in comments, you are making assumptions about the layout in memory of a CommIO::CommCommand that C++ does not guarantee will hold.
At
memcpy(&wData, bytes + offset, rwSize);
you copy from the location of the wData pointer and to the location of the wData pointer of the new CommCommand. But you want to copy from and to the location that the pointer points to. You need to dereference. You corrupt memory, because you have only sizeof(BYTE*) of space (plus some slack, because heap blocks cannot be arbitrarily small), but you copy rwSize bytes, which is 128 bytes. What you probably meant to write is:
memcpy(wData, *(BYTE**)(bytes + offset), rwSize);
which would use the pointer stored at bytes + offset, rather than the value of bytes + offset itself.
You also assume that your struct is tightly packed. However, C++ does not guarantee that. Is there a reason why you do not override the default copy constructor rather than write this function?

Is there a better way to handle incomplete data in a buffer and reading?

I am processing a binary file that is built up of events. Each event can have a variable length. Since my read buffer is a fixed size I handle things as follows:
const int bufferSize = 0x500000;
const int readSize = 0x400000;
const int eventLengthMask = 0x7FFE0000;
const int eventLengthShift = 17;
const int headerLengthMask = 0x1F000;
const int headerLengthShift = 12;
const int slotMask = 0xF0;
const int slotShift = 4;
const int channelMask = 0xF;
...
// allocate the buffer; we allocate 5 MB even though we read in 4 MB chunks,
// to deal with unprocessed data from the end of a read
char* allocBuff = new char[bufferSize]; // inFile reads data into here
unsigned int* buff = reinterpret_cast<unsigned int*>(allocBuff); // data is interpreted from here

inFile.open(fileName.c_str(), ios_base::in | ios_base::binary);
int startPos = 0;
while (!inFile.eof())
{
    int index = 0;
    inFile.read(&(allocBuff[startPos]), readSize);
    int size = ((readSize + startPos) >> 2);
    // loop to process the buffer
    while (index < size)
    {
        unsigned int data = buff[index];
        int eventLength = ((data & eventLengthMask) >> eventLengthShift);
        int headerLength = ((data & headerLengthMask) >> headerLengthShift);
        int slot = ((data & slotMask) >> slotShift);
        int channel = data & channelMask;
        // now check if the full event is in the buffer
        if ((index + eventLength) > size)
        {   // the full event is not in the buffer
            break;
        }
        ++index;
        // further processing of the event
    }
    // move the data at the end of the buffer to the beginning and set the
    // start position for the next read
    for (int i = index; i < size; ++i)
    {
        buff[i - index] = buff[i];
    }
    startPos = ((size - index) << 2);
}
My question is this: Is there a better to handle having unprocessed data at the end of the buffer?
You could improve it by using a circular buffer rather than a simple array. That, or a circular iterator over the array. Then you don't need to do all that copying; the "start" of the array simply moves.
Other than that, no, not really.
When I encountered this problem in the past, I simply copied the unprocessed data down, and then read from the end of it. This is a valid solution (and by far the simplest) if the individual elements are fairly small and the buffer is large. (On a modern machine, "fairly small" can easily be anything up to a couple of hundred KB.) Of course, you'll have to keep track of how much you've copied down, to adjust the pointer and the size of the next read.
Beyond that:
You'd be better off using std::vector<char> for the buffer.
You can't convert four bytes read from a disk into an unsigned int just by casting its address; you have to insert each of the bytes into the unsigned int where it belongs.
And finally: you don't check that the read has succeeded before processing the data. Using unbuffered input with an istream is a bit tricky: your loop should probably be something like
while ( inFile.read( addr, len ) || inFile.gcount() != 0 )...

C++ defensive programming: reading from a buffer with type safety

Let's say I have a class that I don't own: DataBuffer. It provides various get member functions:
get(uint8_t *value);
get(uint16_t *value);
...
When reading from a structure contained in this buffer, I know the order and size of fields, and I want to reduce the chance of future code changes causing an error:
struct Record
{
    uint16_t Header;
    uint16_t Content;
};

void ReadIntoRecord(Record* r)
{
    DataBuffer buf( /* initialized from the network with bytes */ );
    buf.get(&r->Header); // Good!
    buf.get(&r->Content);
}
Then someone checks in a change to do something with the header before writing it:
uint8_t customHeader;
buf.get(&customHeader); // Wrong, stopped reading after only 1 byte
r->Header = customHeader + 1;
buf.get(&r->Content); // now we're reading from the wrong part of the buffer.
Is the following an acceptable way to harden the code against changes? Remember, I can't change the function names to getByte, getUShort, etc. I could inherit from DataBuffer, but that seems like overkill.
buf.get(static_cast<uint16_t*>(&r->Header)); // compiler will catch incorrect variable type
buf.get(static_cast<uint16_t*>(&r->Content));
Updated with not-eye-safe legacy code example:
float dummy_float;
uint32_t dummy32;
uint16_t dummy16;
uint8_t dummy8;
uint16_t headTypeTemp;
buf.get(static_cast<uint16_t*>(&headTypeTemp));
m_headType = HeadType(headTypeTemp);
buf.get(static_cast<uint8_t*>(&hid));
buf.get(m_Name);
buf.get(m_SerialNumber);
float start;
buf.get(static_cast<float*>(&start));
float stop;
buf.get(static_cast<float*>(&stop));
buf.get(static_cast<float*>(&dummy_float));
setStuffA(dummy_float);
buf.get(static_cast<uint16_t*>(&dummy16));
setStuffB(float(dummy16)/1000);
buf.get(static_cast<uint8_t*>(&dummy8)); //reserved
buf.get(static_cast<uint32_t*>(&dummy32));
Entries().setStart( dummy32 );
buf.get(static_cast<uint32_t*>(&dummy32));
Entries().setStop( dummy32 );
buf.get(static_cast<float*>(&dummy_float));
Entries().setMoreStuff( dummy_float );
uint32_t datalength;
buf.get(static_cast<uint32_t*>(&datalength));
Entries().data().setLength(datalength);
RetVal ret = ReturnCode::SUCCESS;
Entry* data_ptr = Entries().data().data();
for (unsigned int i = 0; i < datalength && ret == ReturnCode::SUCCESS; i++)
{
ret = buf.get(static_cast<float*>(&dummy_float));
data_ptr[i].FieldA = dummy_float;
}
for (unsigned int i = 0; i < datalength && ret == ReturnCode::SUCCESS; i++)
{
ret = buf.get(static_cast<float*>(&dummy_float));
data_ptr[i].FieldB = dummy_float;
}
// Read in the normalization vector
Util::SimpleVector<float> norm;
buf.get(static_cast<uint32_t*>(&datalength));
norm.setLength(datalength);
for (unsigned int i=0; i<datalength; i++)
{
norm[i] = buf.getFloat();
}
setNormalization(norm);
return ReturnCode::SUCCESS;
}
Don't use overloading. Why not have get_word and get_dword calls? The interface isn't going to be any uglier, but at least the mistake is a lot harder to make.
Wouldn't it be better to read the whole struct from the network? Letting the user do all the socket operations seems like a bad idea to me (not encapsulated). Encapsulate the stuff you want to send over the network to operate on file descriptors, instead of letting the user put raw buffer data to the file descriptors.
I can imagine something like
void readHeader(int filedes, struct Record * Header);
so you can do something like this
struct Record
{
    uint16_t Header;
    uint16_t Content;
    uint16_t getHeader() const { return Header; }
    uint16_t getContent() const { return Content; }
};
/* socket stuff to get filedes */
struct Record x;
readHeader(fd, &x);
x.getContent();
You can't read from a buffer with type safety unless the buffer contains information about its content. One simple method is to add a length to each structure and check that the data being read at least has the same length. You could also use XML or ASN.1 or something similar where type information is provided. Of course, I'm assuming that you also write to that buffer.

Allocate chunk of memory for array of structs

I need an array of this struct allocated in one solid chunk of memory. The lengths of "char *extension" and "char *type" are not known at compile time.
struct MIMETYPE
{
char *extension;
char *type;
};
If I used the "new" operator to initialize each element by itself, the memory may be scattered. This is how I tried to allocate a single contiguous block of memory for it:
//numTypes = total elements of array
//maxExtension and maxType are the needed lengths for the (char*) in the struct
//std::string ext, type;
unsigned int size = (maxExtension+1 + maxType+1) * numTypes;
mimeTypes = (MIMETYPE*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, size);
But, when I try to load the data in like this, the data is all out of order and scattered when I try to access it later.
for(unsigned int i = 0; i < numTypes; i++)
{
    //get data from file
    getline(fin, line);
    stringstream parser(line);
    parser >> ext >> type;
    //point the pointers at a spot in the memory that I allocated
    mimeTypes[i].extension = (char*)(&mimeTypes[i]);
    mimeTypes[i].type = (char*)((&mimeTypes[i]) + maxExtension);
    //copy the data into the elements
    strcpy(mimeTypes[i].extension, ext.c_str());
    strcpy(mimeTypes[i].type, type.c_str());
}
can anyone help me out?
EDIT:
unsigned int size = (maxExtension+1 + maxType+1);
mimeTypes = (MIMETYPE*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, size * numTypes);
for(unsigned int i = 0; i < numTypes; i++)
{
    strcpy((char*)(mimeTypes + (i*size)), ext.c_str());
    strcpy((char*)(mimeTypes + (i*size) + (maxExtension+1)), type.c_str());
}
You are mixing two allocations:
1) managing the array of MIMETYPE structs, and
2) managing the arrays of characters.
Maybe (I don't really understand your objectives):
struct MIMETYPE
{
char extension[MAX_EXTENSION]; // MAX_EXTENSION a compile-time constant
char type[MAX_TYPE];           // MAX_TYPE a compile-time constant
};
would be better; it allows you to allocate the items linearly in the form:
new MIMETYPE[numTypes];
I'll put aside the point that this is premature optimization (and that you ought to just use std::string, std::vector, etc), since others have already stated that.
The fundamental problem I'm seeing is that you're using the same memory for both the MIMETYPE structs and the strings that they'll point to. No matter how you allocate it, a pointer itself and the data it points to cannot occupy the exact same place in memory.
Let's say you needed an array of 3 types and had MIMETYPE* mimeTypes pointing to the memory you allocated for them.
That means you're treating that memory as if it contains:
8 bytes: mime type 0
8 bytes: mime type 1
8 bytes: mime type 2
Now, consider what you're doing in this next line of code:
mimeTypes[i].extension = (char*)(&mimeTypes[i]);
extension is being set to point to the same location in memory as the MIMETYPE struct itself. That is not going to work. When subsequent code writes to the location that extension points to, it overwrites the MIMETYPE structs.
Similarly, this code:
strcpy((char*)(mimeTypes + (i*size)), ext.c_str());
is writing the string data into the same memory that you otherwise want the MIMETYPE structs to occupy.
If you really want to store everything in one contiguous block of memory, then doing so is a bit more complicated. You would need to allocate a block of memory that contains the MIMETYPE array at the start of it, and the string data afterwards.
As an example, let's say you need 3 types. Let's also say the max length for an extension string (maxExtension) is 3 and the max length for a type string (maxType) is 10. In this case, your block of memory needs to be laid out as:
8 bytes: mime type 0
8 bytes: mime type 1
8 bytes: mime type 2
4 bytes: extension string 0
11 bytes: type string 0
4 bytes: extension string 1
11 bytes: type string 1
4 bytes: extension string 2
11 bytes: type string 2
So to allocate, set up, and fill it all correctly, you would want to do something like:
unsigned int mimeTypeStringsSize = (maxExtension+1 + maxType+1);
unsigned int totalSize = (sizeof(MIMETYPE) + mimeTypeStringsSize) * numTypes;
char* data = (char*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, totalSize);
MIMETYPE* mimeTypes = (MIMETYPE*)data;
char* stringData = data + (sizeof(MIMETYPE) * numTypes);
for(unsigned int i = 0; i < numTypes; i++)
{
    //get data from file
    getline(fin, line);
    stringstream parser(line);
    parser >> ext >> type;
    // set pointers to proper locations
    mimeTypes[i].extension = stringData + (mimeTypeStringsSize * i);
    mimeTypes[i].type = stringData + (mimeTypeStringsSize * i) + maxExtension+1;
    //copy the data into the elements
    strcpy(mimeTypes[i].extension, ext.c_str());
    strcpy(mimeTypes[i].type, type.c_str());
}
(Note: I've based my byte layout explanations on typical behavior of 32-bit code. 64-bit code would have more space used for the pointers, but the principle is the same. Furthermore, the actual code I've written here should work regardless of 32/64-bit differences.)
What you need to do is get a garbage collector and manage the heap. A simple collector using RAII for object destruction is not that difficult to write. That way, you can simply allocate off the collector and know that it's going to be contiguous. However, you should really, REALLY profile before determining that this is a serious problem for you. When that happens, you can typedef many std types like string and stringstream to use your custom allocator, meaning that you can go back to just std::string instead of the C-style string horrors you have there.
You really have to know the lengths of extension and type in order to allocate MIMETYPEs contiguously (if "contiguously" means that extension and type are actually allocated within the object). Since you say those lengths are not known at compile time, you cannot do this with an array or a vector (the overall length of a vector can be set and changed at runtime, but the size of the individual elements must be known at compile time, and you can't know that size without knowing the lengths of extension and type).
I would personally recommend using a vector of MIMETYPEs, and making the extension and type fields both strings. Your requirements sound suspiciously like premature optimization guided by a gut feeling that dereferencing pointers is slow, especially if the pointers cause cache misses. I wouldn't worry about that until you have actual evidence that reading these fields is a bottleneck.
However, I can think of a possible "solution": you can allocate the extension and type strings inside the MIMETYPE object when they are shorter than a particular threshold and allocate them dynamically otherwise:
#include <algorithm>
#include <cstring>
#include <new>
template<size_t Threshold> class Kinda_contig_string {
    char contiguous_buffer[Threshold];
    char* value;
public:
    Kinda_contig_string() : value(NULL) { }
    Kinda_contig_string(const char* s)
    {
        size_t length = std::strlen(s);
        if (length < Threshold) {
            value = contiguous_buffer;
        }
        else {
            value = new char[length + 1];
        }
        std::strcpy(value, s);
    }
    void set(const char* s)
    {
        size_t length = std::strlen(s);
        if (length < Threshold && value == contiguous_buffer) {
            // simple case, both old and new string fit in contiguous_buffer
            // and value points to contiguous_buffer
            std::strcpy(contiguous_buffer, s);
            return;
        }
        if (length >= Threshold && value == contiguous_buffer) {
            // old string fit in contiguous_buffer, new string does not
            value = new char[length + 1];
            std::strcpy(value, s);
            return;
        }
        if (length < Threshold && value != contiguous_buffer) {
            // old string did not fit in contiguous_buffer, but new string does
            std::strcpy(contiguous_buffer, s);
            delete[] value;
            value = contiguous_buffer;
            return;
        }
        // old and new strings both too long to fit in contiguous_buffer
        // provide strong exception guarantee
        char* temp_buffer = new char[length + 1];
        std::strcpy(temp_buffer, s);
        std::swap(temp_buffer, value);
        delete[] temp_buffer;
        return;
    }
    const char* get() const
    {
        return value;
    }
};

class MIMETYPE {
    Kinda_contig_string<16> extension;
    Kinda_contig_string<64> type;
public:
    const char* get_extension() const
    {
        return extension.get();
    }
    const char* get_type() const
    {
        return type.get();
    }
    void set_extension(const char* e)
    {
        extension.set(e);
    }
    // t must be NULL terminated
    void set_type(const char* t)
    {
        type.set(t);
    }
    MIMETYPE() : extension(), type() { }
    MIMETYPE(const char* e, const char* t) : extension(e), type(t) { }
};
I really can't endorse this without feeling guilty.
Add one byte in between the strings... extension and type are not \0-terminated the way you do it.
here you allocate allowing for an extra \0 - OK
unsigned int size = (maxExtension+1 + maxType+1) * numTypes;
mimeTypes = (MIMETYPE*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, size);
here you don't leave any room for extension's ending \0 (if string len == maxExtension)
//point the pointers at a spot in the memory that I allocated
mimeTypes[i].extension = (char*)(&mimeTypes[i]);
mimeTypes[i].type = (char*)((&mimeTypes[i]) + maxExtension);
Instead, I think it should be
mimeTypes[i].type = (char*)((&mimeTypes[i]) + maxExtension + 1);