Cross Platform Custom File Header in C/C++

I'm currently working on a project that encrypts files and adds them to the application's library. I need to version the file format, so I'm planning to prepend a file header to the encrypted file. The project is in Qt and currently targets Windows. Later I will build the app for Android and Mac as well.
For this I made these structures for the version 1 file.
struct Header_Meta
{
char signature [4];
char version [4];
};
struct Header_v1
{
char id [12];
char flag [8];
char name [128];
long size;
};
union File_v1
{
Header_Meta meta;
Header_v1 header;
byte null [512 - sizeof (Header_Meta) - sizeof (Header_v1)];
byte data [MAX_HEADERv1];
};
The file is a binary file.
Now in the getDetails() function, I'll read the MAX_HEADERv1 bytes to file_v1.data and will get the details in the member variables.
My questions are
Is there a better approach?
Is there any problem writing the long size of Header_v1 to file, in cases of platform differences?
The logic should work the same way in all devices with file from another platform. Will this hold?

There is a slight possibility that you will end up with a lot of #ifdef BIG/LITTLE_ENDIAN's in the code, depending on the platforms you deploy to. Instead of long size I would use unsigned char size[8] (this yields a 64 (=8*8) bit value) and then rebuild the value in your code with a formula like:
uint64_t real_size = (uint64_t)size[0] | ((uint64_t)size[1] << 8) | ((uint64_t)size[2] << 16) | ....
and when calculating the individual size bytes you could do it like:
size[0] = real_size & 0xFF;
size[1] = (real_size & 0xFF00) >> 8;
size[2] = (real_size & 0xFF0000) >> 16;
and so on...
and from this point on you just need to worry about correctly writing out the bytes of size to their corresponding position.
Regarding the version string you want to add to the header (char version[4]), it all depends on what you want to store there. If you put textual information there (such as "v1.0"), you limit the range of versions you can represent, so I would again recommend a binary version, such as:
version[0] = VERSION // customers usually pay for an increase in this
version[1] = RELEASE // new functionality, it's up to you if customer pays or not :)
version[2] = MAINTENANCE // planned maintenance, usually customers don't pay for this
version[3] = PATCH // emergency patch, hopefully you never have to use this
This will allow for version numbers in the form of VERSION.RELEASE.MAINTENANCE.PATCH and you can go up to 255.255.255.255
Also, please pay attention to @Ben's comment: the union just feels wrong. Usually these fields should come one after the other, but with the union they will all overlap each other, starting at the same location.
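A minimal sketch of the byte-by-byte approach described above, assuming the header stores the size least significant byte first. The helper names put_u64_le and get_u64_le are hypothetical and not part of the original code. Writing each field through helpers like these also avoids depending on struct padding or on the width of long, both of which differ between platforms.

```cpp
#include <cstddef>
#include <cstdint>

// Store a 64-bit value as 8 bytes, least significant byte first.
// The on-disk order is fixed here, whatever byte order the host CPU uses.
inline void put_u64_le(unsigned char out[8], std::uint64_t value)
{
    for (std::size_t i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
}

// Rebuild the 64-bit value from the same 8 bytes.
inline std::uint64_t get_u64_le(const unsigned char in[8])
{
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i)
        value |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    return value;
}
```

The shifts operate on the value rather than on its memory representation, so no #ifdef for endianness is needed on either side.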

Related

QJsonDocument::fromRawData(const char *data, int size) data has to be aligned to a 4 byte boundary

void test()
{
QFile f("..\\data\\NAVHistory2.txt");
if (!f.open(QFile::ReadOnly))
{
return;
}
QByteArray data = f.readAll();
int iLeft = data.indexOf('[');
int iRight = data.lastIndexOf(']');
QJsonDocument::fromRawData(data.data() + iLeft, iRight - iLeft + 1);// got error
}
I want to cut a part out of a QByteArray and pass it to a QJsonDocument. The simplest way is to use QByteArray::mid and create a new copy of the QByteArray; QJsonDocument::fromJson(QByteArray) then works well.
However, I only need to cut a small part of the data away, so creating a new QByteArray would cost performance. There seems to be a better way, QJsonDocument::fromRawData(char*), but I get an error:
QJsonDocument::fromRawData: data has to have 4 byte alignment
I looked this up in the Qt documentation. It says data has to be aligned to a 4 byte boundary.
My application is an x64 project, so the char* is on an 8-byte boundary. How do I get around this?
I see two options:
Just take the copy. Quick and easy.
If you don't need anything else from data, just use data.remove(0, iLeft) to make your JSON snippet start at the beginning of the QByteArray (which will be aligned to at least 4 bytes).
According to the Qt documentation:
It assumes data contains a binary encoded JSON document.
The data has to be binary encoded. My document is just normal text, so it doesn't work. I didn't notice that before.
It seems I have to use QJsonDocument::fromJson.

How to optimize c++ binary file reading?

I have a complex interpreter reading in commands from (sometimes) multiple files (the exact details are out of scope), and it requires iterating over these files (some could be GBs in size, preventing nice buffering) multiple times.
I am looking to increase the speed of reading in each command from a file.
I have used the RDTSC (time-stamp counter) instruction to micro-benchmark the code enough to know that more than 80% of the time is spent reading in from the files.
Here is the thing: the program that generates the input file is literally faster than my small interpreter is at reading the file in, i.e. instead of outputting the file I could (in theory) just link the generator of the data to the interpreter and skip the file, but that shouldn't be faster, right?
What am I doing wrong? Or is writing supposed to be 2x to 3x (at least) faster than reading from a file?
I have considered mmap but some of the results on http://lemire.me/blog/archives/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ appear to indicate it is no faster than ifstream. or would mmap help in this case?
details:
I have (so far) tried adding a buffer, tweaking parameters, and removing the ifstream buffer (that slowed it down by 6x in my test case). I am currently at a loss for ideas after searching around.
The important section of the code is below. It does the following:
if data is left in the buffer, copy from the buffer to memblock (where it is then used)
if no data is left in the buffer, check how much data is left in the file; if more than the size of the buffer, read a buffer-sized chunk
if less than the size of the buffer, read whatever is left in the file
//if data in buffer
if(leftInBuffer[activefile] > 0)
{
//cout <<bufferloc[activefile] <<"\n";
memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
bufferloc[activefile]+=16;
leftInBuffer[activefile]-=16;
}
else //buffers blank
{
//read in block
long blockleft = (cfilemax -cfileplace) / 16 ;
int read=0;
/* slow block starts here */
if(blockleft >= MAXBUFELEMENTS)
{
currentFile->read((char *)(&(buffer[activefile][0])),16*MAXBUFELEMENTS);
leftInBuffer[activefile] = 16*MAXBUFELEMENTS;
bufferloc[activefile]=0;
read =16*MAXBUFELEMENTS;
}
else //read in part of the block
{
currentFile->read((char *)(&(buffer[activefile][0])),16*(blockleft));
leftInBuffer[activefile] = 16*blockleft;
bufferloc[activefile]=0;
read =16*blockleft;
}
/* slow block ends here */
memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
bufferloc[activefile]+=16;
leftInBuffer[activefile]-=16;
}
edit: this is on a mac, osx 10.9.5, with an i7 with a SSD
Solution:
as was suggested below, mmap was able to increase the speed by about 10x.
(for anyone else who searches for this)
specifically open with:
// needs <fcntl.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>
// returns nullptr on failure
uint8_t * openMMap(string name, long & size)
{
int m_fd;
struct stat statbuf;
uint8_t * m_ptr_begin;
if ((m_fd = open(name.c_str(), O_RDONLY)) < 0)
{
perror("can't open file for reading");
return nullptr;
}
if (fstat(m_fd, &statbuf) < 0)
{
perror("fstat in openMMap failed");
close(m_fd);
return nullptr;
}
if ((m_ptr_begin = (uint8_t *)mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, m_fd, 0)) == MAP_FAILED)
{
perror("mmap in openMMap failed");
close(m_fd);
return nullptr;
}
close(m_fd); // the mapping stays valid after the descriptor is closed
size = statbuf.st_size;
return m_ptr_begin;
}
read by:
uint8_t * mmfile = openMMap("my_file", length);
uint32_t * memblockmm;
memblockmm = (uint32_t *)mmfile; //cast file to uint32 array
uint32_t data = memblockmm[0]; //take int
mmfile +=4; //increment by 4 as I read a 32 bit entry and each entry in mmfile is 8 bits.
This should be a comment, but I don't have 50 reputation to make a comment.
What is the value of MAXBUFELEMENTS? In my experience, many smaller reads are far slower than one larger read. I suggest reading the entire file in if possible. Some files could be GBs, but even reading 100 MB at once would perform better than reading 1 MB 100 times.
If that's still not good enough, the next thing you can try is to compress (zlib) the input files (you may have to break them into chunks due to size) and decompress them in memory. When I/O is the bottleneck, this can be faster than reading in uncompressed files.
As @Tony Jiang said, try experimenting with the buffer size to see if that helps.
Try mmap to see if that helps.
I assume that currentFile is a std::ifstream? There's going to be some overhead for using iostreams (for example, an istream will do its own buffering, adding an extra layer to what you're doing); although I wouldn't expect the overhead to be huge, you can test by using open(2) and read(2) directly.
You should be able to run your code through dtruss -e to verify how long the read system calls take. If those take the bulk of your time, then you're hitting OS and hardware limits, so you can address that by piping, mmap'ing, or adjusting your buffer size. If those take less time than you expect, then look for problems in your application logic (unnecessary work on each iteration, etc.).
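The advice about larger reads can be tried with a sketch like the following, which reads fixed 16-byte records in big chunks and passes each one to a callback. The function name for_each_record and its parameters are hypothetical; the 16-byte record size is taken from the question's code. It is only a starting point for measuring different chunk sizes, not a tuned implementation.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read a file of fixed 16-byte records in large chunks and hand each
// complete record to a callback. Returns the number of records processed.
template <typename Callback>
std::size_t for_each_record(const std::string& path,
                            std::size_t chunk_records,
                            Callback on_record)
{
    constexpr std::size_t kRecordSize = 16;
    if (chunk_records == 0)
        return 0;

    std::ifstream in(path, std::ios::binary);
    if (!in)
        return 0;

    std::vector<char> chunk(kRecordSize * chunk_records);
    std::size_t count = 0;
    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        // A trailing partial record at the end of the file is ignored.
        for (std::size_t off = 0; off + kRecordSize <= got; off += kRecordSize) {
            on_record(chunk.data() + off);
            ++count;
        }
    }
    return count;
}
```

Calling it with different values of chunk_records (for example 64 versus 65536) gives a quick way to see how much the read size matters on a given machine before moving on to mmap.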

How to read big endian integers from file in C++? [duplicate]

Say I have a binary file; it contains positive binary numbers, but written in big endian as 32-bit integers
How do I read this file? I have this right now.
int main() {
FILE * fp;
char buffer[4];
int num = 0;
fp=fopen("file.bin","rb");
while ( fread(&buffer, 1, 4,fp) != 0) {
// I think buffer should be 32 bit integer I read,
// how can I let num equal to 32 bit big endian integer?
}
return 0;
}
Declare your buffer as:
unsigned char buffer[4];
and you may use this to assemble the big-endian value:
int num = (int)buffer[0]<<24 | (int)buffer[1]<<16 | (int)buffer[2]<<8 | (int)buffer[3];
BTW
Because the shifts work on values rather than on memory layout, this gives the same result on little-endian hosts (such as x86) and on big-endian hosts. If your platform's endianness happens to match the file's, you could also read directly into your int without any conversion.
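For file data stored big-endian, as in the question, the decoding can be wrapped in a small helper. This is a sketch; the name read_be32 is hypothetical. It uses an unsigned result so that a first byte of 0x80 or above does not run into signed overflow.

```cpp
#include <cstdint>

// Combine four bytes stored most significant byte first (big-endian)
// into a 32-bit value. The shifts act on values, not on memory, so the
// result is the same on little-endian and big-endian hosts.
inline std::uint32_t read_be32(const unsigned char b[4])
{
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8)  |
            static_cast<std::uint32_t>(b[3]);
}
```

Inside the question's loop this would be used as read_be32(buffer) after each successful fread of 4 bytes, with buffer declared as unsigned char.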
You need to find out your endianness first:
How can I find Endian-ness of my PC programmatically using C?
Then you need to act accordingly. If your endianness is the same as the file's, you can read the value as is; if it is different, you need to reorder the bytes:
union Num
{
char buffer[4];
int num;
} num ;
void swapChars(char* pChar1, char* pChar2)
{
char temp = *pChar1;
*pChar1 = *pChar2;
*pChar2 = temp;
}
int swapOrder(Num num)
{
swapChars( &num.buffer[0], &num.buffer[3]);
swapChars( &num.buffer[1], &num.buffer[2]);
return num.num;
}
// stop as soon as a full 4-byte value can no longer be read
while ( fread(num.buffer, 1, 4, fp) == 4)
{
int convertedNum;
if (1 == amIBigEndian) // amIBigEndian: result of the endianness check linked above
{
convertedNum = num.num;
}
else
{
convertedNum = swapOrder(num);
}
// Do whatever you want with convertedNum here...
}
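The amIBigEndian flag used above is not defined in the answer. One way to obtain it at run time, together with a byte swap that does not go through a union, is sketched below. The names host_is_big_endian and byte_swap32 are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when the host stores the most significant byte at the
// lowest address (big-endian).
inline bool host_is_big_endian()
{
    const std::uint32_t probe = 0x01020304u;
    unsigned char first;
    std::memcpy(&first, &probe, 1);   // look at the byte at the lowest address
    return first == 0x01;
}

// Reverse the byte order of a 32-bit value.
inline std::uint32_t byte_swap32(std::uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}
```

A value read raw from a big-endian file would then be passed through byte_swap32 only when host_is_big_endian() returns false.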
It is operating system and processor architecture specific.
You might perhaps use routines like htonl(3) or ntohl etc...
but you really should have serialized in a well defined format.
On current machines (where I/O is very slow, w.r.t. CPU speed) I am in favor of using textual serialization formats like JSON, YAML, .... But you could also use binary serialization (and libraries) like BSON, XDR, ASN.1 or the s11n library....
If possible, improve the producer code (the one writing your file.bin file), and the consumer code accordingly.
Binary data is inherently brittle, because it is system and architecture specific. At the very least, document extremely well its format, and preferably give some tools to convert it from and to textual formats.
There are several JSON libraries for C++, like jsoncpp and rapidjson and for C like jansson etc etc...

Endian-ness in a char array containing binary characters

I'm building some code to read a RIFF wav file and I've bumped into something odd.
The first 4 bytes of the file header are the word RIFF in big-endian ascii coding:
0x5249 0x4646
I read this first element using:
char *fileID = new char[4];
filestream.read(fileID,4);
When I write this to screen the results are as expected:
std::cout << fileID << std::endl;
>> RIFF
Now, the next 4 bytes give the size of the file, but crucially they're little-endian.
So, I write a little function to flip the bytes, based on a union:
int flip4bytes(char* input){
union {int flip_int; char flip_char[4];} flip; // declare an object, not just a type
flip.flip_char[0] = input[3];
flip.flip_char[1] = input[2];
flip.flip_char[2] = input[1];
flip.flip_char[3] = input[0];
return flip.flip_int;
}
This looks good to me, except when I call it, the value returned is totally wrong. Interestingly, the following code (where the bytes are not reversed!) works correctly:
int flip4bytes(char* input){
union {int flip_int; char flip_char[4];} flip; // declare an object, not just a type
flip.flip_char[0] = input[0];
flip.flip_char[1] = input[1];
flip.flip_char[2] = input[2];
flip.flip_char[3] = input[3];
return flip.flip_int;
}
This has thoroughly confused me. Is the union somehow reversing the bytes for me?! If not, how are the bytes being converted to int correctly without being reversed?
I think there's some facet of endian-ness here that I'm ignorant of.
You are simply on a little-endian machine, and the "RIFF" string is just a string and thus neither little- nor big-endian, but just a sequence of chars. You don't need to reverse the bytes on a little-endian machine, but you do need to on a big-endian one.
You need to figure out the endianness of your machine. #include <sys/param.h> will help you do that.
You could also use the fact that network byte order is big-endian. In that case, convert the value to big-endian and use the ntohl function (ntohs is the 16-bit variant). That should work on any machine that you compile the code on.
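An alternative that avoids checking the machine's endianness at all is to assemble the little-endian size field with shifts. The sketch below parses the first 8 bytes of a RIFF file, as described in the question; the function name parse_riff_header is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Parse the first 8 bytes of a RIFF file: the 4-character "RIFF" tag
// followed by a 32-bit little-endian chunk size.
// Returns false if the tag is missing.
inline bool parse_riff_header(const unsigned char header[8],
                              std::uint32_t& chunk_size)
{
    if (std::memcmp(header, "RIFF", 4) != 0)
        return false;
    // Least significant byte comes first in the file.
    chunk_size =  static_cast<std::uint32_t>(header[4])        |
                 (static_cast<std::uint32_t>(header[5]) << 8)  |
                 (static_cast<std::uint32_t>(header[6]) << 16) |
                 (static_cast<std::uint32_t>(header[7]) << 24);
    return true;
}
```

Because the tag is compared as a byte sequence and the size is built from individual bytes, the same code gives the same answer on little-endian and big-endian hosts.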

How to send float over serial

What's the best way to send float, double, and int16 over serial on Arduino?
The Serial.print() only sends values as ASCII encoded. But I want to send the values as bytes. Serial.write() accepts byte and bytearrays, but what's the best way to convert the values to bytes?
I tried to cast an int16 to a byte*, without luck. I also used memcpy, but that uses too many CPU cycles. Arduino uses plain C/C++. It's an ATmega328 microcontroller.
hm. How about this:
void send_float (float arg)
{
// get access to the float as a byte-array:
byte * data = (byte *) &arg;
// write the data to the serial
Serial.write (data, sizeof (arg));
}
Yes, to send these numbers you have to first convert them to ASCII strings. If you are working with C, sprintf() is, IMO, the handiest way to do this conversion:
[Added later: AAAGHH! I forgot that for ints/longs, the function's input argument wants to be unsigned. Likewise for the format string handed to sprintf(). So I changed it below. Sorry about my terrible oversight, which would have been a hard-to-find bug. Also, ulong makes it a little more general.]
char *
int2str( unsigned long num ) {
static char retnum[21]; // Enough for 20 digits plus NUL from a 64-bit uint.
sprintf( retnum, "%lu", num ); // %lu is the conversion for unsigned long
return retnum;
}
And similar for floats and doubles. The code doing the conversion has to know in advance what kind of entity it is converting, so you might end up with functions char *float2str( float float_num) and char *dbl2str( double dblnum).
You'll get a NUL-terminated left-adjusted (no leading blanks or zeroes) character string out of the conversion.
You can do the conversion anywhere/anyhow you like; these functions are just illustrations.
Use the Firmata protocol. Quote:
Firmata is a generic protocol for communicating with microcontrollers
from software on a host computer. It is intended to work with any host
computer software package. Right now there is a matching object in a
number of languages. It is easy to add objects for other software to
use this protocol. Basically, this firmware establishes a protocol for
talking to the Arduino from the host software. The aim is to allow
people to completely control the Arduino from software on the host
computer.
The jargon word you need to look up is "serialization".
It is an interesting problem over a serial connection which might have restrictions on what characters can go end to end, and might not be able to pass eight bits per character either.
Restrictions on certain character codes are fairly common. Here's a few off the cuff:
If software flow control is in use, then conventionally the control characters DC1 and DC3 (Ctrl-Q and Ctrl-S, also sometimes called XON and XOFF) cannot be transmitted as data because they are sent to start and stop the sender at the other end of the cable.
On some devices, NUL and/or DEL characters (0x00 and 0x7F) may simply vanish from the receiver's FIFO.
If the receiver is a Unix tty, and the termio modes are not set correctly, then the character Ctrl-D (EOT or 0x04) can cause the tty driver to signal an end-of-file to the process that has the tty open.
A serial connection is usually configurable for byte width and possible inclusion of a parity bit. Some connections will require that a 7-bit byte with parity is used, rather than an 8-bit byte. It is even possible, for connection to (seriously old) legacy hardware, to configure many serial ports for 5-bit and 6-bit bytes. If fewer than 8 bits are available per byte, then a more complicated protocol is required to handle binary data.
ASCII85 is a popular technique for working around both 7-bit data and restrictions on control characters. It is a convention for re-writing binary data using only 85 carefully chosen ASCII character codes.
In addition, you certainly have to worry about byte order between sender and receiver. You might also have to worry about floating point format, since not every system uses IEEE-754 floating point.
The bottom line is that often enough choosing a pure ASCII protocol is the better answer. It has the advantage that it can be understood by a human, and is much more resistant to issues with the serial connection. Unless you are sending gobs of floating point data, then inefficiency of representation may be outweighed by ease of implementation.
Just be liberal in what you accept, and conservative about what you emit.
Does size matter? If it does, you can encode each 32 bit group into 5 ASCII characters using ASCII85, see http://en.wikipedia.org/wiki/Ascii85.
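The core of the Ascii85 encoding mentioned above is a base-85 conversion of each 32-bit group. The following is a sketch of that single step only; the function name ascii85_group is hypothetical, and framing, the 'z' shortcut for all-zero groups, and padding of a final short group are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one 32-bit group as five Ascii85 characters in the range
// '!' (value 0) through 'u' (value 84), most significant digit first.
inline std::string ascii85_group(std::uint32_t group)
{
    std::string out(5, '!');
    for (int i = 4; i >= 0; --i) {
        out[static_cast<std::size_t>(i)] = static_cast<char>('!' + group % 85);
        group /= 85;
    }
    return out;
}
```

For example, the four bytes of the text "Man " taken as the big-endian value 0x4D616E20 encode to "9jqo^", which matches the example in the Wikipedia article linked above.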
This simply works. Use Serial.println() function
void setup() {
Serial.begin(9600);
}
void loop() {
float x = 23.45585888;
Serial.println(x, 10);
delay(1000);
}
And this is the output:
Perhaps this is the best way to convert a float to bytes and bytes back to a float. -Hamid Reza
int breakDown(int index, unsigned char outbox[], float member)
{
unsigned long d = *(unsigned long *)&member;
outbox[index] = d & 0x00FF;
index++;
outbox[index] = (d & 0xFF00) >> 8;
index++;
outbox[index] = (d & 0xFF0000) >> 16;
index++;
outbox[index] = (d & 0xFF000000) >> 24;
index++;
return index;
}
float buildUp(int index, unsigned char outbox[])
{
unsigned long d;
d = (outbox[index+3] << 24) | (outbox[index+2] << 16)
| (outbox[index+1] << 8) | (outbox[index]);
float member = *(float *)&d;
return member;
}
regards.
Structures and unions solve that issue. Use a packed structure with a byte-sized union matching the structure. Overlap the pointers to the structure and union (or add the union in the structure). Use Serial.write to send the stream. Have a matching structure/union on the receiving end. As long as the byte order matches there is no issue; otherwise you can convert using the C htons/htonl family of functions. Add "header" info so the receiver can decode different structures/unions.
For Arduino IDE:
float buildUp(int index, unsigned char outbox[])
{
unsigned long d;
d = (long(outbox[index +3]) << 24) | \
(long(outbox[index +2]) << 16) | \
(long(outbox[index +1]) << 8) | \
(long(outbox[index]));
float member = *(float *)&d;
return member;
}
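Without the casts to long it does not work: on the ATmega328 an int is 16 bits, so an unsigned char that is promoted to int and shifted left by 16 or 24 loses its bits.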
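The pointer casts in the snippets above (for example *(float *)&d) are a common idiom, but they rely on behavior the C++ standard does not guarantee. A sketch of the same round trip using memcpy and a fixed-width integer follows. The names float_to_bytes and bytes_to_float are hypothetical, and the code assumes float is a 32-bit IEEE-754 type on both sender and receiver. For a fixed 4-byte copy, compilers commonly reduce the memcpy to a plain register move, though that is worth checking on the target.

```cpp
#include <cstdint>
#include <cstring>

// Write a float into 4 bytes, least significant byte first.
inline void float_to_bytes(float value, unsigned char out[4])
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the bit pattern
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<unsigned char>((bits >> (8 * i)) & 0xFF);
}

// Rebuild the float from the same 4 bytes.
inline float bytes_to_float(const unsigned char in[4])
{
    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= static_cast<std::uint32_t>(in[i]) << (8 * i);
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

On the Arduino side the resulting 4-byte array can be handed to Serial.write as in the first answer, and the receiver applies bytes_to_float to the bytes in the same order.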