Parsing Message with Varying Fields - c++

I have a byte stream that represents a message in my application. For demonstration, the message has 5 fields. The first byte in the stream is a mask that indicates which fields are present in the current message. For instance, 0x2 in byte 0 means that only Field-1 is present.
The mask can take 2^5=32 different values. To parse this variable-width message, I wrote the example structure and parser below. My question is: is there any other way to parse such dynamically changing fields? If the message had 64 fields, I would have to write 64 cases, which is cumbersome.
#include <cstdint>
#include <cstring>
#include <iostream>
typedef struct
{
uint8_t iDummy0;
int iDummy1;
}__attribute__((packed, aligned(1)))Field4;
typedef struct
{
int iField0;
uint8_t ui8Field1;
short i16Field2;
long long i64Field3;
Field4 stField4;
}__attribute__((packed, aligned(1)))MessageStream;
char* constructIncomingMessage()
{
char* cpStream = new char[1+sizeof(MessageStream)]; // Demonstrative message byte array
// 1 byte for Mask, 20 bytes for messageStream
cpStream[0] = 0x1F; // the 0-th byte is a mask marking
// which fields are present for the messageStream
// all 5 fields are present for the example
return cpStream;
}
void deleteMessage( char* cpMessage)
{
delete[] cpMessage;
}
int main() {
MessageStream messageStream; // Local storage for messageStream
uint8_t ui8FieldMask; // Mask to indicate which fields of messageStream
// are present for the current incoming message
const uint8_t ui8BitIsolator = 0x01;
uint8_t ui8FieldPresent; // ANDed result of Mask and Isolator
std::size_t szParsedByteCount = 0; // Total number of parsed bytes
const std::size_t szMaxMessageFieldCount = 5; // There can be maximum 5 fields in
// the messageStream
char* cpMessageStream = constructIncomingMessage();
ui8FieldMask = (uint8_t)cpMessageStream[0];
szParsedByteCount += 1;
for(std::size_t i = 0; i<szMaxMessageFieldCount; ++i)
{
ui8FieldPresent = ui8FieldMask & ui8BitIsolator;
if(ui8FieldPresent)
{
switch(i)
{
case 0:
{
memcpy(&messageStream.iField0, cpMessageStream+szParsedByteCount, sizeof(messageStream.iField0));
szParsedByteCount += sizeof(messageStream.iField0);
break;
}
case 1:
{
memcpy(&messageStream.ui8Field1, cpMessageStream+szParsedByteCount, sizeof(messageStream.ui8Field1));
szParsedByteCount += sizeof(messageStream.ui8Field1);
break;
}
case 2:
{
memcpy(&messageStream.i16Field2, cpMessageStream+szParsedByteCount, sizeof(messageStream.i16Field2));
szParsedByteCount += sizeof(messageStream.i16Field2);
break;
}
case 3:
{
memcpy(&messageStream.i64Field3, cpMessageStream+szParsedByteCount, sizeof(messageStream.i64Field3));
szParsedByteCount += sizeof(messageStream.i64Field3);
break;
}
case 4:
{
memcpy(&messageStream.stField4, cpMessageStream+szParsedByteCount, sizeof(messageStream.stField4));
szParsedByteCount += sizeof(messageStream.stField4);
break;
}
default:
{
std::cerr << "Undefined Message field number: " << i << '\n';
break;
}
}
}
ui8FieldMask >>= 1; // shift the mask
}
deleteMessage(cpMessageStream);
return 0;
}

The first thing I'd change is to drop the __attribute__((packed, aligned(1))) on Field4. This is a hack to create structures which mirror a packed wire-format, but that's not the format you're dealing with anyway.
Next, I'd make MessageStream a std::tuple of std::optional<T> fields.
You now know that there are std::tuple_size<MessageStream> possible bits in the mask. Obviously you can't fit 64 bits in a ui8FieldMask but I'll assume that's a trivial problem to solve.
You can write a for-loop from 0 to std::tuple_size<MessageStream> to extract the bits from ui8FieldMask to see which bits are set. The slight problem with that logic is that you'll need compile-time constants I for std::get<size_t I>(MessageStream), and a for-loop only gives you run-time variables.
Hence, you'll need a recursive template <size_t I> extract(char const*& cpMessageStream, MessageStream&), and of course a specialization extract<0>. In extract<I>, you can use typename std::tuple_element<I, MessageStream>::type to get the std::optional<T> at the I'th position in your MessageStream.
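A minimal sketch of the tuple-of-optionals idea, assuming C++17. Instead of the hand-written recursion it uses a fold expression over std::index_sequence to supply the compile-time index I. The names extractField, extractAll and parseMessage are made up for illustration, Field4 is left out for brevity, and the bytes are assumed to be in host byte order with no bounds checking:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <tuple>
#include <utility>

// Hypothetical message layout: one std::optional per possible field.
using MessageStream = std::tuple<std::optional<std::int32_t>,
                                 std::optional<std::uint8_t>,
                                 std::optional<std::int16_t>,
                                 std::optional<std::int64_t>>;

// Read field I if bit I of the mask is set, advancing the cursor.
template <std::size_t I>
void extractField(std::uint64_t mask, const char*& cursor, MessageStream& msg)
{
    using Opt = std::tuple_element_t<I, MessageStream>;
    using T = typename Opt::value_type;
    if (mask & (std::uint64_t{1} << I)) {
        T value;
        std::memcpy(&value, cursor, sizeof(T)); // host byte order assumed
        cursor += sizeof(T);
        std::get<I>(msg) = value;
    } else {
        std::get<I>(msg).reset();
    }
}

// Expand extractField<0>, extractField<1>, ... in index order.
template <std::size_t... Is>
void extractAll(std::uint64_t mask, const char*& cursor, MessageStream& msg,
                std::index_sequence<Is...>)
{
    (extractField<Is>(mask, cursor, msg), ...);
}

// First byte is the mask, the present fields follow in index order.
// Returns the number of bytes consumed.
std::size_t parseMessage(const char* stream, MessageStream& msg)
{
    const char* cursor = stream;
    const std::uint64_t mask = static_cast<unsigned char>(*cursor++);
    extractAll(mask, cursor, msg,
               std::make_index_sequence<std::tuple_size_v<MessageStream>>{});
    return static_cast<std::size_t>(cursor - stream);
}
```

Adding a field then only means adding one more std::optional<T> to the tuple; no new case label is needed.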

Related

Using bitwise or operator in switch case

I have enum,
enum ENUM_MSG_TEXT_CHANGE {COLOR=0,SIZE,UNDERLINE};
void Func(int nChange)
{
bool bColor=false, bSize=false;
switch(nChange)
{
case COLOR:bColor=true;break;
case SIZE:bSize=true;break;
case COLOR|SIZE:bSize=true; bColor=true;break;
}
}
case SIZE: and case COLOR|SIZE: both give the value 1, so I get the error C2196: case value '1' already used. How can I differentiate these two cases in the switch?
Thanks
If you want to make a bitmask, every element of your enum has to correspond to a number that is a power of 2, so it has exactly 1 bit set. If you number them otherwise, it won't work. So, the first element should be 1, then 2, then 4, then 8, then 16, and so on, so you won't get overlaps when orring them. Also, you should just test every bit individually instead of using a switch:
if (nChange & COLOR) {
bColor = true;
}
if (nChange & SIZE) {
bSize = true;
}
These two labels
case SIZE:bSize=true;break;
case COLOR|SIZE:bSize=true; bColor=true;break;
both evaluate to 1, because SIZE is defined with the value 1 and, since COLOR is 0, the bitwise operator | in the label COLOR|SIZE also yields 1.
Usually such enumerations are declared as bitmask types like
enum ENUM_MSG_TEXT_CHANGE { COLOR = 1 << 0, SIZE = 1 << 1, UNDERLINE = 1 << 2 };
In this case this label
case COLOR|SIZE:bSize=true; bColor=true;break;
will be equal to 3.
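As a small sketch of that point (the Flags struct and the return value are hypothetical additions so the result can be inspected), the original switch compiles once the enumerators are distinct powers of two:

```cpp
enum ENUM_MSG_TEXT_CHANGE { COLOR = 1 << 0, SIZE = 1 << 1, UNDERLINE = 1 << 2 };

// The combined label no longer collides with SIZE (which is now 2).
static_assert((COLOR | SIZE) == 3, "COLOR|SIZE must be a distinct case value");

struct Flags { bool color; bool size; }; // hypothetical helper type

Flags Func(int nChange)
{
    Flags f{false, false};
    switch (nChange) {
    case COLOR:        f.color = true;                break;
    case SIZE:         f.size = true;                 break;
    case COLOR | SIZE: f.color = true; f.size = true; break;
    default:                                          break; // other combinations ignored here
    }
    return f;
}
```

A switch still needs one label per combination, so testing each bit individually scales better as more flags are added.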
When using the binary OR (operator|) you need to assign values to the individual bits (or combinations of bits). Since COLOR has the value 0 it can't be extracted from a bitfield like you try to do.
Also, for the three enums you have, there are 8 possible combinations. To use a switch, you'd need 8 case labels.
Consider this as an alternative:
#include <iostream>
enum ENUM_MSG_TEXT_CHANGE : unsigned {
COLOR = 1U << 0U, // 0b001
SIZE = 1U << 1U, // 0b010
UNDERLINE = 1U << 2U // 0b100
};
void Func(unsigned nChange) {
// extract the set bits
bool bColor = nChange & COLOR;
bool bSize = nChange & SIZE;
bool bUnderline = nChange & UNDERLINE;
// print out what was extracted
std::cout << bUnderline << bSize << bColor << '\n';
}
int main() {
// test all combinations
for(unsigned change = 0; change <= (COLOR | SIZE | UNDERLINE); ++change) {
Func(change);
}
}
Output:
000
001
010
011
100
101
110
111

LZW Decompression

I am implementing a LZW algorithm in C++.
The size of the dictionary is a user input, but the minimum is 256, so it should work with binary files. If it reaches the end of the dictionary it goes around to the index 0 and works up overwriting it from there.
For example, if I feed in an Alice in Wonderland script and compress it with a dictionary size of 512, I get this dictionary.
But I have a problem with decompression: the output dictionary from decompressing the compressed file looks like this.
My decompression code looks like this:
struct dictionary
{
vector<unsigned char> entry;
vector<bool> bits;
};
void decompress(dictionary dict[], vector<bool> file, int dictionarySize, int numberOfBits)
{
//in this example
//dictionarySize = 512, tells the max size of the dictionary, and goes back to 0 if it reaches 513
//numberOfBits = log2(512) = 9
//dictionary dict[] contains bits and strings (strings can be empty)
// dict[0] =
// entry = (unsigned char)0
// bits = (if numberOfBits = 9) 000000001
// dict[255] =
// entry = (unsigned char)255
// bits = (if numberOfBits = 9) 011111111
// so the next entry will be dict[next] (next is currently 256)
// dict[256] =
// entry = what gets added in the code below
// bits = 100000000
// all the bits are already set previously (dictionary size is int dictionarySize) so in this case all the bits from 0 to 511 are already set, entries are set from 0 to 255, so extended ASCII
vector<bool> currentCode;
vector<unsigned char> currentString;
vector<unsigned char> temp;
int next=256;
bool found=false;
for(int i=0;i<file.size();i+=numberOfBits)
{
for(int j=0;j<numberOfBits;j++)
{
currentCode.push_back(file[i+j]);
}
for(int j=0;j<dictionarySize;j++)
{
// when the currentCode (size numberOfBits) gets found in the dictionary
if(currentCode==dict[j].bits)
{
currentString = dict[j].entry;
// if the current string isnt empty, then it means it found the characted in the dictionary
if(!currentString.empty())
{
found = true;
}
}
}
//if the currentCode in the dictionary has a string value attached to it
if(found)
{
for(int j=0;j<currentString.size();j++)
{
cout<<currentString[j];
}
temp.push_back(currentString[0]);
// so it doesnt just push 1 character into the dictionary
// example, if first read character is 'r', it is already in the dictionary so it doesnt get added
if(temp.size()>1)
{
// if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
if(next>dictionarySize-1) //next > 512-1
{
next = 0;
}
dict[next].entry.clear();
dict[next].entry = temp;
next++;
}
//temp = currentString;
}
else
{
currentString = temp;
currentString.push_back(temp[0]);
for(int j=0;j<currentString.size();j++)
{
cout<<currentString[j];
}
// if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
if(next>dictionarySize-1)
{
next = 0;
}
dict[next].entry.clear();
dict[next].entry = currentString;
next++;
//break;
}
temp = currentString;
// currentCode gets cleared, and written into in the next iteration
currentCode.clear();
//cout<<endl;
found = false;
}
}
I am currently stuck and don't know what to change here to fix the output.
I have also noticed that if I choose a dictionary big enough that it never wraps around (it doesn't reach the end and start again at 0), it works.
start small
You are using files, which is too much data to debug. Start small with strings. I took this nice example from Wiki:
Input: "abacdacacadaad"
step  input           match  output  new_entry  new_index
                                     a          0
                                     b          1
                                     c          2
                                     d          3
1     abacdacacadaad  a      0       ab         4
2     bacdacacadaad   b      1       ba         5
3     acdacacadaad    a      0       ac         6
4     cdacacadaad     c      2       cd         7
5     dacacadaad      d      3       da         8
6     acacadaad       ac     6       aca        9
7     acadaad         aca    9       acad       10
8     daad            da     8       daa        11
9     ad              a      0       ad         12
10    d               d      3
Output: "0102369803"
So you can debug your code step by step with cross matching both input/output and dictionary contents. Once that is done correctly then you can do the same for decoding:
Input: "0102369803"
step  input  output  new_entry  new_index
                     a          0
                     b          1
                     c          2
                     d          3
1     0      a
2     1      b       ab         4
3     0      a       ba         5
4     2      c       ac         6
5     3      d       cd         7
6     6      ac      da         8
7     9      aca     aca        9
8     8      da      acad       10
9     0      a       daa        11
10    3      d       ad         12
Output: "abacdacacadaad"
Only then move to files and clear dictionary handling.
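The two traces above can be reproduced with a small standard C++ sketch. It is only meant for this kind of step-by-step checking: the function names lzwEncode and lzwDecode are made up, the dictionary is unbounded (no wrap-around or clearing), and the codes are kept as plain integers instead of a bitstream:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encode `raw`, starting from a dictionary of the single characters in `alphabet`.
std::vector<int> lzwEncode(const std::string& raw, const std::string& alphabet)
{
    std::vector<std::string> dict;
    for (char c : alphabet) dict.push_back(std::string(1, c));

    std::vector<int> codes;
    std::size_t i = 0;
    while (i < raw.size()) {
        // Longest dictionary entry that is a prefix of the remaining input.
        int best = -1;
        for (std::size_t j = 0; j < dict.size(); ++j) {
            if (raw.compare(i, dict[j].size(), dict[j]) == 0 &&
                (best < 0 || dict[j].size() > dict[best].size())) {
                best = static_cast<int>(j);
            }
        }
        if (best < 0) break; // character outside the alphabet: stop (no error handling)
        codes.push_back(best);
        i += dict[best].size();
        if (i < raw.size()) dict.push_back(dict[best] + raw[i]); // match + next character
    }
    return codes;
}

// Decode, rebuilding the same dictionary one step behind the encoder.
std::string lzwDecode(const std::vector<int>& codes, const std::string& alphabet)
{
    std::vector<std::string> dict;
    for (char c : alphabet) dict.push_back(std::string(1, c));

    std::string out, prev;
    for (int code : codes) {
        std::string cur;
        if (code >= 0 && static_cast<std::size_t>(code) < dict.size()) {
            cur = dict[code];
        } else if (static_cast<std::size_t>(code) == dict.size() && !prev.empty()) {
            cur = prev + prev[0]; // code not in the dictionary yet (step 7 above)
        } else {
            break; // invalid code: stop (no error handling)
        }
        if (!prev.empty()) dict.push_back(prev + cur[0]);
        out += cur;
        prev = cur;
    }
    return out;
}
```

Calling lzwEncode("abacdacacadaad", "abcd") yields the codes 0 1 0 2 3 6 9 8 0 3, and lzwDecode on those codes returns the original string, so each intermediate dictionary entry can be compared against the tables.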
bitstream
Once you have successfully done the LZW on a small alphabet, you can try the full alphabet and bit encoding. An LZW stream can be encoded at any bit length (not just 8/16/32/64 bits), which can greatly affect compression ratios (depending on the properties of the data). So I would try to implement universal access to the data at a variable (or predefined) bit length.
I was a bit curious, so I coded a simple C++/VCL example for the compression:
//---------------------------------------------------------------------------
// LZW
const int LZW_bits=12; // encoded bitstream size
const int LZW_size=1<<LZW_bits; // dictionary size
// bitstream R/W
DWORD bitstream_tmp=0;
//---------------------------------------------------------------------------
// return LZW_bits from dat[adr,bit] and increment position (adr,bit)
DWORD bitstream_read(BYTE *dat,int siz,int &adr,int &bit,int bits)
{
DWORD a=0,m=(1<<bits)-1;
// save tmp if enough bits
if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
for (;;)
{
// insert byte
bitstream_tmp<<=8;
bitstream_tmp&=0xFFFFFF00;
bitstream_tmp|=dat[adr]&255;
adr++; bit+=8;
// save tmp if enough bits
if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
// end of data
if (adr>=siz) return 0;
}
}
//---------------------------------------------------------------------------
// write LZW_bits from a to dat[adr,bit] and increment position (adr,bit)
// return true if buffer is full
bool bitstream_write(BYTE *dat,int siz,int &adr,int &bit,int bits,DWORD a)
{
a<<=32-bits; // align to MSB
// save tmp if aligned
if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp )&255; adr++; bit-=8; }
// process all bits of a
for (;bits;bits--)
{
// insert bit
bitstream_tmp<<=1;
bitstream_tmp&=0xFFFFFFFE;
bitstream_tmp|=(a>>31)&1;
a<<=1; bit++;
// save tmp if aligned
if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp )&255; adr++; bit-=8; }
}
return (adr>=siz);
}
//---------------------------------------------------------------------------
bool str_compare(char *s0,int l0,char *s1,int l1)
{
if (l1<l0) return false;
for (;l0;l0--,s0++,s1++)
if (*s0!=*s1) return false;
return true;
}
//---------------------------------------------------------------------------
AnsiString LZW_encode(AnsiString raw)
{
AnsiString lzw="";
int i,j,k,l;
int adr,bit;
DWORD a;
const int siz=32; // bitstream buffer
BYTE buf[siz];
AnsiString dict[LZW_size]; // dictionary
int dicts=0; // actual size of dictionary
// init dictionary
for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts); // full 8bit binary alphabet
// for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts); // test alphabet "a,b,c,d"
l=raw.Length();
adr=0; bit=0;
for (i=0;i<l;)
{
i&=i;
// find match in dictionary
for (j=dicts-1;j>=0;j--)
if (str_compare(dict[j].c_str(),dict[j].Length(),raw.c_str()+i,l-i))
{
i+=dict[j].Length();
if (i<l) // add new entry in dictionary (if not end of input)
{
// clear dictionary if full
if (dicts>=LZW_size) dicts=256; // full 8bit binary alphabet
// if (dicts>=LZW_size) dicts=4; // test alphabet "a,b,c,d"
else{
dict[dicts]=dict[j]+AnsiString(raw[i+1]); // AnsiString index starts from 1 hence the +1
dicts++;
}
}
a=j; j=-1; break; // full binary output
// a='0'+j; j=-1; break; // test ASCII output
}
// store result to bitstream
if (bitstream_write(buf,siz,adr,bit,LZW_bits,a))
{
// append buf to lzw
k=lzw.Length();
lzw.SetLength(k+adr);
for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
// reset buf
adr=0;
}
}
if (bit)
{
// store the remaining bits with zero padding
bitstream_write(buf,siz,adr,bit,LZW_bits-bit,0);
}
if (adr)
{
// append buf to lzw
k=lzw.Length();
lzw.SetLength(k+adr);
for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
}
return lzw;
}
//---------------------------------------------------------------------------
AnsiString LZW_decode(AnsiString lzw)
{
AnsiString raw="";
int adr,bit,siz,ix;
DWORD a;
AnsiString dict[LZW_size]; // dictionary
int dicts=0; // actual size of dictionary
// init dictionary
for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts); // full 8bit binary alphabet
// for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts); // test alphabet "a,b,c,d"
siz=lzw.Length();
adr=0; bit=0; ix=-1;
for (adr=0;(adr<siz)||(bit>=LZW_bits);)
{
a=bitstream_read(lzw.c_str(),siz,adr,bit,LZW_bits);
// a-='0'; // test ASCII input
// clear dictionary if full
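if (dicts>=LZW_size){ dicts=256; ix=-1; } // full 8bit binary alphabet
// if (dicts>=LZW_size){ dicts=4; ix=-1; } // test alphabet "a,b,c,d"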
// new dictionary entry
if (ix>=0)
{
if (a>=dicts){ dict[dicts]=dict[ix]+AnsiString(dict[ix][1]); dicts++; }
else { dict[dicts]=dict[ix]+AnsiString(dict[a ][1]); dicts++; }
} ix=a;
// update decoded output
raw+=dict[a];
}
return raw;
}
//---------------------------------------------------------------------------
and output using // test ASCII input lines:
txt="abacdacacadaad"
enc="0102369803"
dec="abacdacacadaad"
where AnsiString is the only VCL stuff I used. It is just a self-allocating string type; beware that its indexes start at 1.
AnsiString s;
s[5] // character access (1 is first character)
s.Length() // returns size
s.c_str() // returns char*
s.SetLength(size) // resize
So just use any string you got ...
In case you do not have BYTE,DWORD use unsigned char and unsigned int instead ...
It looks like it works for long texts too (bigger than the dictionary and/or bitstream buffer sizes). However, beware that the clearing can be done in a few different places in the code, but it must be synchronized between encoder and decoder, otherwise the data would be corrupted after clearing.
The example can use either just the "a,b,c,d" alphabet or the full 8-bit one. Currently it is set for 8-bit. If you want to change it, just un-rem the // test ASCII input lines and rem out the // full 8bit binary alphabet lines in the code.
To test crossing buffers and boundary you can play with:
const int LZW_bits=12; // encoded bitstream size
const int LZW_size=1<<LZW_bits; // dictionary size
and also with:
const int siz=32; // bitstream buffer
constants. They also affect performance, so tweak them to your liking.
Beware that bitstream_write is not optimized and can be sped up considerably ...
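One possible way to simplify the bit packing, sketched in standard C++ rather than VCL: keep an accumulator and emit whole bytes as soon as they are available. BitWriter and BitReader are hypothetical names, codes are assumed to be 1 to 24 bits wide, and the bits are stored MSB first as in the code above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs fixed- or variable-width codes into bytes, most significant bit first.
class BitWriter {
public:
    void put(std::uint32_t value, int bits) // assumes 1 <= bits <= 24
    {
        acc_ = (acc_ << bits) | (value & ((std::uint32_t{1} << bits) - 1));
        count_ += bits;
        while (count_ >= 8) { // flush every complete byte
            count_ -= 8;
            out_.push_back(static_cast<std::uint8_t>(acc_ >> count_));
        }
    }
    std::vector<std::uint8_t> finish() // zero-pads the last partial byte
    {
        if (count_ > 0) {
            out_.push_back(static_cast<std::uint8_t>(acc_ << (8 - count_)));
            count_ = 0;
        }
        return out_;
    }
private:
    std::vector<std::uint8_t> out_;
    std::uint32_t acc_ = 0; // only the low count_ bits are pending
    int count_ = 0;
};

// Reads codes back in the same order and width they were written.
class BitReader {
public:
    explicit BitReader(std::vector<std::uint8_t> data) : data_(std::move(data)) {}
    bool get(int bits, std::uint32_t& value) // false when too few bits remain
    {
        while (count_ < bits) {
            if (pos_ >= data_.size()) return false;
            acc_ = (acc_ << 8) | data_[pos_++];
            count_ += 8;
        }
        count_ -= bits;
        value = (acc_ >> count_) & ((std::uint32_t{1} << bits) - 1);
        return true;
    }
private:
    std::vector<std::uint8_t> data_;
    std::size_t pos_ = 0;
    std::uint32_t acc_ = 0;
    int count_ = 0;
};
```

Writing the 12-bit codes 0x061, 0x062, 0x061 this way gives the bytes 06 10 62 06 10, which is the same nibble layout as the start of the hex dump further down.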
Also, to debug the 4-bit aligned coding, I print the encoded data as hex (the hex string is twice as long as its ASCII version) like this (ignore the VCL stuff):
AnsiString txt="abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa",enc,dec,hex;
enc=LZW_encode(txt);
dec=LZW_decode(enc);
// convert to hex
hex=""; for (int i=1,l=enc.Length();i<=l;i++) hex+=AnsiString().sprintf("%02X",enc[i]);
mm_log->Lines->Add("\""+txt+"\"");
mm_log->Lines->Add("\""+hex+"\"");
mm_log->Lines->Add("\""+dec+"\"");
mm_log->Lines->Add(AnsiString().sprintf("ratio: %i%%",(100*enc.Length()/dec.Length())));
and result:
"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
"06106206106306410210510406106410FFFFFF910A10706110FFFFFFD10E06206311110910FFFFFFE11410FFFFFFD0"
"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
ratio: 81%

cast void* based on enum value in C++

I am writing a Python library in C++ using the Python C API. I have about 25 functions that all accept two strings. Python may store a string as utf8/16/32 (the moment one character requires a bigger size, the whole string uses the bigger size). When checking which kind a string is, you get an enum value between 0 and 4: 0 and 4 should be handled as utf32, 1 as utf8 and 2 as utf16. So I currently have a nested switch for each combination.
The following example shows how the elements are handled in my code. random_func is different for each of my functions and is a template that accepts a string_view of any type. Writing the code this way results in about 100 lines of boilerplate for each function that accepts two strings.
Is there a way to handle all these cases without this immense code duplication and without sacrificing performance?
double result = 0;
Py_ssize_t len_s1 = PyUnicode_GET_LENGTH(py_s1);
void* s1 = PyUnicode_DATA(py_s1);
Py_ssize_t len_s2 = PyUnicode_GET_LENGTH(py_s2);
void* s2 = PyUnicode_DATA(py_s2);
int s1_kind = PyUnicode_KIND(py_s1);
int s2_kind = PyUnicode_KIND(py_s2);
switch (s1_kind) {
case PyUnicode_1BYTE_KIND:
switch (s2_kind) {
case PyUnicode_1BYTE_KIND:
result = random_func(
basic_string_view<char>(static_cast<char*>(s1), len_s1),
basic_string_view<char>(static_cast<char*>(s2), len_s2));
break;
case PyUnicode_2BYTE_KIND:
result = random_func(
basic_string_view<char>(static_cast<char*>(s1), len_s1),
basic_string_view<char16_t>(static_cast<char16_t*>(s2), len_s2));
break;
default:
result = random_func(
basic_string_view<char>(static_cast<char*>(s1), len_s1),
basic_string_view<char32_t>(static_cast<char32_t*>(s2), len_s2));
break;
}
break;
case PyUnicode_2BYTE_KIND:
switch (s2_kind) {
case PyUnicode_1BYTE_KIND:
result = random_func(
basic_string_view<char16_t>(static_cast<char16_t*>(s1), len_s1),
basic_string_view<char>(static_cast<char*>(s2), len_s2));
break;
case PyUnicode_2BYTE_KIND:
result = random_func(
basic_string_view<char16_t>(static_cast<char16_t*>(s1), len_s1),
basic_string_view<char16_t>(static_cast<char16_t*>(s2), len_s2));
break;
default:
result = random_func(
basic_string_view<char16_t>(static_cast<char16_t*>(s1), len_s1),
basic_string_view<char32_t>(static_cast<char32_t*>(s2), len_s2));
break;
}
break;
default:
switch (s2_kind) {
case PyUnicode_1BYTE_KIND:
result = random_func(
basic_string_view<char32_t>(static_cast<char32_t*>(s1), len_s1),
basic_string_view<char>(static_cast<char*>(s2), len_s2));
break;
case PyUnicode_2BYTE_KIND:
result = random_func(
basic_string_view<char32_t>(static_cast<char32_t*>(s1), len_s1),
basic_string_view<char16_t>(static_cast<char16_t*>(s2), len_s2));
break;
default:
result = random_func(
basic_string_view<char32_t>(static_cast<char32_t*>(s1), len_s1),
basic_string_view<char32_t>(static_cast<char32_t*>(s2), len_s2));
break;
}
break;
}
Put the complexity away in a function using variants
using python_string_view = std::variant<std::basic_string_view<char>,
std::basic_string_view<char16_t>,
std::basic_string_view<char32_t>>;
python_string_view decode_python_string(python_string py_str)
{
Py_ssize_t len_s = PyUnicode_GET_LENGTH(py_str);
void* s = PyUnicode_DATA(py_str);
int s_kind = PyUnicode_KIND(py_str);
switch (s_kind) {
//return correct string_view here
}
}
int main()
{
python_string s1 = ..., s2 = ...;
auto v1 = decode_python_string(s1);
auto v2 = decode_python_string(s2);
std::visit([](auto&& val1, auto&& val2) {
random_func(val1, val2);
}, v1, v2);
}
I'm unsure about the performance though.
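To make the variant idea concrete without the Python headers, here is a self-contained sketch. Kind stands in for the PyUnicode_KIND values, makeView plays the role of decode_python_string, and commonPrefixLength is only a placeholder for random_func; all three names are assumptions for illustration:

```cpp
#include <cstddef>
#include <string_view>
#include <variant>

// Stand-in for the PyUnicode kind constants (hypothetical, no Python headers needed).
enum class Kind { OneByte = 1, TwoByte = 2, FourByte = 4 };

using AnyStringView = std::variant<std::basic_string_view<char>,
                                   std::basic_string_view<char16_t>,
                                   std::basic_string_view<char32_t>>;

// The only place that switches on the kind: one switch with three cases.
AnyStringView makeView(const void* data, std::size_t len, Kind kind)
{
    switch (kind) {
    case Kind::OneByte:
        return std::basic_string_view<char>(static_cast<const char*>(data), len);
    case Kind::TwoByte:
        return std::basic_string_view<char16_t>(static_cast<const char16_t*>(data), len);
    default:
        return std::basic_string_view<char32_t>(static_cast<const char32_t*>(data), len);
    }
}

// Placeholder for random_func: length of the common prefix (ASCII assumed for char).
template <typename C1, typename C2>
double commonPrefixLength(std::basic_string_view<C1> a, std::basic_string_view<C2> b)
{
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           static_cast<char32_t>(a[n]) == static_cast<char32_t>(b[n])) {
        ++n;
    }
    return static_cast<double>(n);
}

// std::visit instantiates the lambda for all 3 x 3 combinations.
double dispatch(const AnyStringView& s1, const AnyStringView& s2)
{
    return std::visit([](auto a, auto b) { return commonPrefixLength(a, b); }, s1, s2);
}
```

Each of the 25 functions then needs only its own one-line dispatch; the nine instantiations are still generated, but by the compiler rather than by hand. Whether the indirection in std::visit costs anything measurable compared with the nested switch would need benchmarking.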
For what it is worth:
The difference it makes to have different char types is at the moment you extract the character values inside random_func (requiring nine template specializations, if I am right).
You would be close to a solution by fetching the chars in all cases using the largest type and masking out or shifting out the extra bytes where necessary. Instead of templating, you would pass a suitable mask and a stride information. Something like
for (const char* p = (const char*)s1; (*(const char32_t*)p & mask) != 0; p += stride)
{
…
}
Unfortunately, not counting the extra masking operation, you hit a wall because you may have to fetch too many bytes at one end of the string, causing an illegal memory access.

How to write custom binary file handler in c++ with serialisation of custom objects?

I have some structures I want to serialise and deserialise to be able to pass them from program to program (as a save file), and to be manipulated by other programs (make minor changes....).
I've read through:
Document that describes isocpp serialisation explanation
SO questions that show how to read blocks
SO question how to reading and writing binary files
Benchmarking different file handlers speed and reliance
Serialisation "intro"
But I didn't find anywhere how to get from having some class or struct to a serialised structure that you can then read, write, manipulate... be it a single structure per file or a sequence (multiple lists of multiple structure types per file).
How to write a custom binary file handler in C++ with serialisation of custom objects?
Before we start
Most new users aren't familiar with the different data types in C++ and often use plain int, char, etc. in their code. To serialise successfully, you need to think thoroughly about your data types. So these are your first steps if you have an int lying around somewhere:
Know your data
    What is the maximum value that the variable should hold?
    Can it be negative?
Limit your data
    Implement the decisions from above.
    Limit the number of objects your file can hold.
Know your data
If you have a struct or a class with some data such as:
struct cat {
int weight = 0; // In kg (pounds)
int length = 0; // In cm (or feet)
std::string voice = "meow.mp3";
cat() {}
cat(int weight, int length): weight(weight), length(length) {}
};
Can your cat really weigh around 255 kg (the maximum value of a 1-byte unsigned integer)? Can it be as long as 255 cm (2.55 m)? Does the voice of your cat change with every cat object?
Members that don't change between objects should be declared static, and you should limit your object size to best fit your needs. So in this example the answer to all of the questions above is no.
So our cat struct now looks like this:
struct cat {
uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
uint8_t length = 0; // Same for length
static std::string voice;
cat() {}
cat(uint8_t w, uint8_t l):weight(w), length(l) {}
};
std::string cat::voice = "meow.mp3";
Files are written byte by byte (often as character sets) and your data can vary so you need to presume or limit maximum value your data can handle.
But not every project (or structure) is the same, so let's talk about differences of your code data and binary structured data. When thinking about serialisation you need to think in this manner "what is bare minimum of data that this structure needs to be unique?".
Our cat object can represent any cat except:
tigers: max 390 kg, and 340 cm
lions : max 315 kg, and 365 cm
Anything else is eligible. You can vary your "meow.mp3" depending on the size and weight, so the most important data that makes a cat unique is its length and weight. That is the data we need to save to our file.
Limit your data
The largest zoo in the world has 5000 animals and 700 species, which means that on average each species in the zoo has a population of about 7. So for our cat species a 1-byte count (up to 255 cats) is enough, and we need not fear that it will overflow.
So it is safe to assume that our zoo project should hold up to 200 elements per species. This leaves us with two byte-sized values per cat, so the serialised data for our struct is at most two bytes.
Approach to serialisation
Constructing our cat block
This gives us the right foundation for approaching custom serialisation. All that is left is to define a structured binary format. For that, we need a way to recognise whether our two bytes are part of a cat or of some other structure, which can be done either with a same-type collection (every two bytes are a cat) or with an identifier.
If we have a single file (or part of a file) that holds all cats, we just need the start offset and the size of the cat bytes, and then we read every two bytes from the start offset to get all cats.
An identifier lets us tell from the starting character whether the object is a cat or something else. This is commonly done with the TLV (Type Length Value) format, where the type would be Cat, the length would be two bytes, and the value would be those two bytes.
As you can see, the first option contains fewer bytes and is therefore more compact, but with the second option we can store multiple kinds of animals in our file and make a zoo. How you structure your binary files depends a lot on your project. Since the "single file" option is the more straightforward one, I will implement the second one.
The most important thing about the "identifier" approach is to first make it logical for us, and then make it logical for our machine. I come from a world where reading from left to right is the norm. So it is logical that the first thing I want to read about a cat is its type, then the length, and then the value.
char type = 'C'; // C shorten for Cat, 0x43
uint8_t length = 2; // It holds 2 bytes, 0x02
uint8_t c_length = '?'; // cats length
uint8_t c_weight = '?'; // cats weight
And to represent it as a chunk (block):
+00 4B 43-02-LL-WW ('C\x02LW')
Where this means:
+00: offset from the start; 0 means it is the start of the file
4B: size of our data block, 4 bytes.
43-02-LL-WW: actual value of cat
43: hexadecimal representation of character 'C'
02: hexadecimal representation of length of this type (2)
LL: length of this cat of 1 byte value
WW: weight of this cat of 1 byte value
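The chunk layout above can be sketched byte by byte. CatRecord, packCat and unpackCat are hypothetical names used only for this illustration; working on individual bytes sidesteps the endianness question, because no multi-byte integer is involved:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical record holding only the two values that are saved to the file.
struct CatRecord {
    std::uint8_t length; // LL
    std::uint8_t weight; // WW
};

// Type 'C' (0x43), payload length 2, then LL and WW: 43-02-LL-WW.
std::array<std::uint8_t, 4> packCat(const CatRecord& c)
{
    return {0x43, 0x02, c.length, c.weight};
}

// Returns an empty optional when the type or length byte does not match.
std::optional<CatRecord> unpackCat(const std::array<std::uint8_t, 4>& bytes)
{
    if (bytes[0] != 0x43 || bytes[1] != 0x02) return std::nullopt;
    return CatRecord{bytes[2], bytes[3]};
}
```

For a cat of length 30 and weight 4, packCat produces the bytes 43-02-1E-04, and unpackCat rejects any chunk that does not start with 43-02.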
But since it is easier for me to read data from left to right, my data should be written big endian (most significant byte first), while most desktop computers are little endian.
Endianness and its importance
The main issue here is the endianness of our machine; because our struct/class must not depend on it, we need a base type. The way we wrote the chunk defines a big-endian layout, but machines can have either endianness, and you can find out how to find which your machine has here.
For users experienced with bit fields I would strongly suggest that you use them for this. But for unfamiliar users:
#include <cstdint> // uint8_t, uint16_t, uint32_t
#include <iostream> // Just for std::ostream, std::cout, and std::endl
// True when the host stores the most significant byte first
bool is_big() {
union {
uint16_t w;
uint8_t p[2];
} p;
p.w = 0x0001;
return p.p[0] == 0x00; // On a big-endian host the low byte 0x01 comes last
}
union chunk {
uint32_t space;
uint8_t parts[4];
};
chunk make_chunk(uint32_t VAL) {
union chunk ret;
ret.space = VAL;
return ret;
}
// Prints the four bytes most significant first, whatever the host order is
std::ostream& operator<<(std::ostream& os, const union chunk &c) {
if(is_big()) {
return os << c.parts[0] << c.parts[1] << c.parts[2] << c.parts[3];
}else {
return os << c.parts[3] << c.parts[2] << c.parts[1] << c.parts[0];
}
}
// VAL holds four bytes exactly as they appear in the file (big endian)
void read_as_binary(union chunk &t, uint32_t VAL) {
t.space = VAL;
if(!is_big()) { // Little-endian host: reverse the bytes
t.space = (uint32_t(t.parts[0]) << 24) | (uint32_t(t.parts[1]) << 16) | (uint32_t(t.parts[2]) << 8) | t.parts[3];
}
}
// VAL receives the four bytes in the order they should appear in the file
void write_as_binary(union chunk t, uint32_t &VAL) {
if(!is_big()) { // Little-endian host: reverse the bytes
t.space = (uint32_t(t.parts[0]) << 24) | (uint32_t(t.parts[1]) << 16) | (uint32_t(t.parts[2]) << 8) | t.parts[3];
}
VAL = t.space;
}
So now we have our chunk that will print out characters in the order we can recognise it at first glance. Now we need a set of casting functionality from uint32_t to our cat since our chunk size is 4 bytes or uint32_t.
struct cat {
uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
uint8_t length = 0; // The same for length
static std::string voice;
cat() {}
cat(uint8_t w, uint8_t l): weight(w), length(l) {}
cat(union chunk cat_chunk) {
if((cat_chunk.space & 0xffff0000) == 0x43020000) { // Type 'C' (0x43), length 2
this->length = cat_chunk.space & 0xff; // Shifts and masks work on the value, so they do not depend on endianness
this->weight = (cat_chunk.space >> 8) & 0xff;
} else {
// Some error handling: not a cat chunk, so fall back to an empty cat
this->weight = 0;
this->length = 0;
}
}
operator uint32_t() {
return 0x43020000 | (this->weight << 8) | this->length;
}
};
std::string cat::voice = "meow.mp3";
Zoo file structure
So now we have our cat object ready to be cast back and forth between chunk and cat. Now we need to structure a whole file with a header, footer, data, and checksums. Let's say we are building an application that keeps track of how many animals each zoo facility has. The data of our zoo is which animals it has and how many. The footer of our zoo file can be omitted (or it can hold the timestamp of when the file was created), and in the header we save instructions on how to read our file, versioning, and corruption checks.
For more information how I structured these files you can find sources here and this shameless plug.
// File structure: all big endian (most significant byte first)
------------
HEADER:
+00 4B 89-5A-4F-4F ('\211ZOO') Our magic number for the zoo file
+04 4B XX-XX-XX-XX ('????') Whole file checksum
+08 4B 0D-0A-1A-0A ('\r\n\032\n') // CRLF <-> LF conversion and END OF FILE 032
+12 4B YY-YY-00-ZZ ('??\0?') Versioning and usage
+16 4B AA-BB-BB-BB ('X???') Start offset + data length
------------
DATA:
Animals: // For each animal type (block identifier)
+20+?? 4B ??-XX-XX-LL ('????') : ? animal type identifier, X start offset from header, L number of animals in struct objects
+24+??+4 4B XX-XX-XX-XX ('????') : Checksum for animal type
For checksums, you can use the normal ones (manually add each byte) or among others CRC-32. The choice is yours, and it depends on the size of your files and data. So now we have data for our file. Of course, I must warn you:
Having only one structure or class that requires serialisation means that in general this type of serialisation isn't needed. You can just cast the whole object to the integer of desirable size and then to a binary character sequence, and then read that character sequence of some size into an integer and back to the object. The real value of serialisation is that we can store multiple data and find our way in that binary mess.
But a zoo can hold more data than just which animals we have, and that data can vary in size, in chunks. So we need to make an interface or abstract class for file handling.
#include <fstream> // File input output ...
#include <vector> // Collection for writing data
#include <sys/types.h> // Gets the types for struct stat
#include <sys/stat.h> // Struct stat
#include <string> // String manipulations
#include <cstdint> // uint32_t
#include <iostream> // std::ostream for operator<<
struct handle {
// Members
protected: // Inherited in private
std::string extn = "o";
bool acces = false;
struct stat buffer;
std::string filename = "";
std::vector<chunk> data;
public: // Inherited in public
std::string name = "genesis";
std::string path = "";
// Methods
protected:
void remake_name() {
this->filename = this->path;
if(this->filename != "") {
this->filename.append("//");
}
this->filename.append(this->name);
this->filename.append(".");
this->filename.append(this->extn);
}
void recheck() {
this->acces = (
stat(
this->filename.c_str(),
&this->buffer
) == 0);
}
// To be overwritten later on [override]
virtual bool check_header() { return true;}
virtual bool check_footer() { return true;}
virtual bool load_header() { return true;}
virtual bool load_footer() { return true;}
public:
handle()
: acces(false),
name("genesis"),
extn("o"),
filename(""),
path(""),
data(0) {}
void operator()(const char *name, const char *ext, const char *path) {
this->path = std::string(path);
this->name = std::string(name);
this->extn = std::string(ext);
this->remake_name();
this->recheck();
}
void set_prefix(const char *prefix) {
std::string prn(prefix);
prn.append(this->name);
this->name = prn;
this->remake_name();
}
void set_suffix(const char *suffix) {
this->name.append(suffix);
this->remake_name();
}
int write() {
this->remake_name();
this->recheck();
if(!this->load_header()) {return 0;}
if(!this->load_footer()) {return 0;}
std::fstream file(this->filename.c_str(), std::ios::out | std::ios::binary);
uint32_t temp = 0;
for(int i = 0; i < this->data.size(); i++) {
write_as_binary(this->data[i], temp);
file.write((char *)(&temp), sizeof(temp));
}
if(!this->check_header()) { file.close();return 0; }
if(!this->check_footer()) { file.close();return 0; }
file.close();
return 1;
}
int read() {
this->remake_name();
this->recheck();
if(!this->acces) {return 0;}
std::fstream file(this->filename.c_str(), std::ios::in | std::ios::binary);
uint32_t temp = 0;
chunk ctemp;
size_t fsize = this->buffer.st_size/4;
for(int i = 0; i < fsize; i++) {
file.read((char*)(&temp), sizeof(temp));
read_as_binary(ctemp, temp);
this->data.push_back(ctemp);
}
if(!this->check_header()) {
file.close();
this->data.clear();
return 0;
}
if(!this->check_footer()) {
file.close();
this->data.clear();
return 0;
}
return 1;
}
// Friends
friend std::ostream& operator<<(std::ostream& os, const handle& hand);
friend handle& operator<<(handle& hand, chunk& c);
friend handle& operator>>(handle& hand, chunk& c);
friend struct zoo_file;
};
std::ostream& operator<<(std::ostream& os, const handle& hand) {
for(int i = 0; i < hand.data.size(); i++) {
os << "\t" << hand.data[i] << "\n";
}
return os;
}
handle& operator<<(handle& hand, chunk& c) {
hand.data.push_back(c);
return hand;
}
handle& operator>>(handle& hand, chunk& c) {
c = hand.data[ hand.data.size() - 1 ];
hand.data.pop_back();
return hand;
}
From this we can initialise our zoo object and, later on, whichever others we need. The file handle is just a file template containing a data block (handle.data); headers and footers are implemented later on.
Since headers describe whole files, checking and loading can have whatever added functionality your specific case needs. If you have two different kinds of objects that you need to add to the file, then instead of changing the headers/footers, insert one type at the start of the data and push_back the other type at the end of the data via overloaded operator<</operator>>.
For multiple objects that have no relationship to each other, you can add more private members in the derived class to store the current position of the individual segments, while keeping things neat and organised for file writing and reading.
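As a rough illustration of keeping two kinds of records in one data block, here is a minimal sketch. It is not the handle class itself: the names two_part_block, add_front, add_back and front_count are hypothetical, and plain uint32_t words stand in for chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: records of type A are kept at the front of the
// block, records of type B at the back.
struct two_part_block {
    std::vector<std::uint32_t> data;
    std::size_t front_count = 0;  // number of leading words that belong to type A

    void add_front(std::uint32_t word) {
        data.insert(data.begin(), word);  // type A goes to the start
        ++front_count;
    }
    void add_back(std::uint32_t word) {
        data.push_back(word);             // type B goes to the end
    }
    std::size_t back_count() const { return data.size() - front_count; }
};
```

The extra front_count member plays the role of the "current position" mentioned above: it marks where one segment ends and the other begins.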
struct zoo_file: public handle {
zoo_file() {this->extn = "zoo";}
void operator()(const char *name, const char *path) {
this->path = std::string(path);
this->name = std::string(name);
this->remake_name();
this->recheck();
}
protected:
virtual bool check_header() {
chunk temp = this->data[0];
uint32_t checksums = 0;
// Magic number
if(temp.space != 0x895A4F4F) {
this->data.clear();
return false;
}else {
this->data.erase(this->data.begin());
}
// Checksum
temp = this->data[0];
checksums = temp.space;
this->data.erase(this->data.begin());
// Valid load number
temp = this->data[0];
if(temp.space != 0x470D1A0A) {
this->data.clear();
return false;
}else {
this->data.erase(this->data.begin());
}
// Version + flag
temp = this->data[0];
if((temp.space & 0x01000000) != 0x01000000) { // If not version 1.0
this->data.clear();
return false;
}else {
this->data.erase(this->data.begin());
}
temp = this->data[0];
int opt_size = (temp.space >> 24);
if(opt_size != 20) {
this->data.clear();
return false;
}
opt_size = temp.space & 0xffffff;
return (opt_size == this->data.size());
}
virtual bool load_header() {
chunk magic, checksum, vload, ver_flag, off_data;
magic = 0x895A4F4F;
checksum = 0;
vload = 0x470D1A0A;
ver_flag = 0x01000001; // 1.0, usage 1 (normal)
off_data = (20 << 24) | ((this->data.size()-1)-4);
for(int i = 0; i < this->data.size(); i++) {
checksum.space += this->data[i].parts[0];
checksum.space += this->data[i].parts[1];
checksum.space += this->data[i].parts[2];
checksum.space += this->data[i].parts[3];
}
this->data.insert(this->data.begin(), off_data);
this->data.insert(this->data.begin(), ver_flag);
this->data.insert(this->data.begin(), vload);
this->data.insert(this->data.begin(), checksum);
this->data.insert(this->data.begin(), magic);
return true;
}
// Signatures must match the definitions below (cat taken by reference)
friend zoo_file& operator<<(zoo_file& zf, cat &sc);
friend zoo_file& operator>>(zoo_file& zf, cat &sc);
friend zoo_file& operator<<(zoo_file& zf, elephant &se);
friend zoo_file& operator>>(zoo_file& zf, elephant &se);
};
zoo_file& operator<<(zoo_file& zf, cat &sc) {
union chunk temp;
temp = (uint32_t)sc;
zf.data.push_back(temp);
return zf;
}
zoo_file& operator>>(zoo_file& zf, cat &sc) {
size_t pos = zf.data.size();
// Search backwards for the last chunk carrying the cat identifier
while (pos > 0) {
pos--;
if((zf.data[pos].space & 0x4302000) == 0x4302000) {
union chunk temp = zf.data[pos];
zf.data.erase(zf.data.begin() + pos);
sc = (uint32_t)temp;
break;
}
}
// If no cat chunk was found, sc and the data block are left untouched
return zf;
}
// same for elephants, koyotes, giraffes .... whatever you need
Please don't just copy the code. The handle object is meant as a template, so how you structure your data block is up to you. If you have a different structure and just copy the code, of course it won't work.
And now we can have a zoo with only cats. Building a file is as easy as:
// All necessary includes
// Writing the zoo file
zoo_file my_zoo;
// Push some cats back into the std::vector here
my_zoo("superb_zoo", ""); // name, path (an empty path means the current directory)
my_zoo.write();
// Reading the zoo file (in a separate scope or program)
zoo_file my_zoo;
my_zoo("superb_zoo", "");
my_zoo.read();

Comparing USART-received uint8_t* data with a constant string

I'm working on an Arduino Due, trying to use DMA functions as I'm working on a project where speed is critical. I found the following function to receive through serial:
uint8_t DmaSerial::get(uint8_t* bytes, uint8_t length) {
// Disable receive PDC
uart->UART_PTCR = UART_PTCR_RXTDIS;
// Wait for PDC disable to take effect
while (uart->UART_PTSR & UART_PTSR_RXTEN);
// Modulus needed if RNCR is zero and RPR counts to end of buffer
rx_tail = (uart->UART_RPR - (uint32_t)rx_buffer) % DMA_SERIAL_RX_BUFFER_LENGTH;
// Make sure RPR follows (actually only needed if RPR is counted to the end of buffer and RNCR is zero)
uart->UART_RPR = (uint32_t)rx_buffer + rx_tail;
// Update fill counter
rx_count = DMA_SERIAL_RX_BUFFER_LENGTH - uart->UART_RCR - uart->UART_RNCR;
// No bytes in buffer to retrieve
if (rx_count == 0) { uart->UART_PTCR = UART_PTCR_RXTEN; return 0; }
uint8_t i = 0;
while (length--) {
bytes[i++] = rx_buffer[rx_head];
// If buffer is wrapped, increment RNCR, else just increment the RCR
if (rx_tail > rx_head) { uart->UART_RNCR++; } else { uart->UART_RCR++; }
// Increment head and account for wrap around
rx_head = (rx_head + 1) % DMA_SERIAL_RX_BUFFER_LENGTH;
// Decrement counter keeping track of amount data in buffer
rx_count--;
// Buffer is empty
if (rx_count == 0) { break; }
}
// Turn on receiver
uart->UART_PTCR = UART_PTCR_RXTEN;
return i;
}
So, as far as I understand, this function writes whatever is received into the buffer that bytes points to, up to at most length bytes. So I'm calling it this way:
dma_serial1.get(data, 8);
without assigning its return value to a variable. I think the received value is stored in the uint8_t* data, but I might be wrong.
Finally, what I want to do is to check if the received data is a certain char to take decisions, like this:
if (data == "t") {
// do something
}
How could I make this work?
To compare strings as intended by if (data == "t"), you need a string comparison function such as strcmp. For this to work, you must ensure that the arguments are actually (0-terminated) C-strings, so keep the return value of get (the number of bytes received) and terminate the buffer with it:
uint8_t data[9];
uint8_t size = dma_serial1.get(data, 8);
data[size]='\0';
if (strcmp(data,"t")==0) {
...
}
Since data is an array of uint8_t (unsigned char) while the string functions expect const char*, passing data to them directly requires a cast; in C++ the two pointer types do not convert implicitly:
if (strcmp(reinterpret_cast<const char*>(data),"t")==0) {
...
}
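So a complete MCVE could look as follows:
#include <cstdint>
#include <cstring>
#include <iostream>
using namespace std;
// Stand-in for dma_serial1.get(): pretends a single 't' was received
int get(uint8_t *data, int size) {
data[0] = 't';
return 1;
}
int main()
{
uint8_t data[9];
uint8_t size = get(data, 8);
data[size]='\0';
if (strcmp(reinterpret_cast<const char*>(data),"t")==0) {
cout << "found 't'" << endl;
}
}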
Output:
found 't'
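If you would rather not depend on the terminating '\0', a possible alternative is to compare the received bytes directly with memcmp, using the byte count returned by get. This is a minimal sketch, and the helper name matches is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true only when exactly `size` bytes were received
// and they are identical to the characters of `expected`.
bool matches(const std::uint8_t* data, std::size_t size, const char* expected) {
    const std::size_t expected_len = std::strlen(expected);
    return size == expected_len && std::memcmp(data, expected, expected_len) == 0;
}
```

With the code from the question this could be called as matches(data, dma_serial1.get(data, 8), "t"). Because memcmp takes const void*, no cast is needed, and checking the length first means a longer message that merely starts with 't' does not match.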