I'm trying to serialize a struct containing multiple data types to a binary file in C++ (Visual Studio) and deserialize it all at once, but I'm running into memory allocation problems with strings when reading the data back. I know what the problem is, and I know there are open source libraries I could use, but I don't want to use them unless it is really necessary. I also know that I can write and read the members one by one, but that is too tedious for a struct with a large number of members. I want to perform the write/read in one go without using any open source library.
Here is a struct for example:
struct Frame {
bool isPass{ true };
uint64_t address{ 0 };
uint32_t age{ 0 };
float marks{ 0.0 };
std::string userName;
};
Is there any way to perform the write/read operation in one go in binary format?
Thank you
Not using existing libraries is NEVER good. Still,...
You could, for example, create a pure virtual class like
class Serializable
{
public:
virtual std::vector<char> serialize() = 0;
};
Then:
1) You implement it for all your own classes that you have
2) You implement serialization methods for all STL and PoD types that you use (std::strings, PoD types and structs with only PoD types) inside of some static class. Basically, during serialization you can put there something like [size][type][data ~ [size][type][data][size][type][data]].
3) Then, when you process a class for serialization, you create a buffer, first put a size into it, then type identifier, then put all bytes from all members serialized by those you have implemented in 1) and 2)
4) When you read anything from such an array, you do the same backwards: read N bytes from an array (first field), determine its actual type (second field), read all members, deserialize all stuff included.
The process is recursive.
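As a rough illustration of the idea, here is a minimal sketch for the Frame struct from the question. It is deliberately simplified: it writes the fixed-size members as raw bytes and puts a length prefix only in front of the string, with no type identifiers. The helpers put and get and the 32-bit length prefix are assumptions made for this example, and the byte layout is only portable between builds with the same endianness and type sizes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Frame {
    bool isPass{ true };
    std::uint64_t address{ 0 };
    std::uint32_t age{ 0 };
    float marks{ 0.0f };
    std::string userName;
};

// Hypothetical helper: append the raw bytes of a trivially copyable value.
template <typename T>
void put(std::vector<char>& out, const T& value) {
    const char* p = reinterpret_cast<const char*>(&value);
    out.insert(out.end(), p, p + sizeof(T));
}

// Hypothetical helper: copy a value out of the buffer, with a bounds check.
template <typename T>
bool get(const std::vector<char>& in, std::size_t& pos, T& value) {
    if (in.size() - pos < sizeof(T)) return false;
    std::memcpy(&value, in.data() + pos, sizeof(T));
    pos += sizeof(T);
    return true;
}

std::vector<char> serialize(const Frame& f) {
    std::vector<char> out;
    put(out, f.isPass);
    put(out, f.address);
    put(out, f.age);
    put(out, f.marks);
    // The string is stored as [length][characters], never as the std::string object itself.
    const std::uint32_t len = static_cast<std::uint32_t>(f.userName.size());
    put(out, len);
    out.insert(out.end(), f.userName.begin(), f.userName.end());
    return out;
}

bool deserialize(const std::vector<char>& in, Frame& f) {
    std::size_t pos = 0;
    std::uint32_t len = 0;
    if (!get(in, pos, f.isPass) || !get(in, pos, f.address) ||
        !get(in, pos, f.age) || !get(in, pos, f.marks) || !get(in, pos, len))
        return false;
    if (in.size() - pos < len) return false;  // truncated string data
    f.userName.assign(in.data() + pos, len);
    return true;
}
```

The resulting vector can be written to or read from a binary file with a single write/read call, which is probably as close to "one go" as a struct holding a std::string can get.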
But... man, it's really a bad idea. Use protobuf, or boost::serialization. Or anything else - there are a lot of serialization libraries on the internet. Read these precious comments under your question. People are right.
Assuming you have some other mechanism to keep track of your frame size, rewrite your struct as:
struct Frame {
bool isPass{ true };
uint8_t pad1[3]{};
uint32_t age{ 0 };
uint64_t address{ 0 };
double marks{ 0.0 };
char userName[];
};
If we have a pointer Frame* frame, we can write it using write(fd, frame, frame_size), where frame_size > sizeof(Frame).
Assuming you have read the frame into a buffer, you can access the data using:
auto frame = reinterpret_cast<const Frame*>(buf);
The length of userName will therefore be frame_size - sizeof(Frame). You can now access the elements through your struct.
This is very C-like, and the approach is limited to a single variable-length element at the end of the struct.
I have a binary file with some layout I know. For example let format be like this:
2 bytes (unsigned short) - length of a string
5 bytes (5 x chars) - the string - some id name
4 bytes (unsigned int) - a stride
24 bytes (6 x float - 2 strides of 3 floats each) - float data
The file should look like (I added spaces for readability):
5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5
Here 5 - is 2 bytes: 0x05 0x00. "hello" - 5 bytes and so on.
Now I want to read this file. Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
P.S.: while writing the question, a clearer formulation came to mind: how do I cast an arbitrary number of bytes from an arbitrary position in a char [x]?
Update: I forgot to mention explicitly that string and float data length is not known at compile time and is variable.
If it is not for learning purpose, and if you have freedom in choosing the binary format you'd better consider using something like protobuf which will handle the serialization for you and allow to interoperate with other platforms and languages.
If you cannot use a third-party API, you may look at QDataStream (its documentation and source code) for inspiration.
The C way, which would work fine in C++, would be to declare a struct:
#pragma pack(1)
struct contents {
// data members;
};
Note that
You need to use a pragma to make the compiler align the data as-it-looks in the struct;
This technique only works with POD types
And then cast the read buffer directly into the struct type:
std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());
Now if your data's size is variable, you can separate in several chunks. To read a single binary object from the buffer, a reader function comes handy:
template<typename T>
const char *read_object(const char *buffer, T& target) {
target = *reinterpret_cast<const T*>(buffer);
return buffer + sizeof(T);
}
The main advantage is that such a reader can be specialized for more advanced c++ objects:
template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
size_t size = target.size();
CT const *buf_start = reinterpret_cast<const CT*>(buffer);
std::copy(buf_start, buf_start + size, target.begin());
return buffer + size * sizeof(CT);
}
And now in your main parser:
int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);
Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.
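One way to sidestep the alignment concern is to copy the bytes instead of dereferencing a cast pointer. The following is a sketch of that variation, not the code from the answer above: read_object_copy is a hypothetical name, and the function still assumes the buffer holds the value in the host's byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Variation on read_object: std::memcpy works for any buffer address,
// so a misaligned source is not a problem.
template <typename T>
const char* read_object_copy(const char* buffer, T& target) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be copied byte by byte");
    std::memcpy(&target, buffer, sizeof(T));
    return buffer + sizeof(T);
}
```

Compilers typically turn such a fixed-size memcpy into a plain load where the target allows it, so the copy rarely costs anything.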
Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
That last risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values that are aligned at an even address), performance (some CPUs will read misaligned values but slower; others like modern x86s are fine and fast) and/or endianness issues. I'd suggest reading the two characters then you can say (x[0] << 8) | x[1] or vice versa, using htons if needing to correct for endianness.
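A minimal sketch of the shift-and-or approach, with hypothetical helper names load_u16_le and load_u16_be. It takes unsigned char so that bytes above 0x7F are not sign-extended, and it gives the same result regardless of the host's endianness:

```cpp
#include <cstdint>

// Assemble a 16-bit value from two bytes stored least significant byte first.
inline std::uint16_t load_u16_le(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Same, for data stored most significant byte first.
inline std::uint16_t load_u16_be(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

For the file in the question, where the length 5 is stored as 0x05 0x00, load_u16_le is the one that applies.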
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
No need... just read directly into the string:
std::string s(the_size, ' ');
if (input_stream.read(&s[0], s.size()) &&
input_stream.gcount() == s.size())
...use s...
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
Better to read the data directly over the unsigned ints and floats, as that way the compiler will ensure correct alignment.
This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
struct Data
{
uint32_t x;
float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
input_stream.gcount() == sizeof data)
...use x and y...
Note the code above avoids reading data into potentially unaligned character arrays, wherein it's unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) due to alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it... it's legal to index past the declared array size of y as long as the memory content at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream. Simpler - but with an additional read so possibly slower - read the uint32_t first then new float[n] and do a further read into there....
Practically, this type of approach can work and a lot of low level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....
I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern I decided to use C++ templates.
On some specific platforms, a packed struct could work, however there are things it does not handle well... such as fields of variable length. With templates, however, there is no such issue: you can get arbitrarily complex structures (and return types).
A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:
using Buffer = std::pair<unsigned char const*, size_t>;
template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
UInt16LEReader() {}
explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}
uint16_t read(Buffer const& buffer) const {
OffsetReader const& reader = *this; // "or" is a reserved alternative token in C++, hence "reader"
size_t const offset = reader.read(buffer);
assert(offset <= buffer.second && "Incorrect offset");
assert(offset + 2 <= buffer.second && "Too short buffer");
unsigned char const* begin = buffer.first + offset;
// http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
return (uint16_t(begin[0]) << 0)
+ (uint16_t(begin[1]) << 8);
}
}; // class UInt16LEReader
// Likewise for UInt[8|16|32][LE|BE]...
Of course, the basic OffsetReader actually has a constant result:
template <size_t O>
class FixedOffsetReader {
public:
size_t read(Buffer const&) const { return O; }
}; // class FixedOffsetReader
and since we are talking templates, you can switch the types at leisure (you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).
What is interesting, though, is the end-result:
// http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
class LocalFileHeader {
public:
template <size_t O>
using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
template <size_t O>
using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;
UInt32< 0> signature;
UInt16< 4> versionNeededToExtract;
UInt16< 6> generalPurposeBitFlag;
UInt16< 8> compressionMethod;
UInt16<10> fileLastModificationTime;
UInt16<12> fileLastModificationDate;
UInt32<14> crc32;
UInt32<18> compressedSize;
UInt32<22> uncompressedSize;
using FileNameLength = UInt16<26>;
using ExtraFieldLength = UInt16<28>;
using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
using ExtraField = StringReader<
CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
ExtraFieldLength
>;
FileName filename;
ExtraField extraField;
}; // class LocalFileHeader
This is rather simplistic, obviously, but incredibly flexible at the same time.
An obvious axis of improvement would be to improve chaining since here there is a risk of accidental overlaps. My archive reading code worked the first time I tried it though, which was evidence enough for me that this code was sufficient for the task at hand.
I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:
BEGIN_STRUCT(foo)
UNSIGNED_SHORT(length)
STRING_FIELD(length, label)
UNSIGNED_INT(stride)
FLOAT_ARRAY(3 * stride)
END_STRUCT(foo)
Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.
NB I first saw this technique used in gcc for abstract syntax tree-related code generation.
If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).
It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low level foundation code like this.
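To make the technique concrete, here is a small sketch of the same idea using a single list macro instead of a separate include file (a common variation, often called an X-macro). The field list, the Header struct and the unpack function are all made up for this example, and only fixed-size fields are handled.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical field list: one X(type, name) entry per field, in file order.
#define HEADER_FIELDS(X)      \
    X(std::uint16_t, length)  \
    X(std::uint32_t, stride)  \
    X(float, scale)

// First expansion: generate the struct declaration.
struct Header {
#define DECLARE_FIELD(type, name) type name;
    HEADER_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Second expansion: generate code that copies each field out of a packed
// byte buffer, so the struct itself can keep its natural alignment.
inline const unsigned char* unpack(const unsigned char* p, Header& h) {
#define UNPACK_FIELD(type, name)           \
    std::memcpy(&h.name, p, sizeof(type)); \
    p += sizeof(type);
    HEADER_FIELDS(UNPACK_FIELD)
#undef UNPACK_FIELD
    return p;
}
```

Adding a field to HEADER_FIELDS updates both the declaration and the unpacking code, which is the main appeal of generating them from one description.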
It is better to declare a structure with 1-byte packing (how to do that depends on the compiler). Write using that structure, and read using the same structure. Put only POD types in the structure, so no std::string etc. Use this structure only for file I/O or other inter-process communication, and use a normal struct or class to hold the data for further use in the C++ program.
Since all of your data is variable, you can read the two blocks separately and still use casting:
struct id_contents
{
uint16_t len;
char id[];
} __attribute__((packed)); // assuming gcc, ymmv
struct data_contents
{
uint32_t stride;
float data[];
} __attribute__((packed)); // assuming gcc, ymmv
class my_row
{
const id_contents* id_;
const data_contents* data_;
size_t size_;
public:
my_row(const char* buffer) {
id_= reinterpret_cast<const id_contents*>(buffer);
size_ = sizeof(*id_) + id_->len;
data_ = reinterpret_cast<const data_contents*>(buffer + size_);
size_ += sizeof(*data_) +
data_->stride * sizeof(float); // or however many, 3*float?
}
size_t size() const { return size_; }
};
That way you can use Mr. kbok's answer to parse correctly:
const char* buffer = getPointerToDataSomehow();
my_row data1(buffer);
buffer += data1.size();
my_row data2(buffer);
buffer += data2.size();
// etc.
I personally do it this way:
// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)
someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;
Very effective way for fixed-size structs at the start of the file.
Use a serialization library. Here are a few:
Boost serialization and Boost fusion
Cereal (my own library)
Another library called cereal (same name as mine but mine predates theirs)
Cap'n Proto
The Kaitai Struct library provides a very effective declarative approach, which has the added bonus of working across programming languages.
After installing the compiler, you will want to create a .ksy file that describes the layout of your binary file. For your case, it would look something like this:
# my_type.ksy
meta:
  id: my_type
  endian: be # for big-endian, or "le" for little-endian
seq: # describes the actual sequence of data one-by-one
  - id: len
    type: u2 # unsigned short in C++, two bytes
  - id: my_string
    type: str
    size: 5
    encoding: UTF-8
  - id: stride
    type: u4 # unsigned int in C++, four bytes
  - id: float_data
    type: f4 # a four-byte floating point number
    repeat: expr
    repeat-expr: 6 # repeat six times
You can then compile the .ksy file using the kaitai struct compiler ksc:
# wherever the compiler is installed
# -t specifies the target language, in this case C++
/usr/local/bin/kaitai-struct-compiler my_type.ksy -t cpp_stl
This will create a my_type.cpp file as well as a my_type.h file, which you can then include in your C++ code:
#include <fstream>
#include <iostream>
#include <kaitai/kaitaistream.h>
#include "my_type.h"
int main()
{
std::ifstream ifs("my_data.bin", std::ifstream::binary);
kaitai::kstream ks(&ifs);
my_type_t obj(&ks);
std::cout << obj.len() << '\n'; // you can now access properties of the object
return 0;
}
Hope this helped! You can find the full documentation for Kaitai Struct here. It has a load of other features and is a fantastic resource for binary parsing in general.
I use the Ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. It does not use any file I/O or buffering, and it produces both easy-to-debug code and a .dot/.pdf file with the state machine diagram.
Ragel can also output Go, Java, ... code for parsing, but I have not used these features.
The key feature of Ragel is its ability to parse any byte-oriented data, but you can't dig into bit fields. Another limitation is that Ragel can parse regular structures but has no support for recursion or grammar-based syntax parsing.
I have a program that deals with char[] buffers to send/receive messages. Until now, this is how it has been handled:
#pragma pack(1)
struct messageType
{
uint8_t data0:4;
uint8_t data1:4;
uint8_t data2;
//etc...
};
#pragma pack()
void MyClass::processMessage(char* buf)
{
// I already know buf is big enough to hold a messageType
messageType* msg = reinterpret_cast<messageType*>(buf);
//populate class member variables
m_data0 = msg->data0;
m_data1 = msg->data1;
m_data2 = msg->data2;
//etc
}
Now from what I've gathered from reading around is that this is technically undefined behavior due to strict aliasing, and that memcpy should be used instead? What I don't quite understand is, what potential issues does copying buf byte for byte to messageType msgNotPtr, then reading from that msgNotPtr, actually avoid?
Regarding sending, instead of doing this:
void MyClass::sendMessage()
{
char buf[max_tx_size];
messageType* msg = reinterpret_cast<messageType*>(buf);
msg->data0 = m_data0;
//etc...
send(buf);
}
I've read that I should be using placement new instead, à la:
messageType* msg = new(buf) messageType;
If I do it this way, do I need to add additional cleanup, given that the struct messageType only contains POD types (such as manually firing the destructor)?
Edit: Now that I think about it, is sendMessage still undefined? Do I need to also swap out the last command with something like send(reinterpret_cast<char*>(msg)) to make sure the compiler does not optimize out the call?
In my opinion, the proper method to load a class instance from a uint8_t buffer is to load each member separately.
The class should know the positions of its members within the buffer, as well as the location and size of any padding or reserved areas.
One of the issues with mapping structures to buffers is that the compiler can add space between members. If you pack the structure to eliminate the padding, you slow down your program every time you access a member.
So, bite the performance on input and output by placing the members where you want them in the buffer (or extracting the members according to a specification). The rest of the program can access the members the way the compiler aligned them.
This includes bit fields.
Also, by having the members loaded individually, the endianness of the fields in the buffer can be accounted for much more easily.
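A minimal sketch of member-by-member loading for a message similar to the one in the question. The wire layout is an assumption made for this example: data0 in the low nibble of byte 0, data1 in the high nibble, data2 in byte 1, plus a hypothetical 16-bit big-endian field data3 in bytes 2 and 3 to show the endianness handling.

```cpp
#include <cstdint>

// In-memory form: ordinary members with whatever alignment the compiler picks.
struct Message {
    std::uint8_t data0 = 0;   // low nibble of wire byte 0 (assumed)
    std::uint8_t data1 = 0;   // high nibble of wire byte 0 (assumed)
    std::uint8_t data2 = 0;   // wire byte 1
    std::uint16_t data3 = 0;  // wire bytes 2-3, big-endian (hypothetical field)
};

inline Message decode(const std::uint8_t* buf) {
    Message m;
    m.data0 = static_cast<std::uint8_t>(buf[0] & 0x0F);
    m.data1 = static_cast<std::uint8_t>((buf[0] >> 4) & 0x0F);
    m.data2 = buf[1];
    m.data3 = static_cast<std::uint16_t>((buf[2] << 8) | buf[3]);
    return m;
}

inline void encode(const Message& m, std::uint8_t* buf) {
    buf[0] = static_cast<std::uint8_t>((m.data0 & 0x0F) | ((m.data1 & 0x0F) << 4));
    buf[1] = m.data2;
    buf[2] = static_cast<std::uint8_t>(m.data3 >> 8);
    buf[3] = static_cast<std::uint8_t>(m.data3 & 0xFF);
}
```

Because the code only ever reads and writes individual bytes, there is no packing pragma, no pointer cast to a struct type, and no dependence on how the compiler orders bit fields.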
I am using sqlite3 on an embedded system with Modbus. I need to pack the information from sqlite3's select statement results into an array of shorts to be able to pass over Modbus.
Currently, I am only using 2 data types from sqlite3 (TEXT and INT). I am trying to pack the results of each column into an array of shorts. For example:
typedef struct
{
short unitSN[4];
short unitClass[1];
}UnitSettings;
UnitSettings unitSettings;
// prepare and execute select statement for table, then put into structs members
s = sqlite3_prepare(db, sqlstmt, strlen(sqlstmt), &stmt, &pzTest);
s = sqlite3_step( stmt );
// I want to do something like this:
unitSettings.unitSN[] = sqlite3_column_text(stmt, 0);
unitSettings.unitClass[] = sqlite3_column_int(stmt, 1);
I was thinking about creating functions to convert from unsigned char* (result of sqlite3_column_text) to short array and int to short array. Is this the way to go about it? Or is there a proper way to cast these results on the fly?
Also, was thinking of making the structs match the sqlite3 table types for easy copying and then at the end, have a function to run through each structs members and convert it into an array of shorts at the end.
EDIT: I just read about unions within structs and I think this would be exactly what I need:
typedef struct
{
union
{
unsigned char* unitSN;
short unitSNArr[4];
};
union
{
int unitClass;
short unitClassArr[1];
};
}UnitSettings;
It says that now they both look at the same piece of memory but can read it in different ways, which is what I want. This would be much easier than any kind of converting right?
sqlite will not provide these conversions for you automatically. You'd have to do the conversions yourself.
I would just use plain text access, and then write free functions to translate that into shorts. Something like this, though I'm not really sure what interface would make the most sense for your access pattern.
void read_short(const char* data, size_t index, short& val) {
val = *(reinterpret_cast<const short*>(&data[index*2]));
}
Maybe your use case already has them in arrays of shorts or something? I'd probably still only do one short per field if that's actually how you use the data.
Personally, I would just put them into the database as integers if you can help it. You'd have to write special tools just to look at the database values, which isn't exactly friendly for maintenance.
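For what it's worth, conversion functions of the kind the question describes could be sketched as follows. pack_text and pack_int are hypothetical names, and the register layout (two characters per 16-bit register with the first character in the high byte, and the high word of an integer first) is an assumption; the actual order depends on what the Modbus master expects. Note that a 32-bit int needs two registers, not one.

```cpp
#include <cstddef>
#include <cstdint>

// Pack a NUL-terminated string (e.g. the result of sqlite3_column_text) into
// 16-bit registers, two characters per register, padding with zeros.
inline void pack_text(const unsigned char* text, std::uint16_t* regs, std::size_t nregs) {
    bool ended = (text == nullptr);
    for (std::size_t i = 0; i < nregs; ++i) {
        std::uint16_t hi = 0, lo = 0;
        if (!ended) { hi = text[2 * i];     if (hi == 0) ended = true; }
        if (!ended) { lo = text[2 * i + 1]; if (lo == 0) ended = true; }
        regs[i] = static_cast<std::uint16_t>((hi << 8) | lo);
    }
}

// Split a 32-bit integer (e.g. the result of sqlite3_column_int) into two registers.
inline void pack_int(std::int32_t value, std::uint16_t regs[2]) {
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    regs[0] = static_cast<std::uint16_t>(u >> 16);      // high word first (assumed)
    regs[1] = static_cast<std::uint16_t>(u & 0xFFFFu);  // low word
}
```

Unlike the union idea, this copies the characters themselves; a union of a pointer and a short array would only expose the bytes of the pointer, not the text it points to.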
I need to send some information on a VxWorks message queue. The information to be sent is decided at runtime and may be of different data types. I am using a structure for this -
struct structData
{
char m_chType; // variable to indicate the data type - long, float or string
long m_lData; // variable to hold long value
float m_fData; // variable to hold float value
string m_strData; // variable to hold string value
};
I am currently sending an array of structData over the message queue.
structData arrStruct[MAX_SIZE];
The problem here is that only one variable in the structure is useful at a time; the other two are useless. The message queue is therefore unnecessarily overloaded.
I can't use unions because the datatype and the value are required.
I tried using templates, but that doesn't solve the problem. I can only send an array of structures of one datatype at a time.
template <typename T>
struct structData
{
char m_chType;
T m_Data;
};
structData<int> arrStruct[MAX_SIZE];
Is there a standard way to hold such information?
I don't see why you cannot use a union. This is the standard way:
struct structData
{
char m_chType; // variable to indicate the data type - long, float or string
union
{
long m_lData; // variable to hold long value
float m_fData; // variable to hold float value
char *m_strData; // variable to hold string value
};
};
Normally then, you switch on the data type, and then access on the field which is valid for that type.
Note that you cannot put a string into a union, because the string type is a non-POD type. I have changed it to use a pointer, which could be a C zero-terminated string. You must then consider the possibility of allocating and deleting the string data as necessary.
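A short sketch of the "switch on the data type" step. The tag values 'l', 'f' and 's' and the describe function are assumptions made for this example; the struct itself mirrors the one above.

```cpp
#include <string>

struct structData {
    char m_chType;              // 'l' = long, 'f' = float, 's' = string (hypothetical tags)
    union {
        long m_lData;
        float m_fData;
        const char* m_strData;  // must point at storage that outlives the struct
    };
};

// Only the union member that matches the tag is ever read.
inline std::string describe(const structData& d) {
    switch (d.m_chType) {
        case 'l': return "long: " + std::to_string(d.m_lData);
        case 'f': return "float: " + std::to_string(d.m_fData);
        case 's': return std::string("string: ") + d.m_strData;
        default:  return "unknown";
    }
}
```

The type tag stays outside the union, so both the datatype and the value remain available while the three payloads share the same storage.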
You can use boost::variant for this.
There are many ways to handle different datatypes. Besides the union solution you can use a generic struct like :
typedef struct
{
char m_type;
void* m_data;
}
structData;
This way you know the type and you can cast the void* pointer into the right type.
This is like the union solution a more C than C++ way of doing things.
The C++ way would be something using inheritance. You define a base "Data" class and use inheritance to specialize the data. You can use RTTI to check for type if needed.
But as you stated, you need to send your data over a VxWorks queue. I'm no specialist, but if those queues are real-time OS queues, none of the previous solutions is a good one. Your problem is that your data has variable length (the string in particular) and you need to send it through a queue that probably asks for something like a fixed-length data struct and the actual length of that struct.
In my experience, the right way to handle this is to serialize the data into something like a buffer class/struct. This way you can optimize the size (you only serialize what you need) and you can send your buffer through your queue.
To serialize you can use something like 1 byte for type then data. To handle variable length data, you can use 1 to n bytes to encode data length, so you can deserialize the data.
For a string :
1 byte to code the type (0x01 = string, ...)
2 bytes to code the string length (if you need less than 65536 bytes)
n data bytes
So the string "Hello" will be serialized as (type, then length with the high byte first, then the characters):
0x01 0x00 0x05 0x48 0x65 0x6c 0x6c 0x6f
You need a buffer class and a serializer/deserializer class. Then you do something like :
serialize data
send serialized data into queue
and on the other side
receive data
deserialize data
I hope it helps and that I have not misunderstood your problem. The serialization part is overkill if the VxWorks queues are not what I think ...
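The string encoding described above could be sketched like this. serialize_string, deserialize_string and kTypeString are names chosen for the example, and storing the length with the high byte first is an assumption, since the scheme does not fix a byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::uint8_t kTypeString = 0x01;  // type code for strings

// [1 byte type][2 bytes length, high byte first][n data bytes]
inline std::vector<std::uint8_t> serialize_string(const std::string& s) {
    std::vector<std::uint8_t> out;
    const std::uint16_t len = static_cast<std::uint16_t>(s.size());  // assumes fewer than 65536 bytes
    out.push_back(kTypeString);
    out.push_back(static_cast<std::uint8_t>(len >> 8));
    out.push_back(static_cast<std::uint8_t>(len & 0xFF));
    out.insert(out.end(), s.begin(), s.end());
    return out;
}

inline bool deserialize_string(const std::vector<std::uint8_t>& in, std::string& s) {
    if (in.size() < 3 || in[0] != kTypeString) return false;
    const std::size_t len = (static_cast<std::size_t>(in[1]) << 8) | in[2];
    if (in.size() - 3 < len) return false;  // truncated message
    s.assign(in.begin() + 3, in.begin() + 3 + len);
    return true;
}
```

The sender would pass the vector's data pointer and size to the queue, and the receiver would rebuild its own std::string from the bytes, so no pointer ever crosses the queue.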
Be very careful with the "string" member in the message queue. Under the hood, it's a pointer to some malloc'd memory that contains the actual string characters, so you're only passing the 'pointer' in your queue, not the real string.
The receiving process may potentially not be able to access the string memory, or -worse - it may have already been destroyed by the time your message reader tries to get it.
+1 for 1800 and Ylisar.
Using a union for this kind of thing is probably the way to go. But, as others pointed out, it has several drawbacks:
inherently error prone.
not safely extensible.
can't handle members with constructors (although you can use pointers).
So unless you can build a nice wrapper, going the boost::variant way is probably safer.
This is a bit offtopic, but this issue is one of the reasons why languages of the ML family have such a strong appeal (at least for me). For example, your issue is elegantly solved in OCaml with:
(*
* LData, FData and StrData are constructors for this sum type,
* they can have any number of arguments
*)
type structData = LData of int | FData of float | StrData of string
(*
* the compiler automatically infers the function signature
* and checks the match exhaustiveness.
*)
let print x =
match x with
| LData(i) -> Printf.printf "%d\n" i
| FData(f) -> Printf.printf "%f\n" f
| StrData(s) -> Printf.printf "%s\n" s
Try QVariant in Qt