Read data with varying formats in C++

I'm creating my first real binary parser (a TIFF reader) and have a question about how to allocate memory. I want to create a struct within my TiffSpec class for the IFD entries. These entries are always 12 bytes, but depending on the type specified in a particular entry, the values at the end could be of different types (or perhaps just an offset to another location in the file). What would be the best way to go about casting this sort of data? The smallest unit of memory I believe I would be dealing with is 1 byte.

In C++ you should use a union.
This is a mechanism by which you can define several overlapping data types, possibly with a common header.
See this article for how to use unions for exactly your problem: a common header with different data underneath.
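For a 12-byte IFD entry (2-byte tag, 2-byte type, 4-byte count, 4-byte value-or-offset), a minimal sketch could look like the following; the struct and member names are illustrative, and you still have to handle the file's byte order separately:
#include <cstdint>

#pragma pack(push, 1)            // keep the struct exactly 12 bytes
struct IfdEntry {
    std::uint16_t tag;           // field identifier
    std::uint16_t type;          // BYTE=1, ASCII=2, SHORT=3, LONG=4, ...
    std::uint32_t count;         // number of values
    union {                      // interpretation depends on `type` and `count`
        std::uint8_t  bytes[4];  // up to four BYTE values stored inline
        std::uint16_t shorts[2]; // up to two SHORT values stored inline
        std::uint32_t long_or_offset; // one LONG, or an offset into the file
    } value;
};
#pragma pack(pop)

static_assert(sizeof(IfdEntry) == 12, "must match the on-disk layout");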

Related

Looking for a concept related to "either or" in a data structure (almost like std::variant)

I'm currently modernizing code that reads and writes a custom binary file format.
I'm allowed to use C++17 and have already modernized large parts of the code base.
There are mainly two problems at hand.
1. binary selectors (my own name)
2. cased selectors (my own name as well)
For #1 it is as follows:
Depending on whether a single bit in the bit stream is set, you read/write one of two completely different structs.
For example, if bit 17 is set, bits 18+ should be streamed as Struct1.
But if bit 17 is clear, bits 18+ should be streamed as Struct2.
Struct1 and Struct2 are completely different, with minimal overlap.
For #2 it is basically the same, but as follows:
Given x selector bits in the bit stream, you have a pool of completely different structs; the number of structs can be anywhere in the inclusive range [0, 2^x].
For instance, in one case you might have 3 bits and 5 structs.
But in another case, you might have 3 bits and 8 structs.
Again, the overlap between the structs is minimal.
I'm currently using std::variant for this.
For case #1, it would be just the two structs: std::variant<Struct1, Struct2>.
For case #2, it would be a flat list of the structs, again using std::variant.
The selector I use is naturally the index into the variant, but it has to be remapped to the bit pattern that is actually written to/read from the format.
Have any of you used or encountered some better strategies for these cases?
Is there a generally known approach to solve this in a much better way?
Is there a generally known approach to solve this in a much better way?
Nope, it's highly specific.
Have any of you used or encountered some better strategies for these cases?
The bit patterns should be encoded in the type, somehow. Almost all the (de)serialization can be generic so long as the required information is stored somewhere.
For example,
template <uint8_t BitPattern, typename T>
struct IdentifiedVariant;
// ...
using Field1 = std::variant< IdentifiedVariant<0x01, Field1a>,
                             IdentifiedVariant<0x02, Field1b> >;
I've absolutely used types like this in the past to automate all the boilerplate, but the details are extremely specific to the format and rarely reusable.
Note that even though you can't overlay your variant type on a buffer, there's no need for (de)serialization to proceed bit-by-bit. There's hardly any speed penalty so long as the data is already read into a buffer - if you really need to go full zero-copy, you can just have your FieldNx types keep a pointer into the buffer and decode fields on demand.
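For concreteness, a minimal sketch of the case-#1 dispatch (Struct1/Struct2 are the question's types; the decoding bodies are stubbed out):
#include <variant>

struct Struct1 { /* fields for the bit-17-set case */ };
struct Struct2 { /* fields for the bit-17-clear case */ };
using Field = std::variant<Struct1, Struct2>;

// Reading: one selector bit picks the alternative.
Field read_field(bool selector_bit) {
    if (selector_bit)
        return Struct1{};              // stream bits 18+ as Struct1 here
    return Struct2{};                  // stream bits 18+ as Struct2 here
}

// Writing inverts the mapping: the held alternative yields the selector bit.
bool selector_of(const Field& f) {
    return std::holds_alternative<Struct1>(f);
}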
If you want your data to be bit-contiguous you can't use std::variant. You would need std::bitset or completely manual memory management. But that isn't practical, because your structs will not be byte-aligned, so every read/write has to be done manually, bit by bit. That reduces speed greatly, so I only recommend this route if you want to save every last bit of memory, even at the cost of speed. With this kind of storage it is also hard to find the nth element: you have to iterate.
std::variant<T1, T2> will waste a bit of space, because it always reserves enough space for the bigger alternative, but using it instead of bit manipulation will improve both the speed and the readability of the code. (And it will be easier to write.)

How can one safely serialize std::basic_istream<char>::pos_type?

In one of my projects I have to cache positional information about certain data chunks found in large files. I've already implemented a small API built around std::basic_istream<char>::pos_type placed in maps.
Now I need to serialize these descriptors into a byte stream and write them to disk for later use (on other *nix machines as well). I have read that this type is platform-dependent while still being more or less a POD type. So my questions are:
Would it be better to save something besides just offsets, e.g. an std::fpos<std::mbstate_t> that keeps the state of the read structure?
How can I safely obtain and restore the offset data from std::basic_istream<char>::pos_type (and other info, if it is needed)?
Thank you in advance.
The structure of std::fpos<mbstate_t> is unspecified, and there may be non-trivial state in the mbstate_t. You certainly can't portably serialize these objects. You can, however, obtain a value of the offset type (std::streamoff), which is an integer type whose value can be serialized.
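A minimal sketch of that round trip, assuming a plain byte stream with no multibyte shift state worth preserving (function names are mine; you still have to pin down the width and byte order of the integer on disk):
#include <cstdint>
#include <istream>

std::int64_t save_pos(std::istream::pos_type p) {
    std::streamoff off = p;            // fpos converts to its offset type
    return static_cast<std::int64_t>(off);
}

std::istream::pos_type load_pos(std::int64_t raw) {
    return std::istream::pos_type(static_cast<std::streamoff>(raw));
}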

How do you save objects with variable length members to a binary file usefully?

I'm working with c++ and I am writing a budget program (I'm aware many are available--it's just a learning project).
I want to save what I call a book object, which contains other objects such as 'pages'. Pages in turn contain cashflows and entries. The issue is that there can be any number of entries or cashflows.
I have found a lot of information on saving data to text files but that is not what I want to do.
I have tried looking into the boost library, as I've been told serialization might be the solution to this problem. I'm not entirely sure which functions in boost would help, or even what the proper way to use boost is.
Most examples of binary files that I have seen involve objects with fixed-size members. For example, a point might contain an x value and a y value that are both doubles. This will always be the case, so it is simple to just use sizeof(Point).
So, I'm looking either for direct answers to this question or for useful links to information on how to solve my problem. But please make sure your links are specific to the question.
I've also posted the same question on cplusplus
In general, there are two methods to store variable length records:
Store a size integer first, followed by the data.
Store the data, then append a sentinel character (or value) at the end.
C-style strings use the second option.
For option one, the integer holds the size of the data that follows.
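A minimal sketch of option one for a string-like member (names are mine; the byte order of the length prefix still needs to be pinned down for cross-platform files):
#include <cstdint>
#include <fstream>
#include <string>

void write_record(std::ofstream& out, const std::string& data) {
    std::uint32_t n = static_cast<std::uint32_t>(data.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);  // size first
    out.write(data.data(), n);                               // then the bytes
}

std::string read_record(std::ifstream& in) {
    std::uint32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::string data(n, '\0');
    in.read(&data[0], n);    // fill the pre-sized buffer
    return data;
}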
Optional Fields
If you're considering a relational-database design for optional fields, you would have one table with the known or fixed records and another table containing the optional fields keyed by record ID.
A simpler route may be something similar to XML: field labels.
Split your object into two sections: static fields and optional fields.
The static field section would be followed by the optional field section. The optional field section contains the field name followed by the field data: read in the field name, then the value.
I suggest you review your design to see if optional fields can be eliminated. Also, for complex fields, have them read in their own data.
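A sketch of that label/value scheme, reusing the length-prefix idea from above (names are mine; an empty label could mark the end of the section):
#include <cstdint>
#include <fstream>
#include <string>

static void put(std::ofstream& out, const std::string& s) {
    std::uint32_t n = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(s.data(), n);
}

// One optional field = label, then value, each length-prefixed.
void write_optional_field(std::ofstream& out,
                          const std::string& label,
                          const std::string& value) {
    put(out, label);
    put(out, value);
}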
Storing Binary Data
If the data is shared between platforms, consider using ASCII or textual representation.
Read up on endianness and also on bit sizes. For example, one platform could store its binary representation least-significant byte first and use 32 bits (4 bytes). A receiving platform that is 64-bit and most-significant-byte-first would have problems reading the data directly and would need to convert it, losing any benefit of binary storage.
Similarly, floating point doesn't fare well in binary either; there is also a loss of precision when converting between floating-point formats.
When using optional fields in binary, one would use a sentinel byte or number for the field ID rather than a textual name.
Also, data in textual format is much easier to debug than data in binary format.
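If you do stay binary, one common approach is to pin the on-disk byte order explicitly rather than dumping native integers; a sketch (little-endian chosen arbitrarily, function names are mine):
#include <cstdint>
#include <istream>
#include <ostream>

void write_u32_le(std::ostream& out, std::uint32_t v) {
    const char b[4] = {
        static_cast<char>(v & 0xFF),
        static_cast<char>((v >> 8) & 0xFF),
        static_cast<char>((v >> 16) & 0xFF),
        static_cast<char>((v >> 24) & 0xFF),
    };
    out.write(b, 4);   // same four bytes regardless of host endianness
}

std::uint32_t read_u32_le(std::istream& in) {
    unsigned char b[4];
    in.read(reinterpret_cast<char*>(b), 4);
    return std::uint32_t(b[0]) | (std::uint32_t(b[1]) << 8)
         | (std::uint32_t(b[2]) << 16) | (std::uint32_t(b[3]) << 24);
}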
Consider using a Database
See At what point is it worth using a database?
See the boost::serialization documentation.
boost::serialization handles user-written classes as well as STL containers: std::deque, std::list, etc.
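A minimal sketch of the pattern applied to the question's page/entry hierarchy (the class members here are invented for illustration):
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <fstream>
#include <vector>

struct Entry {
    double amount = 0.0;
    template <class Archive>
    void serialize(Archive& ar, const unsigned /*version*/) {
        ar & amount;                 // same function serves save and load
    }
};

struct Page {
    std::vector<Entry> entries;      // any number of entries
    template <class Archive>
    void serialize(Archive& ar, const unsigned /*version*/) {
        ar & entries;                // the library records the size for you
    }
};

int main() {
    Page p;
    std::ofstream ofs("page.bin", std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << p;                         // reading mirrors this with binary_iarchive
}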

Why Serialization when a class object in memory is already binary (C/C++)?

My guess is that the data is scattered in physical memory (even if a class object's data is sequential in virtual memory), so in order to send the data correctly it needs to be reassembled, and, to be able to send it over the network, one additional step is the transformation from host byte order to network byte order. Is that correct?
Proper serialization can be used to send data to arbitrary systems, that might not work under the same architecture as the source host.
Even an object that consists only of native types can be troublesome to share between two systems, because of the extra padding that might exist between and after members, among other things. Sharing raw memory dumps of objects between programs compiled for the same architecture but with different compiler versions can also turn into a big hassle. There is no guarantee of how a variable of type T is actually stored in memory.
If you are not working with pointers (references included), and the data is meant to be read by the same binary that dumped it, it's usually safe just to dump a raw struct to disk. But when sending data to another host... drum roll... serialization is the way to go.
I've heard developers talk about ntohl / htonl / ntohs / htons as methods of serializing/deserializing integers, and when you think about it, saying that isn't far from the truth.
The word "serialization" is often used to describe this "complicated method of storing data in a generic way", but then again, your first programming assignment, where you were asked to save information about Dogs to a file, (hopefully*) made use of serialization in some way or another.
* "hopefully" meaning that you didn't dump the raw memory representation of your Dog object to disk
Pointers!
If you've allocated memory on the heap, you'll just end up with a serialized pointer pointing to an arbitrary area of memory. If you just have a few ints and chars then yes, you can write them out directly to a file, but that then becomes platform-dependent because of the byte ordering you mentioned.
Pointers and data packing (alignment)
If you memcpy your object's memory, there is a danger of copying a wild pointer value instead of the data it points to. There is another risk: if the sender and receiver use different data packing (alignment) rules, you will get rubbish after decoding.
Binary representations may differ between architectures, compilers, and even different versions of the same compiler. There's no guarantee that what system A sees as a signed integer will be seen the same way on system B. Byte ordering, word lengths, struct padding etc. will become hard-to-debug problems if you don't properly define the protocol or file format for exchanging the data.
A class (when we speak of C++) may also include virtual method pointers, and those must be reconstructed on the receiving end.

When writing struct to a file too many bytes are written

I'm trying to write a simple TGA image file saver as a learning exercise in C++. I'm basing my code on an example TGA loader that declares a struct for the header and then uses fread() to load the entire header in one go.
My program isn't working right now, and it seems that two extra bytes are being written to the file. I printed the sizeof my struct and it's two bytes too large (20 instead of the correct 18). After a little reading I think the problem is related to data alignment and padding (I'm not very familiar with how structs are stored).
My question is: what's a good solution for this? I guess I could write the struct's components byte by byte instead of using fwrite() to write the entire struct at once, which is what I'm doing now. I assumed that if it worked when loading the header, it would also work when writing it. Was my assumption incorrect?
Compilers are allowed to, and frequently do, insert padding bytes into structures so that fields are aligned on appropriate memory addresses.
The simplest solution is to instruct the compiler to "pack" the structure, which means to not insert any padding bytes. However this will make data access to the structure slower, and the means of doing it are compiler-dependent. If you want to be portable and efficient, the only way to go is writing the fields individually.
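A sketch of the field-by-field approach for the 18-byte TGA header (TGA stores multi-byte values least-significant byte first; helper names are mine):
#include <cstdint>
#include <fstream>

void write_u8(std::ofstream& out, std::uint8_t v) {
    out.put(static_cast<char>(v));
}

void write_u16_le(std::ofstream& out, std::uint16_t v) {
    out.put(static_cast<char>(v & 0xFF));        // low byte first
    out.put(static_cast<char>((v >> 8) & 0xFF));
}

void write_tga_header(std::ofstream& out, std::uint16_t w, std::uint16_t h) {
    write_u8(out, 0);             // ID length
    write_u8(out, 0);             // color map type: none
    write_u8(out, 2);             // image type: uncompressed true color
    for (int i = 0; i < 5; ++i)   // color map specification (unused)
        write_u8(out, 0);
    write_u16_le(out, 0);         // x origin
    write_u16_le(out, 0);         // y origin
    write_u16_le(out, w);         // width
    write_u16_le(out, h);         // height
    write_u8(out, 24);            // bits per pixel
    write_u8(out, 0);             // image descriptor
}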
It did not work even when loading it. You have a couple of options:
Use compiler-specific directives (#pragma pack, __attribute__((packed)), etc.) to force your structure to be 18 bytes; see the sketch after this list.
Write the code more portably by using offsets and pointers to get/set the buffer fields.
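A sketch of the first option; #pragma pack(push, 1) is understood by MSVC, GCC, and Clang (field names are mine, following the TGA layout):
#include <cstdint>

#pragma pack(push, 1)   // no padding between members
struct TgaHeader {
    std::uint8_t  id_length;
    std::uint8_t  color_map_type;
    std::uint8_t  image_type;
    std::uint8_t  color_map_spec[5];
    std::uint16_t x_origin, y_origin;
    std::uint16_t width, height;
    std::uint8_t  pixel_depth;
    std::uint8_t  image_descriptor;
};
#pragma pack(pop)

static_assert(sizeof(TgaHeader) == 18, "header must be exactly 18 bytes");
Note that packing only fixes the padding; the multi-byte fields are still written in host byte order, so a big-endian host would still need conversion.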
Elements of a struct are generally aligned on their natural boundaries (often 4 bytes),
so if you have shorts or chars in your struct, the struct can be larger than the sum of the individual element sizes.