How to interpret binary data in C++? - c++

I am sending and receiving binary data to/from a device in packets (64 byte). The data has a specific format, parts of which vary with different request / response.
Now I am designing an interpreter for the received data. Simply reading the data by positions is OK, but doesn't look that cool when I have a dozen different response formats. I am currently thinking about creating a few structs for that purpose, but I don't know how will it go with padding.
Maybe there's a better way?
Related:
Safe, efficient way to access unaligned data in a network packet from C

You need to use structs and or unions. You'll need to make sure your data is properly packed on both sides of the connection and you may want to translate to and from network byte order on each end if there is any chance that either side of the connection could be running with a different endianess.
As an example:
#pragma pack(push) /* push current alignment to stack */
#pragma pack(1) /* set alignment to 1 byte boundary */
typedef struct {
unsigned int packetID; // identifies packet in one direction
unsigned int data_length;
char receipt_flag; // indicates to ack packet or keep sending packet till acked
char data[]; // this is typically ascii string data w/ \n terminated fields but could also be binary
} tPacketBuffer ;
#pragma pack(pop) /* restore original alignment from stack */
and then when assigning:
packetBuffer.packetID = htonl(123456);
and then when receiving:
packetBuffer.packetID = ntohl(packetBuffer.packetID);
Here are some discussions of Endianness and Alignment and Structure Packing
If you don't pack the structure it'll end up aligned to word boundaries and the internal layout of the structure and it's size will be incorrect.

I've done this innumerable times before: it's a very common scenario. There's a number of things which I virtually always do.
Don't worry too much about making it the most efficient thing available.
If we do wind up spending a lot of time packing and unpacking packets, then we can always change it to be more efficient. Whilst I've not encountered a case where I've had to as yet, I've not been implementing network routers!
Whilst using structs/unions is the most efficient approach in term of runtime, it comes with a number of complications: convincing your compiler to pack the structs/unions to match the octet structure of the packets you need, work to avoid alignment and endianness issues, and a lack of safety since there is no or little opportunity to do sanity checks on debug builds.
I often wind up with an architecture including the following kinds of things:
A packet base class. Any common data fields are accessible (but not modifiable). If the data isn't stored in a packed format, then there's a virtual function which will produce a packed packet.
A number of presentation classes for specific packet types, derived from common packet type. If we're using a packing function, then each presentation class must implement it.
Anything which can be inferred from the specific type of the presentation class (i.e. a packet type id from a common data field), is dealt with as part of initialisation and is otherwise unmodifiable.
Each presentation class can be constructed from an unpacked packet, or will gracefully fail if the packet data is invalid for the that type. This can then be wrapped up in a factory for convenience.
If we don't have RTTI available, we can get "poor-man's RTTI" using the packet id to determine which specific presentation class an object really is.
In all of this, it's possible (even if just for debug builds) to verify that each field which is modifiable is being set to a sane value. Whilst it might seem like a lot of work, it makes it very difficult to have an invalidly formatted packet, a pre-packed packets contents can be easilly checked by eye using a debugger (since it's all in normal platform-native format variables).
If we do have to implement a more efficient storage scheme, that too can be wrapped in this abstraction with little additional performance cost.

It's hard to say what the best solution is without knowing the exact format(s) of the data. Have you considered using unions?

I agree with Wuggy. You can also use code generation to do this. Use a simple data-definition file to define all your packet types, then run a python script over it to generate prototype structures and serialiation/unserialization functions for each one.

This is an "out-of-the-box" solution, but I'd suggest to take a look at the Python construct library.
Construct is a python library for
parsing and building of data
structures (binary or textual). It is
based on the concept of defining data
structures in a declarative manner,
rather than procedural code: more
complex constructs are composed of a
hierarchy of simpler ones. It's the
first library that makes parsing fun,
instead of the usual headache it is
today.
construct is very robust and powerful, and just reading the tutorial will help you understand the problem better. The author also has plans for auto-generating C code from definitions, so it's definitely worth the effort to read about.

Related

Looking for a concept related to "either or" in a data structure (almost like std::variant)

I'm modernizing the code that reads and writes to a custom binary file format now.
I'm allowed to use C++17 and have already modernized large parts of the code base.
There are mainly two problems at hand.
binary selectors (my own name)
cased selectors (my own name as well)
For #1 it is as follows:
Given that 1 bit is set in a binary string. You either (read/write) two completely different structs.
For example, if bit 17 is set to true, it means bits 18+ should be streamed with Struct1.
But if bit 17 is false, bits 18+ should be streamed with Struct2.
Struct1 and Struct2 are completely different with minimal overlap.
For #2 it is basically the same but as follows:
Given that x bits in the bit stream are set. You have a potential pool of completely different structs. The number of structs is allowed to be between [0, 2**x] (inclusive range).
For instance, in one case you might have 3 bits and 5 structs.
But in another case, you might have 3 bits and 8 structs.
Again the overlap between the structs is minimal.
I'm currently using std::variant for this.
For case #1, it would be just two structs std::variant<Struct1, Struct2>
For case #2, it would be just a flat list of the structs again using std::variant.
The selector I use is naturally the index in the variant, but it needs to be remapped for a different bit pattern that actually needs to be written/read to/from the format.
Have any of you used or encountered some better strategies for these cases?
Is there a generally known approach to solve this in a much better way?
Is there a generally known approach to solve this in a much better way?
Nope, it's highly specific.
Have any of you used or encountered some better strategies for these cases?
The bit patterns should be encoded in the type, somehow. Almost all the (de)serialization can be generic so long as the required information is stored somewhere.
For example,
template <uint8_t BitPattern, typename T>
struct IdentifiedVariant;
// ...
using Field1 = std::variant< IdentifiedVariant<0x01, Field1a>,
IdentifiedVariant<0x02, Field1b> >;
I've absolutely used types like this in the past to automate all the boilerplate, but the details are extremely specific to the format and rarely reusable.
Note that even though you can't overlay your variant type on a buffer, there's no need for (de)serialization to proceed bit-by-bit. There's hardly any speed penalty so long as the data is already read into a buffer - if you really need to go full zero-copy, you can just have your FieldNx types keep a pointer into the buffer and decode fields on demand.
If you want your data to be bit-continuous you can't use std::variant You will need to use std::bitset or managing the memory completely manually to do that. But it isn't practical to do so because your structs will not be byte-aligned so you will need to do every read/write manually bit by bit. This will reduce speed greatly, so I only recommend this way if you want to save every bit of memory even at the cost of speed. And at this storage it will be hard to find the nth element you will need to iterate.
std::variant<T1,T2> will waste a bit of space because 1) it will always use enough space for storing the the bigger data, but using that over bit-manipulation will increase the speed and the readability of the code. (And will be easier to write)

OMNeT++ cPacket as std::bitset to apply Reed-Solomon encoding

Having a packet
cPacket *pk
how can I obtain the bit representation of it? For example, in the form of
std::bitset<pk->getBitLength()> pk_bits;
My final goal is to apply an encoding scheme to the packet, i.e. Reed-Solomon encoding.
As #rcgldr commented, a simple cPacket in itself does not hold any data, at least not in the sense of how real packets do. And it's not necessary in most models, because they operate on a higher, more abstract level, which makes working with them easier, and running them faster.
The information that travels between the nodes of the simulation is what you put in the fields of your messages (preferably custom made using the message compiler of OMNeT++, from .msg files).
This is, however, completely independent of the bitLength/byteLength properties of the cPacket class, which is just a number that can be set to any value for any message.
You can, of course, choose to model a realistic protocol by adding fields to your message that correspond to a real(istic) network protocol header, like TCP or IP, or even something you just made up. But this still doesn't provide any (reliable) byte-sequence-like access to the contents, because it is not always trivial how the individual fields should be serialized into simple octets.
To achieve this, INET for example has separate *[De]Serializer classes for a number of its custom message types. You can do the same with yours if you want.
A simpler solution would be to represent any payload in the packet by adding an std::vector<unsigned char> or even an std::bitset if you prefer that. And just treat that part separately from the easily accessible fields, applying any encoding on its content.
And finally, just like with any question like "how to add an encryption library to a simulation and use it to transform packets": Are you sure that adding a real byte-by-byte encoder/serializer/etc. to a simulation is the right choice to achieve what you're trying to do? I mean, it could be, and it's possible, but there might be better/simpler/faster ways. In terms of modeling.

Is it sufficient that a class is POD to read it from binary?

For a client/server application I need to send and recive c++ objects. I don't need the corresponding classes to do anything fancy but want to have maximal performance (regarding network traffic and computation). So I though of simply transferring them as binary strings. Basicly I want to be able to do the following
//Create original object
MyClass oldObj();
//save to char array
char* save = new char[sizeof(MyClass)];
memcpy(save, &oldObj, sizeof(MyClass));
//Somewhere of course there would be the transfer to the client/server
//Read back from char array
MyClass newObj();
memcpy(&newObj, save, sizeof(MyClass));
My question: What does my class need to fullfill in order for this to work?
Naturaly Pointers as members won't work when transferring to another application. But is it sufficient that my Class is considered POD (in c++03 and/or c++11) and does not have any pointers or equivalents (like STL containers) as members?
Both machines need to:
Have the same Endianess (for int)
The same floating point representation (double)
The same size for all types.
The Same compiler
The Same flags used to build the application.
Pointers dont transfer well.
BUT the network is going to be the slowest part here.
The cost of serializing most objects is going to be irrelevant compared to the cost of transfer. Of course the bigger your object the higher the cost but it takes a while before it is significant to make a dent.
The higher cost of maintenance is also you should factor in.
What does my class need to fulfill in order for this to work?
It must not have pointer members, you already mention that.
It must not have members whose size is implementation defined, like int.
It must not have integers members, due to different endianness.
It must not have floating point members, due to different representations.
...and probably more!
Basically, you cannot do that except for very particularly constrained scenarios. You will have to pick a protocol and make your data conform to it to send it through the network safely.
Is not a big deal since performance will be bounded by network speed and latency, not by the operations needed on your values to conform to the protocol.
How much control do you have over the hardware/OS that this runs on? Are you writing code that is super-portable, or will it ONLY run on 32- and 64-bit x86 Windows [for example]?
To be fully "super-portable", as explained above, you can't have any form of "implementation defined" sized objects (such as int that can be 16, 18, 32, 36 or 64 bits, for example). Such items need to be stored as bytes of defined number and order to make sure it will not get cut off/re-ordered when transferring. Floating point can be even worse...
A lot of "super-portable" applications store their data as text. It's a little slower, but it makes it trivially portable, since text is just a stream of bytes whatever architecture you run it on, and it's ordered the same way whichever machine you use (as long as you stick to 0-9, A-Za-z, !?<>,.()*& and a few other characters - and beware of EBCDIC encoded machines, but they tend to handle "ascii-to-ebcdic" conversion). The other end just need to conver the text back to strings/integers/floats/doubles, whatever you need. A conversion from integer to string of digits takes one divide per digit (using hex or base-36 makes that a bit better, but makes it much less human readable - sometimes a good thing, sometimes a bad thing). This is clearly slower than storing 4 bytes. THe other drawback is that it's (depending on values used) often longer to store a number in text than as binary. So your network packets will be a little larger. This will have a greater impact than the conversion, as processors can do a lot of math in the time it takes to send 1KB with a 10Gbit network card. And of course, you need a few extra bytes (spaces, commas, newlines or whatever it may be) so that you can tell the difference between one number 123456 and three 12, 34, 56. [Of course, no need to use ", " between each]. And you need some code to parse the whole thing at the other end once it has arrived.
If you know that your system(s) always have 32-bit integers and IEEE-754 floating point numbers [these are extremely common!], then you may well get away with just worrying about byte order. And if you know that it's always going to be on "x86" or some such, you don't have to worry about byte order either. But you now may have to modify your code when you decide that "running my code on an iphone would be a good idea". Of course, you could leave that to the iphone side of things to conform to whatever the rest requires.
Other answers have mentioned how it is possible to use a class for this purpose. Personally, I prefer to use a struct instead. In C++, a struct can have member methods/operators, constructor/destructor, supports inheritance, etc just like a class does. However, a struct has a well-defined and predictable memory layout and can have that layout explicitally aligned via #pragma statements to add/remove the compiler's implicit padding (I have never tried aligning a class before, but I think it is supported). I always use an 8bit-aligned struct for data that has to be exchanged outside of the app's process. For all intents and purposes, in modern compilers, a struct is basically identical to a class, just the default visibility of its members is public instead of private. But I like to keep struct and class separated for different purposes. A struct is just a raw container of data that you can freely manpulate, overwrite in memory, etc. A class is an object whose memory layout and padding is compiler-defined and should not be messed with.

Why Serialization when a class object in memory is already binary (C/C++)?

My guess is that data is scattered in physical memory (even the data of a class object is sequential in virtual memory), so in order to send the data correctly it needs to be reassembled, and to be able to send over the network, one additional step is the transformation of host byte order to network byte order. Is it correct?
Proper serialization can be used to send data to arbitrary systems, that might not work under the same architecture as the source host.
Even an object that only consist of native types can be troublesome sharing between two systems because of the extra padding that might exists in between and after members, among other things. Sharing raw memory dumps of objects between programs compiled for the same architecture but with different compiler versions can also turn into a big hassle. There is no guarantee how variable type T actually is stored in memory.
If you are not working with pointers (references included), and the data is meant to be read by the same binary as it's dumped from, it's usually safe just to dump a raw struct to disk, but when sending data to another host.. drum roll serialization is the way to go.
I've heard developers talking about ntohl / htonl / ntohl / ntohs as methods of serializing/deserializing integers, and when you think about it saying that isn't that far from the truth.
The word "serialization" is often used to describe this "complicated method of storing data in a generic way", but then again; your first programming assignment where you were asked to save information about Dogs to file (hopefully*) made use of serialization, in some way or another.
* "hopefully" meaning that you didn't dump the raw memory representation of your Dog object to disk
Pointers!
If you've allocated memory on the heap you'll just end up with a serialised pointer pointing to an arbitrary area of memory. If you just have a few ints and chars then yes you can just write it out directly to a file, but that then becomes platform dependent because of the byte ordering that you mentioned.
Pointer and data pack(data align)
If you memcpy your object's memory, there is dangerous to copy a wild pointer value instead of it's data. There is another risk, if the sender and receiver have different data pack(data align) method, you will get rubbish after decoding.
Binary representations may be different between different architectures, compilers and even different versions of the same compiler. There's no guarantee that what system A sees as a signed integer will be seen as the same on system B. Byte ordering, word langths, struct padding etc will become hard to debug problems if you don't properly define the protocol or file format for exchanging the data.
Class (when we speak of C++) also includes virtual method pointers - and they must be reconstructed on receiving end.

How to handle changing data structures on program version update?

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.
Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?
Let's take an address book. From program version x to x+1, what if:
a field is deleted (seems simple enough) or added (ok if all can use some new default)?
a string gets longer or shorter? An int goes from 8 to 16 bits of signed / unsigned?
maybe I combine surname/forename, or split name into two fields?
These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.
Obviously I need some hard coded logic to take care of each change.
What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+ 1 followed by x+1 -> x+2?
What if version x+1 is buggy and we need to roll-back to a previous version of the s/w, but have already "upgraded" the data structures?
I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
This is nothing new, so I just wondered how others do it....
I do have some code where a longer string is puzzled together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:
Define your goals - there are two:
new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)
Add version support to release 0 - At least write a version header. Together with keeping (potentially a lot of) old reader code around that can solve the first case primitively. If you don't want to implement case 2, start rejecting new data right now!
If you need only case 1, and and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.
Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
Keep a set of test data from all versions around.
Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).
e.g. don't store integers shorter than the native width. (you might need to do that when you need to store long integer arrays).
add a breaker - store some key that allows you to intentionally make old code display an error message that this new data is incompatible. You can use a string that is part of the error message - then your old version could display an error message it doesn't know about - "you can import this data using the ConvertX tool from our web site" is not great in a localized application but still better than "Ungültiges Format".
Don't serialize structs directly - that's the logical / physical separation. We work with a mix of two, both having their pros and cons. None of these can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence, that should already solve half of your troubles.
(A) a proper serialization mechanism - we use a bianry serializer that allows to start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serializationj code.) Chunks can be nested. That's all you need on the physical side, but needs some sugar-coating for common tasks.
(B) use a different in-memory representation - the in-memory reprentation could basically be a map<id, record> where id woukld likely be an integer, and record could be
empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
and array of records
I initially wrote that so the guys don't ask me for every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognize the problem with the clarity of today...) it could solve
Querying a non existing value will by default return a default/zero initialized value. when you keep that in mind when accessing the data and when adding new data this helps a lot: Imagine version 1 would calculate "foo length" automatically, whereas in version 2 the user can overrride that setting. A value of zero - in the "calculation type" or "length" should mean "calculate automatically", and you are set.
The following are "change" scenarios you can expect:
a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.
For implementing case 2, you also need to consider:
no value may ever be remvoed or replaced by another one. (But in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, other changes of valid range
phew. that was a lot. But it's not as complicated as it seems.
There's a huge concept that the relational database people use.
It's called breaking the architecture into "Logical" and "Physical" layers.
Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.
You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.
You don't need to reinvent SQL to accomplish this.
If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.
Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
Some simple guidelines if you're talking about a structure use as in a C API:
have a structure size field at the start of the struct - this way code using the struct can always ensure they're dealing only with valid data (for example, many of the structures the Windows API uses start with a cbCount field so these APIs can handle calls made by code compiled against old SDKs or even newer SDKs that had added fields
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for dealing with code that uses an older version of the structure, don't remove the field.
it may be wise to include a version number field, but often the count field can be used for that purpose.
Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.
The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.
And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.
What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.
struct address_book
{
unsigned int length; // total length of this struct in bytes
char items[0];
}
where 'items' is a variable length array of a structure that describes its own size and type
struct item
{
unsigned int size; // how long data[] is
unsigned int id; // first name, phone number, picture, ...
unsigned int type; // string, integer, jpeg, ...
char data[0];
}
In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over that data (from item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.
I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate to the new values and then update the in-place db?
For a simplified embedded db I usually don't reference any structs directly, but do put a very light weight API around any parameters. This does allow for you to change the physical structure below the API without impacting the higher level application.
Lately I'm using bencoded data. It's the format that bittorrent uses. Simple, you can easily inspect it visually, so it's easier to debug than binary data and is tightly packed. I borrowed some code from the high quality C++ libtorrent. For your problem it's so simple as checking that the field exist when you read them back. And, for a gzip compressed file it's so simple as doing:
ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();
To read it back:
igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
torrent::Bencode b;
is >> b;
if( b.has_key("url") )
d->url = b["url"].as_string();
} catch(...) {
}
I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.
Embed a version number in the struct or, do as Win32 does and use a size parameter.
if the passed struct is not the latest version then fix up the struct.
About 10 years ago I wrote a similar system to the above for a computer game save game system. I actually stored the class data in a seperate class description file and if i spotted a version number mismatch then I coul run through the class description file, locate the class and then upgrade the binary class based on the description. This, obviously required default values to be filled in on new class member entries. It worked really well and it could be used to auto generate .h and .cpp files as well.
I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.
However if you must use a single struct for this, there are a few things you can do to help make things easier.
1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.
2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.
3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.
4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.
5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).
You may want to take a look at how Boost Serialization library deals with that issue.