Efficient lookup of a buffer with stack of data modifications applied - c++

I am trying to write a C++11 library as part of a wider project that implements a stack of changes (modification, insertion and deletion) implemented on top of an original buffer. Then, the aim is to be able to quickly look "through" the changes and get the modified data out.
My current approach is:
Maintain an ordered list of changes, ordered by offset of the start of the change
Also maintain a stack of the same changes, so they can be rolled back in order
New changes are pushed onto the stack and inserted into the list at the right place
The changes-by-offset list may be modified if the change interacts with others
For example, a modification of bytes 5-10 invalidates the start of an earlier modification from 8-12
Also, insertion or deletion changes will change the apparent offset of data occurring after them (deleting bytes 5-10 means that what used to be byte 20 is now found at 15)
To find the modified data, you can look though the list for the change that applies (and the offset within that change that applies - another change might have invalidated some of it), or find the right offset in the original data if no change touched that offset
The aim here is to make the lookup fast - adding a change might take some effort to mess with the list, but lookups later, which will outnumber the modifications greatly, in an ordered list should be pretty straightforward.
Also you don't need to continuously copy data - each change's data is kept with it, and the original data is untouched
Undo is then implemented by popping the last change off the stack and rolling back any changes made to it by this change's addition.
This seems to be quite a difficult task - there are a lot of things to take care of and I am quickly piling up complex code!
I feel sure that this must be problem that has been dealt with in other software, but looking around various hex editors and so on hasn't pointed me to a useful implementation. Is there a name for this problem ("data undo stack" and friends hasn't got me very far!), or a library that can be used, even as a reference, for this kind of thing?

I believe the most common approach (one I have used successfully in the past) is to simply store the original state and then put each change operation (what's being done + arguments) on the undo stack. Then, to get to a particular prior state you start from the original and apply all changes except the ones you want undone.
This is a lot easier to implement than trying to identify what parts of the data changed, and it works well unless the operations themselves are very time-consuming (and therefore slow to "replay" onto the original state).

I would look at persistent data structures, such as https://en.wikipedia.org/wiki/Persistent_data_structure and http://www.toves.org/books/persist/#s2 - or websearch on terms from these. I think you could do this with a persistent tree whose leaves carry short strings.

Related

Why must files be entirely rewritten when there are arrays in C++?

From what I've read on the internet and in books, when editing a text file, or any file for that matter, the file must be entirely rewritten; You can't just insert data into a file and save it.
If this is so, how can there be arrays in programming languages? In C++ I can explicitly modify certain values in the middle of arrays. To me, this seems like a demonstration of the modification of one or two bytes in the middle of a group of bytes.
The only two possible solutions I have thought of are
There is some funky stuff going on behind-the-scenes in C++, so it seems like only one or two bytes are being modified, but the array is actually entirely rewritten.
Then, after thinking about it, and especially after typing it out, I realized the the aformentioned solution seems really really dumb and totally not true, because there are things like addresses and pointers and the performance sounds awful. So I thought that maybe files are not entirely rewritten; only the everything after the first point in new data was inserted is rewritten. This seems much more plausible to me, and makes sense.
What is the difference between writing data to a file and writing data to an array?
You can change the values of particular locations in an array without needing to re-write the whole thing. However, you cannot insert new values into the middle of an array without shifting everything following the new values in order to make room.
Similarly, you can overwrite sections of a file without copying it (although the underlying storage technology may need to re-write an entire storage block in order to change a single byte within that block), but you can't insert new data without somehow making room for it. Text editors (and editors for more complicated file formats) are designed for random-access modifications that do not preserve length, so they will typically re-write the entire file regardless of what changed rather than trying to optimize for length-preserving edits.

Defending classes with 'magic numbers'

A few months ago I read a book on security practices, and it suggested the following method for protecting our classes from overwriting with e.g. overflows etc.:
first define a magic number and a fixed-size array (can be a simple integer too)
use that array containing the magic number, and place one at the top, and one at the bottom of our class
a function compares these numbers, and if they are equal, and equal to the static variable, the class is ok, return true, else it is corrupt, and return false.
place this function at the start of every other class method, so this will check the validity of the class on function calls
it is important to place this array at the start and the end of the class
At least this is as I remember it. I'm coding a file encryptor for learning purposes, and I'm trying to make this code exception safe.
So, in which scenarios is it useful, and when should I use this method, or is this something totally useless to count on? Does it depend on the compiler or OS?
PS: I forgot the name of the book mentioned in this post, so I cannot check it again, if anyone of you know which one was it please tell me.
What you're describing sounds a Canary, but within your program, as opposed to the compiler. This is usually on by default when using gcc or g++ (plus a few other buffer overflow countermeasures).
If you're doing mutable operations on your class and you want to make sure you don't have side effects, I don't know if having a magic number is very useful. Why rely on a homebrew validity check when there are mothods out there that are more likely to be successful?
Checksums: I think it'd be more useful for you to hash the unencrypted text and add that to the end of the encrypted file. When decrypting, remove the hash and compare the hash(decrypted text) with what it should be.
I think most, if not all, widely used encryptors/decryptors store some sort of checksum in order to verify that the data has not changed.
This type of a canary will partially protect you against a very specific type of overflow attack. You can make it a little more robust by randomizing the canary value every time you run the program.
If you're worried about buffer overflow attacks (and you should be if you are ever parsing user input), then go ahead and do this. It probably doesn't cost too much in speed to check your canaries every time. There will always be other ways to attack your program, and there might even be careful buffer overflow attacks that get around your canary, but it's a cheap measure to take so it might be worth adding to your classes.

how deal with atomicity situation

Hi imagine I have such code:
0. void someFunction()
1. {
2. ...
3. if(x>5)
4. doSmth();
5.
6. writeDataToCard(handle, data1);
7.
8. writeDataToCard(handle, data2);
9.
10. incrementDataOnCard(handle, data);
11. }
The thing is following. If step 6 & 8 gets executed, and then someone say removes the card - then operation 10 will not be completed successfully. But this will be a bug in my system. Meaning if 6 & 8 are executed then 10 MUST also be executed. How to deal with such situations?
Quick Summary: What I mean is say after step 8 someone may remove my physical card, which means that step 10 will never be reached, and that will cause a problem in my system. Namely card will be initialized with incomplete data.
You will have to create some kind of protcol, for instance you write to the card a list of operatons to complete:
Step6, Step8, Step10
and as you complete the tasks you remove the entry from the list.
When you reread the data from the disk, you check the list if any entry remains. If it does, the operation did not successfully complete before and you restore a previous state.
Unless you can somehow physically prevent the user from removing the card, there is no other way.
If the transaction is interrupted then the card is in the fault state. You have three options:
Do nothing. The card is in fault state, and it will remain there. Advise users not to play with the card. Card can be eligible for complete clean or format.
Roll back the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the rollback.
Complete the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the completion.
In all three cases you need to have a flag on the card denoting a transaction in progress.
More details are required in order to answer this.
However, making some assumption, I will suggest two possible solutions (more are possible...).
I assume the write operations are persistent - hence data written to the card is still there after card is removed-reinserted, and that you are referring to the coherency of the data on the card - not the state of the program performing the function calls.
Also assumed is that the increment method, increments the data already written, and the system must have this operation done in order to guarantee consistency:
For each record written, maintain another data element (on the card) that indicates the record's state. This state will be initialized to something (say "WRITING" state) before performing the writeData operation. This state is then set to "WRITTEN" after the incrementData operation is (successfully!) performed.
When reading from the card - you first check this state and ignore (or delete) the record if its not WRITTEN.
Another option will be to maintain two (persistent) counters on the card: one counting the number of records that began writing, the other counts the number of records that ended writing.
You increment the first before performing the write, and then increment the second after (successfully) performing the incrementData call.
When later reading from the card, you can easily check if a record is indeed valid, or need to be discarded.
This option is valid if the written records are somehow ordered or indexed, so you can see which and how many records are valid just by checking the counter. It has the advantage of requiring only two counters for any number of records (compared to 1 state for EACH record in option 1.)
On the host (software) side you then need to check that the card is available prior to beginning the write (don't write if its not there). If after the incrementData op you you detect that the card was removed, you need to be sure to tidy up things (remove unfinished records, update the counters) either once you detect that the card is reinserted, or before doing another write. For this you'll need to maintain state information on the software side.
Again, the type of solution (out of many more) depends on the exact system and requirements.
Isn't that just:
Copy data to temporary_data.
Write to temporary_data.
Increment temporary_data.
Rename data to old_data.
Rename temporary_data to data.
Delete the old_data.
You will still have a race condition (if a lucky user removes the card) at the two rename steps, but you might restore the data or temporary_data.
You haven't said what you're incrementing (or why), or how your data is structured (presumably there is some relationship between whatever you're writing with writeDataToCard and whatever you're incrementing).
So, while there may be clever techniques specific to your data, we don't have enough to go on. Here are the obvious general-purpose techniques instead:
the simplest thing that could possibly work - full-card commit-or-rollback
Keep two copies of all the data, the good one and the dirty one. A single byte at the lowest address is sufficient to say which is the current good one (it's essentially an index into an array of size 2).
Write your new data into the dirty area, and when that's done, update the index byte (so swapping clean & dirty).
Either the index is updated and your new data is all good, or the card is pulled out and the previous clean copy is still active.
Pro - it's very simple
Con - you're wasting exactly half your storage space, and you need to write a complete new copy to the dirty area when you change anything. You haven't given enough information to decide whether this is a problem for you.
... now using less space ... - commit-or-rollback smaller subsets
if you can't waste 50% of your storage, split your data into independent chunks, and version each of those independently. Now you only need enough space to duplicate your largest single chunk, but instead of a simple index you need an offset or pointer for each chunk.
Pro - still fairly simple
Con - you can't handle dependencies between chunks, they have to be isolated
journalling
As per RedX's answer, this is used by a lot of filesystems to maintain integrity.
Pro - it's a solid technique, and you can find documentation and reference implementations for existing filesystems
Con - you just wrote a modern filesystem. Is this really what you wanted?

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer and others including writer can read from HashTable(they may simply iterate over entire hash table)
Case B) All threads are identical and all could read/write.
Can someone suggest best strategy to make the class thread safe with following consideration
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is thus :
One BIG reader-writer lock(semaphore).
Specialize the semaphore so that there could be eight instances writer-resource for case B, where each each writer resource locks one row(or range for that matter).
(so i guess 1+8 mutexes)
Please let me know if I am thinking on the correct line, and how could we improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since in your interview they specifically mentioned an extremely high read:write ratio it makes sense trying to stuff as much overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with c++'s non-existent memory model you probably can't do this portably; c++0x should help here, but haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes - nice that but I'm not really sure how they reach that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.
No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node to the table requires a lock on the bucket about to have the head node modified. New nodes are always added at the head of the table so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has a r/w lock; readers iterating get a read lock lock. Node modification requires a write lock.
Iteration without the bucket lock leading to node removal requires an attempt to take the bucket lock, and if it fails it must release the locks and retry to avoid deadlock because the lock order is different.
Brief overview.
You can try atomic_hashtable for c
https://github.com/Taymindis/atomic_hashtable for read, write, and delete without locking while multithreading accessing the data, Simple and Stable
API documents given in README.

How to handle changing data structures on program version update?

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.
Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?
Let's take an address book. From program version x to x+1, what if:
a field is deleted (seems simple enough) or added (ok if all can use some new default)?
a string gets longer or shorter? An int goes from 8 to 16 bits of signed / unsigned?
maybe I combine surname/forename, or split name into two fields?
These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.
Obviously I need some hard coded logic to take care of each change.
What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+ 1 followed by x+1 -> x+2?
What if version x+1 is buggy and we need to roll-back to a previous version of the s/w, but have already "upgraded" the data structures?
I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
This is nothing new, so I just wondered how others do it....
I do have some code where a longer string is puzzled together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:
Define your goals - there are two:
new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)
Add version support to release 0 - At least write a version header. Together with keeping (potentially a lot of) old reader code around that can solve the first case primitively. If you don't want to implement case 2, start rejecting new data right now!
If you need only case 1, and and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.
Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
Keep a set of test data from all versions around.
Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).
e.g. don't store integers shorter than the native width. (you might need to do that when you need to store long integer arrays).
add a breaker - store some key that allows you to intentionally make old code display an error message that this new data is incompatible. You can use a string that is part of the error message - then your old version could display an error message it doesn't know about - "you can import this data using the ConvertX tool from our web site" is not great in a localized application but still better than "Ungültiges Format".
Don't serialize structs directly - that's the logical / physical separation. We work with a mix of two, both having their pros and cons. None of these can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence, that should already solve half of your troubles.
(A) a proper serialization mechanism - we use a bianry serializer that allows to start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serializationj code.) Chunks can be nested. That's all you need on the physical side, but needs some sugar-coating for common tasks.
(B) use a different in-memory representation - the in-memory reprentation could basically be a map<id, record> where id woukld likely be an integer, and record could be
empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
and array of records
I initially wrote that so the guys don't ask me for every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognize the problem with the clarity of today...) it could solve
Querying a non existing value will by default return a default/zero initialized value. when you keep that in mind when accessing the data and when adding new data this helps a lot: Imagine version 1 would calculate "foo length" automatically, whereas in version 2 the user can overrride that setting. A value of zero - in the "calculation type" or "length" should mean "calculate automatically", and you are set.
The following are "change" scenarios you can expect:
a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.
For implementing case 2, you also need to consider:
no value may ever be remvoed or replaced by another one. (But in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, other changes of valid range
phew. that was a lot. But it's not as complicated as it seems.
There's a huge concept that the relational database people use.
It's called breaking the architecture into "Logical" and "Physical" layers.
Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.
You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.
You don't need to reinvent SQL to accomplish this.
If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.
Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
Some simple guidelines if you're talking about a structure use as in a C API:
have a structure size field at the start of the struct - this way code using the struct can always ensure they're dealing only with valid data (for example, many of the structures the Windows API uses start with a cbCount field so these APIs can handle calls made by code compiled against old SDKs or even newer SDKs that had added fields
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for dealing with code that uses an older version of the structure, don't remove the field.
it may be wise to include a version number field, but often the count field can be used for that purpose.
Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.
The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.
And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.
What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.
struct address_book
{
unsigned int length; // total length of this struct in bytes
char items[0];
}
where 'items' is a variable length array of a structure that describes its own size and type
struct item
{
unsigned int size; // how long data[] is
unsigned int id; // first name, phone number, picture, ...
unsigned int type; // string, integer, jpeg, ...
char data[0];
}
In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over that data (from item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.
I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate to the new values and then update the in-place db?
For a simplified embedded db I usually don't reference any structs directly, but do put a very light weight API around any parameters. This does allow for you to change the physical structure below the API without impacting the higher level application.
Lately I'm using bencoded data. It's the format that bittorrent uses. Simple, you can easily inspect it visually, so it's easier to debug than binary data and is tightly packed. I borrowed some code from the high quality C++ libtorrent. For your problem it's so simple as checking that the field exist when you read them back. And, for a gzip compressed file it's so simple as doing:
ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();
To read it back:
igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
torrent::Bencode b;
is >> b;
if( b.has_key("url") )
d->url = b["url"].as_string();
} catch(...) {
}
I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.
Embed a version number in the struct or, do as Win32 does and use a size parameter.
if the passed struct is not the latest version then fix up the struct.
About 10 years ago I wrote a similar system to the above for a computer game save game system. I actually stored the class data in a seperate class description file and if i spotted a version number mismatch then I coul run through the class description file, locate the class and then upgrade the binary class based on the description. This, obviously required default values to be filled in on new class member entries. It worked really well and it could be used to auto generate .h and .cpp files as well.
I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.
However if you must use a single struct for this, there are a few things you can do to help make things easier.
1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.
2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.
3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.
4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.
5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).
You may want to take a look at how Boost Serialization library deals with that issue.