Key-value store by foldername - c++

We have our in houses noSQL db, that basically store everything in a compact binary file. Now, I need a data structure similar to key-value store or B+Tree. The issue is 'value' in my case can be of different types, and of size very volatile, could be from 1Kb to 1-2Gb. Typically the key is a string, and the value is a stream of data, can be a stream of int, string, or of custom type.
I was thinking about implementing an B+ Tree, but that's not easy because the B+ Tree need the 'value' to be of the same type, and the size of 'value' should be small enough to be storable in a relative small block. There maybe a variant but I didn't find a tutorial of how to implement a B+ Tree with examples showing how to store on disk. Most of tutorial I see are only in-memory B+ Tree.
I then have the idea of using folder/file name as the key. And then the value can be anything inside the file. Values then could be of arbitrary size, that's really what I want. So my question is here, in the extreme case,
data for different days is store in separated folders
I can have 1M-50M keys (indeed files/folder) to store on disk for a days
Data operation on a files will generally be 'read-only', and 'append to' during the day. Historical data will never be modified.
I've seen that I can have ~4 billion files on modern OS, so I'm happy with that approach for ~2YR storage on a single machine. I just worries if that way of implementing key-value store is very bad? Why? What issue can I have when dealing with filesystem? (Framented disk on windows for example?)
All are implemented in C++ in both Windows/Linux.

I think, if you can secure and match your requirement, it shouldn't be bad.
I have done similar thing for an embedded project and its limited set of data.
Things needs to be considered
OS/Filesystem should support required length of the folder(key) and filename(how you choose)
It does fragment the disk and might delay the disk access for huge directory structures. Which might affect overall system process.
Application Performance might degrade, since read/write operation depends on file operation - probably you can add cache in your program if required.
Not good for multi-threaded application, file locking should be taken care.
Security should be taken care.

Why are you worried w.r.t size of value. You can use your existing db. Value can be a string of following format "type|value_data", where "|" is a separator.
Here, value_data can be "actual value" or "path of file which contains value"
type = LOCAL (in this case value_data will be actual value if it can fit in db)
type = REMOTE (in this case value_data will be path of file)

"Data for different days is stored in different folders" - this isn't convenient if you want to search for a single across days.
Also, you might run into problems when the number of files per folder exceed a filesystem limit. 4 billion files on a disk is not a problem, 50M in a single folder is. But you don't need to store everything in the same folder, of course.
A key can be partitioned into a folder part and a filename part.
Things do get tricky if you need to rely on the B-Tree property of finding a range of keys. This means you need an order, and cannot use a hashing function to map the key to a folder/filename pair. In that case, you have an issue. Worst case is that your keys are just "1" to "999999999" continuously, plus a random set of much larger keys. That means you can't use the last 4 digits as the filename (too many folders) or the last 8 digits (too many files).

Related

Data storage library for C++

I would like to use some tiny C++ library in my code, which would allow to do something like:
DataStore ds;
ds.open("data.bin");
int num=5;
std::string str="some text";
ds.put("key1",num);
ds.put("key2",str);
ds.get("key1");// returns int(5)
ds.get("key2");// returns std::string("some text")
The usage style doesn't have to be the same as that code example, but the principle should remain (get/set value of any type and store it in the file blob). The library should also not be SQL based, nor be an SQL wrapper. What are such libraries and what are their advantages?
EDIT: max 10k keys would be used, with approx. 100bytes data per key, file doesn't need to be portable between computers or OS, file shouldn't be editable with text editor (it looks more professional if it isn't) and doesn't have to be multi-threaded aware.
One option for you is to use BerkeleyDB and its C API, C++ API or C++ STL API:
BekeleyDB has a small footprint, is fast, mature and robust. Another advantage of BerkeleyDB is that most scripting languages like Python, Perl, etc. have bindings to it so you can manipulate (examine, visualize) the data with them.
The disadvantage is that all you can store in it is a key-value pair where both key and value are strings (or rather blobs), so you have to convert from C++ data types to strings/blobs.
It wouldn't be hard to create a class that does what you describe in a basic way. All you need is a some functions that can read/write keys, a tag for "what type is this" [perhaps combined with the size of the data stored, if we assume the data stored isn't enormous - and I mean a few MB per item or so]. You may find having some sort of "index structure" or "where is the next element" location references help.
There is a small problem with the way your ds.get(std::string) is shown: you can't actually return an int and a std::string from the same function. You may be able to write a function that takes a std::string as a reference, and another that takes an int as a reference, or some such.
It gets more interesting if you need to have many keys - at this point, you probably need some sort of hash or binary tree type organisation to search through the keys. 10k keys is probably not a big deal - if you store them in sorted order it gets easier.
file shouldn't be editable with text editor (it looks more professional if it isn't)
I must say, I don't agree with that. I find text editable files are VERY professional looking. The fact that it's binary is just making things awkward, and harder to deal with should something in the application not work as you wish (e.g. it stored the installation path, you moved it, and it no longer works, and since you can't start it, it won't allow you to edit that configuration).

How do I convert a directory path to a unique numerical identifier (Linux/C++)?

I am investigating ways to take a directory (folder) and derive some form of unique numerical identifier. I have investigated "string to hash" methods, however, the Pigeon Hole Principle means that one can never derive a truely unique number for every single string.
String to unique hash is no good.
I have recently been investigating other means of achieving my goal and thus have the following question to ask:
Directory time stamps - how 'unique' are they?
To what resolution are the time stamps reported by 'stat' as described here (second post)? if the resolution is small enough, is it possible for more than one folder to share the exact same time stamp on a Linux system?
If anyone has other methods/techniques they'd like to share, I'd be happy to listen :)
Edit 1 To clarify my use case in response to the answers posted so far: I am working on Android platforms, so the filesystem is not linked to any other (except of course for removeable media such as Micro SD cards).
I am inserting each path into a database but trying to avoid string comparisons when querying the table. The use of maps/hashmaps is not an option here. Yes, the path itself is unique, but ideally I need a numerical identifier that can be used to query the table as opposed to the path itself. The identifier must also be unique per path. I have experimented with std::collate but found there were many collides in the hashes (a dataset of 20, 000 paths yeilds approximatley 100 collides). What was even more surprising is that the hashes appeared to be largely different each time my application is run. I wonder if it's seeded somehow?
Many thanks,
P
On any UNIX-based system, you can use the inode number as a unique identifier within that file system. Combining it with the device number will make it unique within the machine. If you wanted it to be globally unique, you could throw in the system's primary MAC address.
Keep in mind, however, that:
The inode number will "follow" the directory if it is moved or renamed. It will change if the directory is deleted and replaced.
The inode number will not be stable across systems, beyond one or two really special directories. (For instance, / is usually inode 2.)
+1 duskwuff, good one!
Another way is to simply treat the dir's path as a number ("BigInt").
Take this dir for example: /opt/www/log
It's 12 chars long.
12 * 8bits = 96bits
Thus you have a 96 bits long number,which you can represent in hex/base64/anything (in case you need to pass it as an HTML link).
I'd personally go for duskwuff's approach though.
I think it depends very much on the purpose of why you want a unique numerical identifier. Timestamps can change, inodes can change, disknumbers can change, MAC adresses can change. (Still, +1 for duskwuff)
In some scenarios you can simply make a table, where each path you add gets a new, unique number, just like a numerical key column in a database.
Although hashes can collide, in every actual environment this is absolutely unlikely (if you don't use the lousiest algorithm around...) It is much more likely that you get errors due to flaws in your implementation, for example you treat "/tmp" differently than "/tmp/", because you don't normalize the paths before hashing them. Or, you want to distinguish physical folders, but forget to check for hardlinks and symlinks to the same folder, thus you get multiple hashes / id's for the same dir.
Again, depending on your usecase, a collision is not necessarily fatal: if you find that a new path results in the same hash as an existing one (will not happen!) you can still react on that case. (*)
Just to help your imagination: If you use a 64 bit hash, you could fill 150 000 000 of 1TB hard disk drives with empty folders (nothing on them but short folder names...) and then you will have a collision for sure. If you think, that's too risky (blink, blink), use a 128 bit hash which makes it 18 446 744 073 710 000 times less likely.
Hashes are designed to make collisions unlikely, and even good old MD5 will do its job well if nobody willingly tries to produce a collision.
(*) Edit:
Your pigeonhole article already points that out: a collision just mean that lookup is no longer O(1), but slightly slower. As it rarely ever happens, you can easily live with that. If you use a std::map (no hash) or std::hashmap, you'll not have to worry about collisions. See what the difference between map and hashmap in STL

Fastest way to do many small, blind writes on a huge file (in C++)?

I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.
To elaborate, I have a pair of lists of (key, record number) tuples sorted by key for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is what is the fastest way to actually do this?
Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. This has the benefit of greatly increasing the chances that the appropriate records will be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it will incur extremely high syscall overhead.
While I'd love a portable solution (even if it involves a dependency on a widely used library, such as Boost), modern Windows and Linux are the only must-haves, so I can make use of OS-specific APIs (e.g. CreateFile hints or scatter/gather I/O). However, this can involve a lot of work to even try out, so I'm wondering if anyone can tell me if it's likely worth the effort.
It looks like you can solve this by use of data structures. You've got three constraints:
Access time must be reasonably quick
Data must be kept sorted
You are on a spinning disk
B+ Trees were created specifically to address the kind of workload you're dealing with here. There are several links to implementations in the linked Wikipedia article.
Essentially, a B+ tree is a binary search tree, except groups of nodes are held together in groups. That way, instead of having to seek around for each node, the B+ tree loads only a chunk at a time. And it keeps a bit of information around to know which chunk it's going to need in a search.
EDIT: If you need to sort by more than one item, you can do something like:
+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
|| ^ | ^ | ^
|\------/ | | | |
\-------------------/ | |
| | |
\----------+----------/
I.e. you have seperate B+Trees for each key, and a seperate list of records, pointers to which are stored in the B+ trees.
I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory.
it seems that it will incur extremely high syscall overhead.
You can use memory mapped access to the file to avoid syscall overhead. mmap() on *NIX, and CreateFileMapping() on Windows.
Split file logically into blocks, e.g. 32MB. If somethings needs to be changed in the block, mmap() it , modify data, optionally msync() if desired, munmap() and then move to the next block.
That would have been something I have tried first. OS would automatically read whatever needs to be read (on first access to the data), and it will queue IO anyway it likes.
Important things to keep in mind is that the real IO isn't that fast. Performance-wise limiting factors for random access are (1) the number of IOs per second (IOPS) storage can handle and (2) the number of disk seeks. (Usual IOPS is in hundreds range. Usual seek latency is 3-5ms.) Storage for example can read/write 50MB/s: one continuous block of 50MB in one second. But if you would try to patch byte-wise 50MB file, then seek times would simply kill the performance. Up to some limit, it is OK to read more and write more, even if to update only few bytes.
Another limit to observe is the OS' max size of IO operation: it depends on the storage but most OSs would split IO tasks larger than 128K. The limit can be changed and best if it is synchronized with the similar limit in the storage.
Also keep in mind the storage. Many people forget that storage is often only one. I'm trying here to say that starting crapload of threads doesn't help IO, unless you have multiple storages. Even single CPU/core is capable of easily saturating RAID10 with its 800 read IOPS and 400 write IOPS limits. (But a dedicated thread per storage at least theoretically makes sense.)
Hope that helps. Other people here often mention Boost.Asio which I have no experience with - but it is worth checking.
P.S. Frankly, I would love to hear other (more informative) responses to your question. I was in the boat several times already, yet had no chance to really get down to it. Books/links/etc related to IO optimizations (regardless of platform) are welcome ;)
Random disk access tends to be orders of magnitude slower than sequential disk access. So much so that it can be useful to choose algorithms that might sound badly inefficient at first blush. For example, you might try this:
Create your join index, but instead of using it, just write out the list of pairs (A index, B index) to a disk file.
Sort this new file of pairs by the A index. Use a sort algorithm designed for external sorting (though I've not tried it myself, the STXXL library from stxxl.sourceforge.net looked promising when I was researching a similar problem)
Sequentially walk through the A record file and the sorted pair list. Read a huge chunk in, make all the relevant changes in memory, write the chunk out. Never touch that portion of the A record file again (since the changes you planned to make come in sequential order)
Go back, sort the pair file by the B index (again, using an external sort). Use this to update the B record file in the same manner.
Instead of building a list of (key, record number A, record number B) I would leave out the key to save space and just build (record number A, record number B). I'd sort that table or file by the A's, sequentially seek to each A record, write the B number, then sort the list by the B's, sequentially seek to each B record, write the A number.
I'm doing very similar large file manipulations, and these newer machines are so damn fast it doesn't take long at all:
On a cheapo 2.4gHz HP Pavilion with 3gb ram and 32-bit Vista, writing 3 million sequential 1,008-byte records to a new file takes 56 seconds, using Delphi library routines (as opposed to the Win API).
Sequentially seeking to each record in the file and writing 8 bytes using Win API FileSeek/FileWrite on a booted machine takes 136 seconds. That's 3 million updates. Immediately rerunning the same code takes 108 seconds, since the O/S has some things cached.
Sorting record offsets first, then sequentially updating the files, is the way to go.

mmap-loadable data structure library for C++ (or C)

I have a some large data structure (N > 10,000) that usually only needs to be created once (at runtime), and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the line
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar class for hash table myself but I find it pretty hard to maintain, so I would like to see if there's existing solutions already. The library should
Contain a creation routine that serialize the data structure into file. This part doesn't need to be fast.
Contain a loading routine that mmap a file into read-only (or read-write) data structure that can be usable within O(1) steps of processing.
Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
Small time overhead to accessors. (i.e. the complexity isn't modified.)
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory mapped file and then create the STL map structure with a customer allocator. Your customer allocator then simply takes the beginning of the memory of the memory mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory mapped file and should be reloadable later.
You will have to check if memory is free'd by the STL map. If it is, your customer allocator will lose some memory of the memory mapped file but if this is limited you can probably live with it.
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time opimising the hash function for the particular data, so there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
There's certainly unordered set/multiset support (C++ code for hash tables).
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known before hand. If you know all keys before hand, then it is the way to go since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of Dconf is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html

How to handle changing data structures on program version update?

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.
Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?
Let's take an address book. From program version x to x+1, what if:
a field is deleted (seems simple enough) or added (ok if all can use some new default)?
a string gets longer or shorter? An int goes from 8 to 16 bits of signed / unsigned?
maybe I combine surname/forename, or split name into two fields?
These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.
Obviously I need some hard coded logic to take care of each change.
What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+ 1 followed by x+1 -> x+2?
What if version x+1 is buggy and we need to roll-back to a previous version of the s/w, but have already "upgraded" the data structures?
I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
This is nothing new, so I just wondered how others do it....
I do have some code where a longer string is puzzled together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:
Define your goals - there are two:
new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)
Add version support to release 0 - At least write a version header. Together with keeping (potentially a lot of) old reader code around that can solve the first case primitively. If you don't want to implement case 2, start rejecting new data right now!
If you need only case 1, and and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.
Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
Keep a set of test data from all versions around.
Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).
e.g. don't store integers shorter than the native width. (you might need to do that when you need to store long integer arrays).
add a breaker - store some key that allows you to intentionally make old code display an error message that this new data is incompatible. You can use a string that is part of the error message - then your old version could display an error message it doesn't know about - "you can import this data using the ConvertX tool from our web site" is not great in a localized application but still better than "Ungültiges Format".
Don't serialize structs directly - that's the logical / physical separation. We work with a mix of two, both having their pros and cons. None of these can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence, that should already solve half of your troubles.
(A) a proper serialization mechanism - we use a bianry serializer that allows to start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serializationj code.) Chunks can be nested. That's all you need on the physical side, but needs some sugar-coating for common tasks.
(B) use a different in-memory representation - the in-memory reprentation could basically be a map<id, record> where id woukld likely be an integer, and record could be
empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
and array of records
I initially wrote that so the guys don't ask me for every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognize the problem with the clarity of today...) it could solve
Querying a non existing value will by default return a default/zero initialized value. when you keep that in mind when accessing the data and when adding new data this helps a lot: Imagine version 1 would calculate "foo length" automatically, whereas in version 2 the user can overrride that setting. A value of zero - in the "calculation type" or "length" should mean "calculate automatically", and you are set.
The following are "change" scenarios you can expect:
a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.
For implementing case 2, you also need to consider:
no value may ever be remvoed or replaced by another one. (But in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, other changes of valid range
phew. that was a lot. But it's not as complicated as it seems.
There's a huge concept that the relational database people use.
It's called breaking the architecture into "Logical" and "Physical" layers.
Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.
You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.
You don't need to reinvent SQL to accomplish this.
If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.
Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
Some simple guidelines if you're talking about a structure use as in a C API:
have a structure size field at the start of the struct - this way code using the struct can always ensure they're dealing only with valid data (for example, many of the structures the Windows API uses start with a cbCount field so these APIs can handle calls made by code compiled against old SDKs or even newer SDKs that had added fields
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for dealing with code that uses an older version of the structure, don't remove the field.
it may be wise to include a version number field, but often the count field can be used for that purpose.
Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.
The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.
And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.
What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.
struct address_book
{
unsigned int length; // total length of this struct in bytes
char items[0];
}
where 'items' is a variable length array of a structure that describes its own size and type
struct item
{
unsigned int size; // how long data[] is
unsigned int id; // first name, phone number, picture, ...
unsigned int type; // string, integer, jpeg, ...
char data[0];
}
In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over that data (from item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.
I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate to the new values and then update the in-place db?
For a simplified embedded db I usually don't reference any structs directly, but do put a very light weight API around any parameters. This does allow for you to change the physical structure below the API without impacting the higher level application.
Lately I'm using bencoded data. It's the format that bittorrent uses. Simple, you can easily inspect it visually, so it's easier to debug than binary data and is tightly packed. I borrowed some code from the high quality C++ libtorrent. For your problem it's so simple as checking that the field exist when you read them back. And, for a gzip compressed file it's so simple as doing:
ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();
To read it back:
igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
torrent::Bencode b;
is >> b;
if( b.has_key("url") )
d->url = b["url"].as_string();
} catch(...) {
}
I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.
Embed a version number in the struct or, do as Win32 does and use a size parameter.
if the passed struct is not the latest version then fix up the struct.
About 10 years ago I wrote a similar system to the above for a computer game save game system. I actually stored the class data in a seperate class description file and if i spotted a version number mismatch then I coul run through the class description file, locate the class and then upgrade the binary class based on the description. This, obviously required default values to be filled in on new class member entries. It worked really well and it could be used to auto generate .h and .cpp files as well.
I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.
However if you must use a single struct for this, there are a few things you can do to help make things easier.
1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.
2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.
3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.
4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.
5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).
You may want to take a look at how Boost Serialization library deals with that issue.