Serialization with a custom pattern and random access with Boost - C++

I'm asking here because I have already tried to search, but I have no idea whether these things even exist or what their names are.
Let me start by explaining what I mean by a custom pattern: suppose I need to serialize objects or data of types foo, bar and boo. Usually the library handles this for the user in a very simple way: what comes first goes in first in the serialization process, so if I serialize all the foo first, they are written "at the top" of the file, and all the bar and boo come after the foo.
Now I would like to control the order in my file and organize things based on a custom pattern. Is this possible with Boost? Which part of the library provides this feature?
Second, and strictly related to the first point, I would also like to access my serialized binary files without being forced to parse and read all the previous values just to extract the one I'm interested in, a bit like RAM, which works with memory addresses and offers random access without forcing you to scan all the other addresses.
Thanks.

On the first issue: the Boost serialization library is agnostic as to what happens after it turns an object into its serialized form. It does this by working with input and output streams; files are just that, std::ofstream/std::ifstream. For other types of streams, however, the order/pattern you speak of doesn't make sense. Imagine you're sending serialized objects over the network: the library can't know that it would have to rearrange the order of the objects and, in fact, it can't do that once they've been sent. For this reason, it does not support what you're looking for.
What you can do is create a wrapper that either caches the serialized versions of the objects and arranges them in memory before you tell it to write them out to a file, or that knows that, since you're working with files, it can later seekp to the appropriate place in the file and write there (this approach would require you to store the locations of the objects you wrote to the file).
As for the second thing, random access file reading: you will have to know exactly where the object is in the file. If you know that the structure of your file won't change, you can seekg on the file stream before handing it to Boost for deserialization. If the file structure will change, however, you still need to know the location of objects in the file. If you don't want to parse the file to find it, you'll have to store it somewhere during serialization. For example, you can maintain a sort of registry of objects at the top of the file. You will still have to parse it, but it should be just a simple [object identifier]-[location in file] mapping.

Related

Boost PTree used just for reading file or for storing the values too?

I have a configuration file in JSON. I have created a class (ConfigFile) that reads that file and stores the values (using the Boost parser and ptree). I am wondering: is it good practice to keep the ptree as a member of the ConfigFile class, or should I use it just for reading the JSON and store the values in a map member?
I'd say what matters is the ConfigFile's interface. If you can keep it consistent with either version, it shouldn't be a problem to just choose one and switch to the other if you feel the need without breaking anything.
One thing worth doing either way is keeping the property tree out of the header, so clients of ConfigFile don't inherit the Boost dependency; that can be achieved with the pimpl idiom.
@sehe's comment makes a lot of sense here as well and is something to remember.

Making functions increasingly generic

I am currently working on a generic C++ function for reading in data from a database, and then printing it to a file in XML format. Each database file gets a structure with basic information in it about the database file (it's a proprietary database).
The trouble is, while it's perfectly generic for most things, there is a single database file that uses a sort of union, where a column determines the contents of the remaining columns. These are to get their own indentation level, like a child in the main record. The secondary problem is that there might end up being a file with an unknown number of child indentations.
What would I need to do to make this function as generic as possible to avoid having to put in an if statement for that specific file? I have a means (potentially) to make it work with one child, but when it gets to multiple child indentations things get a little crazy and hazy.
Instead of passing a structure with basic information, pass one or more function pointers (i.e. "callbacks") which take care of parsing individual parts. That way, your generic function implements the toplevel algorithm but the lower level details are handled by functions provided by the caller.

C++ Truncating an iostream Class

For a project I'm working on for loading/storing data in files, I made the decision to use the iostream library, because of some of the features it holds over other file I/O libraries. One such feature is the ability to use either the derived fstream or stringstream classes to load data from a file or from an already existing place in memory. So far, though, there is one major drawback, and I've been having trouble finding information about it for a while.
Similar to the functions available in the POSIX library, I was looking for some implementation of the truncate or ftruncate functions. For stringstream, this would be as easy as passing the associated string back to the stream after reconstructing it with a different size. For fstream, I've yet to find any way of doing this, actually, because I can't even find a way to pull the handle to the file out of this class. While giving me a solution to the fstream problem would be great, an even better solution to this problem would have to be iostream friendly. Because every usage of either stream class goes through an iostream in my program, it would just be easier to truncate them through the base class, than to have to figure out which class is controlling the stream every time I want to change the overall size.
Alternatively, if someone could point me to a library that implements all of these features I'm looking for, and is mildly easy to replace iostream with, that would be a great solution as well.
Edit: For clarification, the iostream classes I'm using are more likely to just be using only the stringstream and fstream classes. Realistically, only the file itself needs to be truncated to a certain point, I was just looking for a simpler way to handle this, that doesn't require me knowing which type of streambuf the stream was attached to. As the answer suggested, I'll look into a way of using ftruncate alongside an fstream, and just handle the two specific cases, as the end user of my program shouldn't see the stream classes anyways.
You can't truncate an iostream in place. You have to copy the first N bytes from the existing stream to a new one. This can be done with the sgetn() and sputn() methods of the underlying streambuf object, which you can obtain by iostream::rdbuf().
However, that process may be I/O intensive. It may be better to special-case the two stream types: manipulate the std::string directly, or call ftruncate as you mentioned.
If you want to be really aggressive, you can create a custom std::streambuf derivative class which keeps a pointer to the preexisting rdbuf object in a stream, and only reads up to a certain point before generating an artificial end-of-file. This will look to the user like a shorter I/O sequence, but will not actually free memory or filesystem space. (It's not clear if that is important to you.)

Is there something like a Filestorage class to store files in?

Is there something like a class that might be used to store Files and directories in, just like the way Zip files might be used?
Since I haven't found any "real" class for writing Zip files (real as in an actual class, not a plain C API), it would be nice to be able to store files and directories in a container-like file.
A perfect API would probably look like this:
int main()
{
    ContainerFile cntf("myContainer.cnt", ContainerFile::CREATE);
    cntf.addFile("data/some-interesting-stuff.txt");
    cntf.addDirectory("data/foo/");
    cntf.addDirectory("data/bar/", ContainerFile::RECURSIVE);
    cntf.close();
}
... I hope you get the idea.
Important requirements are:
The library must be cross-platform.
Anything *GPL is not acceptable in this case (the MIT and BSD licenses are).
I have already toyed with the idea of creating an implementation based on SQLite (and its ability to store binary blobs). Unfortunately, it seems impossible to store directory structures in an SQLite database, which makes it pretty much useless in this case.
Is it useless to hope for such a class library?
In an SQLite db you can store directory-like structures... you just have to have a "Directories" table, with one entry for each directory, having at least an index and a "parent" field (which holds the index of another directory, or 0 if it has no parent). Then you can have a "Files" table, which contains file attributes, the index of the parent directory and the file content.
That's it, now you have your directory tree in a relational DB.
Someone pointed me to PhysicsFS, which has an API similar to what you describe, but it's a pure C API that does everything you need. A trivial object-oriented wrapper could be easily written.
You might like to check out http://www.cs.unc.edu/Research/compgeom/gzstream/
If you are making your own then redis may be a better choice than SQLite as I believe it handles binary data better.
I took the time to write a tiny, yet working wrapper around libarchive. I'm not exactly familiar with all features of Libarchive, but the result fits what I needed:
archive_wrapper.cpp # gist.github.com
It uses libmars for strings, etc., but I guess it wouldn't be too hard to replace the mars::mstring occurrences with std::string.
And of course this wrapper is available under the MIT/X11 License (just as libmars), which means you can do whatever you want with it. ;-)

Parsing huge data with c++

In my job, I need to parse different kinds of data files from different data sources. Sometimes I parse them by writing C++ code directly (with the help of Qt and Boost :D), sometimes manually with a helper program.
I must note that the data types are so different from each other that it is very hard to create a common interface for all of them. But I want to do this job in a more generic way. I am planning to write a library to convert them, and it should be easy to add new parser utilities in the future. I am also planning to invoke the other helper programs from inside my program, rather than running them manually.
My question is: what kind of architecture or pattern do you suggest? The basic condition is that the library must be extendable via new classes or DLLs, and also configurable.
By the way, the data can be text, ASCII or something like CSV (comma-separated values), and most of it is specific to a certain data source.
Not to blow my own trumpet, but my small open-source utility CSVfix has an extensible architecture based on deriving new C++ classes with a very simple interface. I did consider using a plugin architecture with DLLs, but it seemed like overkill for such a simple utility. If interested, you can get the binaries and sources here.
I'd suggest a 3-part model, where the common data format is a string, which should be able to hold any value:
Reader: in this layer the values are read from the source (e.g. a CSV file) using some sort of file-format descriptor. The values are then stored in some sort of intermediate data structure.
Connector/Converter: this layer is responsible for mapping the reader data to the writer fields.
Writer: this layer is responsible for writing a specific data structure to the target (e.g. another file format or a database).
This way you can write different Readers for different input files.
I think the hardest part would be creating the definition of the intermediate storage format/structure so that it is future-proof and flexible.
One method I used for defining data structure in my datafile read/write classes is to use std::map<std::string, std::vector<std::string>, string_compare> where the key is the variable name and the vector of strings is the data. While this is expensive in memory, it does not lock me down to only numeric data. And, this method allows for different lengths of data within the same file.
I had the base class implement this generic storage, while the derived classes implemented the reader/writer capability. I then used a factory to get to the desired handler, using another class that determined the file format.