I want to store a graph of different objects for a game. Their classes may or may not be related, and they may or may not contain vectors of simple structures.
I want the parsing operation to be fast, since the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean making objects serialize themselves. This is efficient, but it means writing a separate serialization method for each kind of object.
By binary parsing/composing I mean creating a separate tree of parsers/composers that holds and reads the data for these objects, and passing it around so my objects can push/pull their data.
I could also use JSON, but it can be pretty slow to read, and it is not very size-efficient when it comes to big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one point for binary. Personally, I'd go with text for everything except images (and other data that is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and The Power of Plain Text.
Check out Protocol Buffers from Google or Thrift from Apache. Although billed as a way to write wire protocols easily, each is basically an object serialization mechanism that can generate bindings in a dozen languages, has an efficient binary representation, supports easy versioning, performs well, and is well supported.
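For a rough idea of what that looks like from C++: the GameObject message and the game_object.pb.h header below are hypothetical; you would define the message in a .proto schema yourself and let protoc generate the class, which provides the SerializeToOstream/ParseFromIstream calls shown.

    // Hypothetical message, defined in a .proto schema, e.g.:
    //   message GameObject { string name = 1; repeated float matrix = 2; }
    // protoc generates a C++ class with the serialization calls built in.
    #include <fstream>
    #include "game_object.pb.h"  // hypothetical generated header

    void save(const GameObject& obj) {
        std::ofstream out("object.bin", std::ios::binary);
        obj.SerializeToOstream(&out);  // compact binary wire format
    }

    GameObject load() {
        GameObject obj;
        std::ifstream in("object.bin", std::ios::binary);
        obj.ParseFromIstream(&in);  // unknown fields are skipped, which is
        return obj;                 // what makes versioning painless
    }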
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.
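For what it's worth, here is a minimal sketch of how Boost.Serialization is typically wired up; GameObject and its members are placeholders for your own types.

    #include <fstream>
    #include <vector>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/serialization/vector.hpp>

    struct GameObject {  // placeholder type
        int id;
        std::vector<float> matrix;

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & id;      // the same serialize() handles both saving
            ar & matrix;  // and loading
        }
    };

    int main() {
        GameObject obj{42, {1.0f, 0.0f, 0.0f, 1.0f}};
        std::ofstream out("object.bin", std::ios::binary);
        boost::archive::binary_oarchive ar(out);
        ar << obj;  // swap in text_oarchive for a human-readable archive
    }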
I am reading code for a project written by others. The main task of the project is to read the contents of a large structured text file (.txt) with 8 columns into a KnowledgeBase object, which has a number of methods and variables. The KnowledgeBase object is then output into a binary file. For example, the KnowledgeBase class has at least these two variables:
map<string, pair<string, string>> key_info
vector<ObjectInfo> objects
...
These variables are easy to understand when I step through the code with gdb. Then the code seems to convert such vectors and maps into binary forms, and the two variables above have corresponding binary counterparts:
BinaryKeyInfo *bkeys
BinaryObjectInfo *bObjects
Later on, when outputting to the binary file, it has code like this:
fwrite((char*)(&wcount), sizeof(int32_t), 1, output);     // write the record count first
fwrite((char*)bkeys, sizeof(KeyInfo_t), wcount, output);  // then dump the array of records as raw bytes
The code that converts the original KnowledgeBase to binary is complicated. My question is: what's the main purpose of this conversion? Is it so that loading the binary file into memory is faster than parsing the plain text file? The plain text file is large. I learned that object serialization is primarily for transmitting objects over the net, but I don't think that's the purpose here. It looks more like it is for speeding up data loading and saving memory. Could that be part of object serialization in C++?
Is the main purpose of object serialization in C++ for faster object loading?
No. The most important purpose of serialisation is to transform the state of the program into a format that can be stored on the filesystem, or that can be communicated across a network, and that can be de-serialised back. Often, the purpose of either is for another program to do the de-serialisation. Sometimes the de-serialiser is another instance of the same program.
The speed of de-serialisation is one metric that can be used to gauge whether one particular serialisation format is a good one. The ability to quickly undo what you have done is not the reason why you do it in the first place.
what's the benefit of converting them into binary vectors or maps?
As I mention above, the benefit of serialisation is the ability to store the serialised data on the filesystem, or to send it over a network.
what' the benefit between plain text files VS binary files?
Pros of text serialisation format:
Humans are able to read and write plain text. Humans generally are not able to read or write binary files.
It's generally easier to implement a plain text format de-/serialiser in a way that works across differing computers than it is to implement a binary format de-/serialiser that achieves the same.
Pros of binary serialisation format:
Typically faster and uses less storage and bandwidth.
Can be easier to implement if there is no need for communication between differing systems, but this is typically only true in very simple cases. (Furthermore, there usually is a need for cross-system compatibility, even if that need hasn't been realised yet.)
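To make the trade-off concrete, here is a small sketch of the same record serialised both ways; the Record type is invented for illustration.

    #include <cstdint>
    #include <cstdio>
    #include <fstream>

    struct Record {  // invented example type
        int32_t id;
        double  value;
    };

    // Text: portable and human-readable, but larger and slower to parse.
    void save_text(const Record& r, std::ofstream& out) {
        out << r.id << ' ' << r.value << '\n';
    }

    // Binary: compact and fast, but the reader must share the writer's
    // endianness, struct padding, and type sizes.
    void save_binary(const Record& r, std::FILE* out) {
        std::fwrite(&r, sizeof r, 1, out);
    }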
We aim to use HDF5 as our data format. HDF5 has been selected because it is a hierarchical, filesystem-like, cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which do not consist of large amounts of data), considering also file versioning issues and the effort to build the library. Parameters inside the HDF5 file could be stored either as (A) human-readable attribute/value pairs or as (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have a variable named Polygon holding the string representation of the series of vertices, for instance (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
We have some ideas, but it would be great to have input from people who have already worked with something similar. More precisely, could you please list the pros and cons of A and B, and say under which circumstances each would be preferable?
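To make the two options concrete, here is a rough sketch of each using the HDF5 C API; the names and values are just the ones from the example, and error checking is omitted.

    #include <hdf5.h>
    #include <cstring>

    // Case A: the polygon stored as a human-readable string attribute.
    void write_polygon_text(hid_t file) {
        const char* text = "(1, 2); (3, 4); (4, 1)";
        hid_t strtype = H5Tcopy(H5T_C_S1);
        H5Tset_size(strtype, std::strlen(text) + 1);
        hid_t scalar = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate2(file, "Polygon", strtype, scalar,
                                H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, strtype, text);
        H5Aclose(attr); H5Sclose(scalar); H5Tclose(strtype);
    }

    // Case B: the same polygon stored as a [2 x 3] numeric dataset.
    void write_polygon_binary(hid_t file) {
        double verts[2][3] = { {1, 3, 4},    // x coordinates
                               {2, 4, 1} };  // y coordinates
        hsize_t dims[2] = {2, 3};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "Polygon", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, verts);
        H5Dclose(dset); H5Sclose(space);
    }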
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an HDF5 library, I assume serializing and parsing take equivalent human effort either way.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and to maintenance costs. You'll be able to sed, grep, and even edit the data in Excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This will mean either engineering effort or having the same libraries available on all platforms. With plain text this is easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain text? Is it only meaningful to look at it with big-data HDF5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you'll cross that threshold, start with plain text, and write the interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
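A sketch of what that switchable persistence interface might look like (all names invented):

    #include <string>
    #include <vector>

    // Callers depend only on this interface, never on the file format.
    class DataStore {
    public:
        virtual ~DataStore() = default;
        virtual void save(const std::string& key,
                          const std::vector<double>& values) = 0;
        virtual std::vector<double> load(const std::string& key) = 0;
    };

    // Start here: trivial to debug with a text editor, grep, and diff.
    class TextStore : public DataStore { /* overrides omitted */ };

    // Drop-in replacement for when text becomes the bottleneck.
    class BinaryStore : public DataStore { /* overrides omitted */ };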
If you expect to edit the file by hand often (like XMLs or JSONs), then go with human readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?
I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work) and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you try its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to imitate (and largely succeed at imitating) those of the collections in the C++ standard library, but are designed to work with data too large to fit in memory.
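A minimal sketch of what that looks like; stxxl::VECTOR_GENERATOR is STXXL's usual way of configuring its external-memory vector, and the element count here is invented.

    #include <cstdint>
    #include <stxxl/vector>

    int main() {
        // Looks and behaves like std::vector, but pages its blocks to
        // disk on demand, so it works even when the data exceeds the
        // address space of a 32-bit machine.
        stxxl::VECTOR_GENERATOR<double>::result v;
        for (std::uint64_t i = 0; i < 1000000000ULL; ++i)
            v.push_back(static_cast<double>(i));
        // Sequential scans are fast; random access costs a block read.
        return v[0] == 0.0 ? 0 : 1;
    }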
I have been working in scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. It can provide efficient parallel read/write, which is important for dealing with big data.
An alternative solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be kinda painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, and it required a series of data transformations. You need to know whether you will query the data often or not. If not often, then the time-consuming loading may not be worth it.
You may be interested in this paper.
It describes a way to query HDF5 data directly using SQL.
What is a flexible way to stream data to disk in a C++ program on Windows?
I am looking to create a flexible stream of data that may contain arbitrary values (say time, an average, a reset flag, etc.), written to disk for later analysis. Data may come in at non-uniform, irregular intervals. Ideally this stream would have minimal overhead and be easily readable in something like MATLAB, so I could easily analyze events and data.
I'm thinking of a binary file with a header describing the types of packets, followed by a wild dump of data tagged with . I'm considering a lean, custom format but would also be interested in something like HDF5.
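For what the lean custom format might look like, here is a hedged sketch of a tag-length-payload record stream; the header layout and tag values are invented.

    #include <cstdint>
    #include <cstdio>

    // Invented packet layout: a small fixed header per record, followed
    // by the payload bytes. MATLAB can replay the stream with fread()
    // as long as this layout is documented somewhere (mind struct
    // padding if you read it from another language).
    struct PacketHeader {
        uint16_t tag;       // packet type, e.g. 1 = sample, 2 = reset flag
        uint16_t length;    // number of payload bytes that follow
        double   timestamp; // records may arrive at irregular intervals
    };

    void write_packet(std::FILE* f, uint16_t tag, double t,
                      const void* payload, uint16_t len) {
        PacketHeader h{tag, len, t};
        std::fwrite(&h, sizeof h, 1, f);
        std::fwrite(payload, 1, len, f);
    }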
It is probably better to use an existing file format rather than a custom one. First, you don't reinvent the wheel; second, you benefit from a well-tested and optimized library.
HDF5 seems like a good bet. It is fast and reliable, and easy to read from MATLAB. It has some overhead, but that overhead is what buys its flexibility and compatibility.
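For irregular streams specifically, HDF5's chunked datasets with an unlimited dimension are the usual mechanism for appending records of unknown final count. A rough sketch using the HDF5 C API, with error checking omitted:

    #include <hdf5.h>

    // Create a 1-D dataset that can grow without bound.
    hid_t make_stream(hid_t file) {
        hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED};
        hsize_t chunk[1] = {4096};
        hid_t space = H5Screate_simple(1, dims, maxdims);
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);  // unlimited dims require chunking
        hid_t dset = H5Dcreate2(file, "samples", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl); H5Sclose(space);
        return dset;
    }

    // Append n doubles; 'written' tracks how many are already stored.
    void append(hid_t dset, const double* data, hsize_t n, hsize_t& written) {
        hsize_t newsize[1] = {written + n};
        H5Dset_extent(dset, newsize);  // grow the dataset
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[1] = {written}, count[1] = {n};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(1, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, data);
        H5Sclose(mspace); H5Sclose(fspace);
        written += n;
    }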
This requirement sounds suspiciously like a "database"
This data is stored in an array (in C++) and is a repetition of 125-bit records, each varying slightly from the others. It also has 8 messages of 12 ASCII characters each at the end. Please suggest whether I should use differential compression within the array and, if so, how.
Or should I apply some other compression scheme to the whole array?
Generally you can compress data that has some sort of predictability or redundancy. Dictionary-based compression (e.g. ZIP-style algorithms) traditionally doesn't work well on small chunks of data because of the need to share the selected dictionary.
In the past, when I have compressed very small chunks of data with somewhat predictable patterns, I have used SharpZipLib with a custom dictionary. Rather than embed the dictionary in the actual data, I hard-coded the dictionary in every program that needs to (de)compress the data. SharpZipLib gives you both options: custom dictionary, and keep dictionary separate from the data.
Again, this will only work well if you can predict some patterns in your data ahead of time so that you can create an appropriate compression dictionary, and if it's feasible for the dictionary itself to be kept separate from the compressed data.
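The same idea is available from C++ through zlib's preset-dictionary support. A hedged sketch; the dictionary contents would be whatever byte patterns you expect to recur in your data.

    #include <cstring>
    #include <zlib.h>

    // One-shot compression of a small buffer with a hard-coded preset
    // dictionary. The decompressor must supply the same bytes via
    // inflateSetDictionary.
    int compress_with_dict(const unsigned char* in, unsigned in_len,
                           unsigned char* out, unsigned out_cap,
                           const unsigned char* dict, unsigned dict_len) {
        z_stream zs;
        std::memset(&zs, 0, sizeof zs);
        if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK) return -1;
        deflateSetDictionary(&zs, dict, dict_len);  // shared out of band
        zs.next_in = const_cast<unsigned char*>(in);
        zs.avail_in = in_len;
        zs.next_out = out;
        zs.avail_out = out_cap;
        int rc = deflate(&zs, Z_FINISH);
        int produced = (rc == Z_STREAM_END)
                           ? static_cast<int>(out_cap - zs.avail_out) : -1;
        deflateEnd(&zs);
        return produced;  // compressed size, or -1 on failure
    }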
You haven't given us enough information to help you. However, I can highly recommend the book Text Compression by Bell, Cleary, and Witten. Don't be fooled by the title; "Text" here just means "lossless", and all the techniques apply to binary data. Because the book is expensive, you might try to get it on interlibrary loan.
Also, don't overlook the obvious Burrows-Wheeler (bzip2) or Lempel-Ziv (gzip, zlib) techniques. It's quite possible that one of these techniques will work well for your application, so before investigating alternatives, try compressing your data with standard tools.
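A quick way to run that experiment from code, sketched with zlib's one-shot API:

    #include <vector>
    #include <zlib.h>

    // Rough ratio check: if plain deflate already shrinks the data well,
    // a custom scheme may not be worth the effort.
    double deflate_ratio(const unsigned char* data, uLong n) {
        uLongf cap = compressBound(n);
        std::vector<unsigned char> out(cap);
        if (compress2(out.data(), &cap, data, n, Z_BEST_COMPRESSION) != Z_OK)
            return -1.0;
        return static_cast<double>(cap) / static_cast<double>(n);
    }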