How to use array optimization in boost serialization - c++

I have to serialize an object that contains a std::vector<unsigned char> that can contain thousands of members; at those vector sizes the serialization doesn't scale well.
According to the documentation, Boost provides a wrapper class, array, that wraps the vector for optimization, but it generates the same XML output. Diving into the Boost code, I've found a class named use_array_optimization that seems to control the optimization, but it is somehow deactivated by default. I've also tried to override the serialize function, with no results.
I would like to know how to activate that optimization, since the Boost documents are unclear.

The idea behind the array optimization is that, for arrays of types that can be archived by simply "dumping" their representation as-is to the archive, "dumping" the whole array at once is faster than "dumping" one element after the other.
I understand from your question that you are using the XML archive. The array optimization does not apply in that case, because serializing the elements implies a transformation anyway.

Finally, I used the BOOST_SERIALIZATION_SPLIT_MEMBER() macro and coded two functions for loading and saving. The save function looks like:
template<class Archive>
void save(Archive& ar, const unsigned int version) const
{
    using boost::serialization::make_nvp;
    std::string sdata;
    Vector2String(vData, sdata);
    ar & make_nvp("vData", sdata);
}
The Vector2String function simply takes the data in the vector and formats it into a std::string. The load function uses a counterpart function that reverses the encoding.

You have several ways to serialize a vector with Boost Serialization to XML.
From what I read in the comments, you are looking for Case 2 below.
I don't think you can change how std::vector is serialized by the library after including boost/serialization/vector.hpp; however, you can replace that code with your own, along the lines of Case 2.
0. Library default, not optimized
The first is to use the default given by the library, which as far as I know won't optimize anything:
#include <boost/serialization/vector.hpp>
...
std::vector<double> vec(4);
std::iota(begin(vec), end(vec), 0);
std::ofstream ofs{"default.xml"};
boost::archive::xml_oarchive xoa{ofs, boost::archive::no_header};
xoa << BOOST_NVP(vec);
output:
<vec>
<count>4</count>
<item_version>0</item_version>
<item>0.00000000000000000e+00</item>
<item>1.00000000000000000e+00</item>
<item>2.00000000000000000e+00</item>
<item>3.00000000000000000e+00</item>
</vec>
1. Manually, exploiting that the data is contiguous
#include <boost/serialization/array_wrapper.hpp> // for make_array
...
std::ofstream ofs{"array.xml"};
boost::archive::xml_oarchive xoa{ofs, boost::archive::no_header};
auto const size = vec.size();
xoa << BOOST_NVP(size) << boost::serialization::make_nvp("data", boost::serialization::make_array(vec.data(), vec.size()));
output:
<size>4</size>
<data>
<item>0.00000000000000000e+00</item>
<item>1.00000000000000000e+00</item>
<item>2.00000000000000000e+00</item>
<item>3.00000000000000000e+00</item>
</data>
2. Manually, exploiting that the data is binary and contiguous
#include <boost/serialization/binary_object.hpp>
...
std::ofstream ofs{"binary.xml"};
boost::archive::xml_oarchive xoa{ofs, boost::archive::no_header};
auto const size = vec.size();
xoa << BOOST_NVP(size) << boost::serialization::make_nvp("binary_data", boost::serialization::make_binary_object(vec.data(), vec.size()*sizeof(double)));
output:
<size>4</size>
<binary_data>
AAAAAAAAAAAAAAAAAADwPwAAAAAAAABAAAAAAAAACEA=
</binary_data>
I think this makes the XML technically not portable, since the bytes are dumped in the machine's native representation.

Related

saving object information into a binary file

I'm trying to save all the member variables of an object in a binary file. However, the member variables are vectors that are dynamically allocated. So, is there any way to combine all the data and save it in a binary file? As of now, it just saves the pointers, which is of little help. Following is my running code.
#include <vector>
#include <iostream>
#include <fstream>

class BaseSaveFile {
protected:
    std::vector<float> first_vector;
public:
    void fill_vector(std::vector<float> fill) {
        first_vector = fill;
    }
    void show_vector() {
        for (auto x : first_vector)
            std::cout << x << std::endl;
    }
};

class DerivedSaveFile : public BaseSaveFile {
};

int main(int argc, char** argv) {
    DerivedSaveFile derived;
    std::vector<float> fill;
    for (auto i = 0; i < 10; i++) {
        fill.push_back(i);
    }
    derived.fill_vector(fill);
    derived.show_vector();
    std::ofstream save_object("../save_object.bin", std::ios::out | std::ios::binary);
    save_object.write((char*)&derived, sizeof(derived));
}
Currently the size of the binary file is just 24 bytes, but I was expecting much more because of the vector of 10 floats.
"is there any way to combine all the data and save it in a binary file" - of course there is. You write code to iterate over all the data and convert it into a form suitable for writing to a file (one that you know how to parse when reading it back in). Then you write code to read the file, parse it into meaningful variables, and construct new objects from the read-in data. There's no built-in facility for it, but it's not rocket science - just a bunch of work/code you need to do.
It's called serialisation/de-serialisation btw, in case you want to use your preferred search engine to look up more details.
The problem
You can write the exact binary content of an object to a file:
save_object.write((char*)&derived, sizeof(derived));
However, it is not guaranteed that you can read it back into memory with the reverse read operation. This is only possible for the small subset of objects that have a trivially copyable type and do not contain any pointers.
You can verify if your type matches this definition with std::is_trivially_copyable<BaseSaveFile>::value but I can already tell you that it's not because of the vector.
To simplify the formal definition a bit, trivially copyable types are more or less the types composed only of other trivially copyable elements and very elementary data types such as int, float, char, or fixed-size arrays.
The solution: introduction to serialization
The general solution, as mentioned in the other response, is called serialization. But for a more tailored answer, here is how it would look.
You would add the following public method to your type:
std::ostream& save(std::ostream& os){
    size_t vsize = first_vector.size();
    os.write((char*)&vsize, sizeof(vsize));
    os.write((char*)first_vector.data(), vsize*sizeof(float));
    return os;
}
This method has access to all the members and can write them to the disk. For the case of the vector, you'd first write down its size (so that you know how big it is when you'll read the file later on).
You would then add the reverse method:
std::istream& load(std::istream& is){
    size_t vsize;
    if(is.read((char*)&vsize, sizeof(vsize))) {
        first_vector.resize(vsize);
        is.read((char*)first_vector.data(), vsize*sizeof(float));
    }
    return is;
}
Here the trick is to first read the size of the vector on disk, and then resize the vector before loading it.
Note the use of istream and ostream. This lets you store the data in a file, but you could use any other kind of stream, such as an in-memory string stream, if you want.
Here's a full online example (it uses stringstream because the online service doesn't allow files to be written).
More serialization ?
There are some serialization tricks to know. First, if you have derived types, you'd need to make load() and save() virtual and provide the derived types with their own overridden version.
If one of your data members is not trivially copyable, it will need its own load() and save() that you can then invoke recursively. Otherwise you'd need to handle it yourself, which is only possible if you can access all the members required to restore its state.
Finally, you don't need to reinvent the wheel. There are libraries out there that may help, like Boost Serialization or cereal.

Decreasing the overall computation time

So I have a computationally heavy C++ function that extracts numbers from a file and puts them into a vector. When I run this function in main, it takes a lot of time. Is it possible to somehow have this function computed once and then linked into the main program, so I can save precious computation time every time I run it?
The function I have is this:
vector<double> extract(vector<double> foo)
{
    ifstream wlm;
    wlm.open("wlm.dat");
    if (wlm.is_open())
    {
        while (!wlm.eof())
        {
            //blah blah extraction stuff
        }
        return foo;
    }
    else
        cout << "File didn't open" << endl;
    wlm.close();
}
And my main program has other stuff which I compute over there. I don't want to call this function from the main program because it will take a long time. Instead I want the vector to be extracted beforehand during compile time so I can use the extracted vector later in my main program. Is this possible?
Change your function to this:
std::vector<double>& extract(std::vector<double>& foo)
So you will not copy the vector twice (I guess that eats most of the time).
Try to reserve() memory for your vector according to the file data (if that is possible; it will let you avoid reallocations).
You should always return the std::vector<double>, not just on success.
You should close the file only if it was successfully opened.
Something like that:
std::vector<double>& extract(std::vector<double>& foo)
{
    ifstream wlm;
    wlm.open("wlm.dat");
    if (wlm.is_open())
    {
        while (!wlm.eof())
        {
            //blah blah extraction stuff
        }
        wlm.close();
    }
    else
        cout << "File didn't open" << endl;
    return foo;
}
While your question was not entirely clear, I assume that you want to:
compute a vector of doubles from a large set of data
use this computed (smaller) set of data in your program
do the computation at compile time
This is possible of course, but you will have to leverage whatever build system you are using. Without more specifics, I can only give a general answer:
Create a helper program that you can invoke during compilation. This program should implement the extract function and dump the result into a file. You have two main choices here: go for a resource file that can be embedded into the executable, or generate source code that contains the data. If the data is not terribly large, I suggest the latter.
Use the generated file in your program
For example:
Pre-build step: extract_data.exe extracted_data_generated
This dumps the extracted data into a header and a source file, such as:
// extracted_data_generated.h
#pragma once
#include <array>
extern const std::array<double, 4> extracted;

// extracted_data_generated.cpp
#include "extracted_data_generated.h"
const std::array<double, 4> extracted{ { 1.2, 3.4, 5.6, 6.7 } }; // etc.
In other parts of your program, use the generated data
#include "extracted_data_generated.h"
// you have extracted available as a variable here.
I also changed to a std::array, whose size you will know in your helper program because you know the size of the vector.
The resource route is similar, but you would have to implement platform-specific extraction of the resource and reading of the data. So unless your computed data is very large, I'd suggest code generation.

ostream default parameter in template function

In the book "Essential C++" (more specifically, part 2.7), the author briefly discusses the usage of template functions with the following example, which displays a diagnostic message and then iterates through the elements of a vector:
template <typename T>
void display_message(const string& msg, const vector<T>& vec)
{
    cout << msg;
    for (int i = 0; i < vec.size(); ++i)
        cout << vec[i] << ' ';
}
So, this example got me interested, because I (like many other hobbyist developers, probably) have always taken for granted that in most applications the standard input/output streams are used for communication and data processing. The author then mentions that this way of implementing display_message is more flexible. Can you give me an example of a situation where this flexibility "shines", so to speak? In other words, is there a case where the optional third parameter takes on another input/output representation (say, an embedded device), or is it just a simple addition that is supposed to be used with, well, simple constructions instead of the extreme situations I am trying to describe?
EDIT: As @Matteo Italia noticed, this is the function declaration:
void display_message(const string&, const vector<T>&, ostream& = cout);
You are confusing two "flexibilities" available in this function.
the template part (which I think is the one the author is talking about) allows you to pass any std::vector<T>, given that T can be output on the stream; i.e. you can pass a vector of integers, doubles, or even of your custom objects, and the function will happily output it on the given stream [1];
the stream part (which caught your attention) is instead to allow you to specify any (narrow) output stream for the output part; it's useful because you may want to output your message (and your vector) on some other streams; for example, if it's an error message you'll want cerr; and, most importantly, if you are writing to file, you'll pass your file stream.
Notes
[1] notice that in more "STL-like" [2] interfaces you typically won't receive a vector like that, but more probably a pair of iterators. Actually, the standard library prefers an even more abstract way to solve this problem (std::ostream_iterator, which allows you to use std::copy to copy data from container iterators to the output stream);
[2] to nitpickers: I know, and you won't convince me.

Compile-time population of data structures other than arrays?

In C++, you can do this:
static const char * fish[4] = {
    "One fish",
    "Two fish",
    "Red fish",
    "Blue fish"
};
... and that gives you a nice read-only array data-structure that doesn't take any CPU cycles to initialize at runtime, because all the data has been laid out for you (in the executable's read-only memory pages) by the compiler.
But what if I'd rather be using a different data structure instead of an array? For example, if I wanted my data structure to have fast lookups via a key, I'd have to do something like this:
static std::map<int, const char *> map;

int main(int, char **)
{
    map.insert(std::make_pair(555, "One fish"));
    map.insert(std::make_pair(666, "Two fish"));
    map.insert(std::make_pair(451, "Red fish"));
    map.insert(std::make_pair(626, "Blue fish"));
    [... rest of program here...]
}
... which is less elegant and less efficient, as the map is populated at run-time even though all the necessary data was known at compile time, and therefore that work could (theoretically) have been done then.
My question is, is there any way in C++ (or C++11) to create a read-only data structure (such as a map) whose data is entirely set up at compile time and thus pre-populated and ready to use at run-time, the way an array can be?
If you want a map (or set), consider instead using a binary tree stored as an array. You can assert that it's ordered properly at runtime in debug builds, but in optimized builds you can just assume everything is properly arranged, and then can do the same sorts of binary search operations that you would in std::map, but with the underlying storage being an array. Just write a little program to heapify the data for you before pasting it into your program.
Not easily, no. If you tried to do your first example using malloc, it obviously wouldn't work at compile time. Since every standard container allocates with new (well, std::allocator<T>::allocate(), but we'll pretend it's new for now), we cannot do this at compile time.
That having been said, it depends on how much pain you are willing to go through, and how much you want to push back to compile time. You certainly cannot do this using only standard library features. Using boost::mpl on the other hand...
#include <iostream>
#include "boost/mpl/map.hpp"
#include "boost/mpl/for_each.hpp"
#include "boost/mpl/string.hpp"
#include "boost/mpl/front.hpp"
#include "boost/mpl/has_key.hpp"

using namespace boost::mpl;

int main()
{
    typedef string<'One ', 'fish'> strone;
    typedef string<'Two ', 'fish'> strtwo;
    typedef string<'Red ', 'fish'> strthree;
    typedef string<'Blue', 'fish'> strfour;
    typedef map<pair<int_<555>, strone>,
                pair<int_<666>, strtwo>,
                pair<int_<451>, strthree>,
                pair<int_<626>, strfour>> m;
    std::cout << c_str<second<front<m>::type>::type>::value << "\n";
    std::cout << has_key<m, int_<666>>::type::value << "\n";
    std::cout << has_key<m, int_<111>>::type::value << "\n";
}
It's worth mentioning that your problem stems from the fact you are using a map.
Maps are often overused.
The alternative to a map is a sorted vector/array. Maps only become "better" than sorted vectors when used to store data of unknown length, or (and only sometimes) when the data changes frequently.
The functions std::sort and std::lower_bound/std::upper_bound are what you need.
If you can sort the data yourself, you only need one function, lower_bound, and the data can be const.
Yes, C++11 allows brace initializers:
std::map<int, const char *> map = {
    { 555, "One fish" },
    { 666, "Two fish" },
    // etc
};

boost::serialization to serialize only the keys of a map

I have a class with a map, and I want to serialize the class using boost serialize.
std::map<int, ComplicatedThing> stuff;
ComplicatedThing can be derived just by knowing the int. I want to serialize this efficiently. One way (ick, but it works) is to make a vector of the keys and serialize that vector.
// illustrative, not test-compiled
std::vector<int> v;
for (const auto& kv : stuff)
    v.push_back(kv.first);            // collect the keys
// ...and, after loading, rebuild the map from the keys
for (std::vector<int>::iterator it = v.begin(); it != v.end(); ++it)
    stuff[*it] = ComplicatedThing(*it);
// ...and at serialize/deserialize time
template<class Archive>
void srd::leaf::serialize(Archive& ar, const unsigned int version)
{
    ar & v;
}
But this is inelegant. Using BOOST_SERIALIZATION_SPLIT_MEMBER() and load/save methods, I think I should be able to skip the allocation of the intermediate vector completely. And there I am stuck.
Perhaps my answer lies in understanding boost/serialization/collections_load_imp.hpp. Hopefully there is a simpler path.
You can serialize it as a list of ints (I don't mean std::list) instead of serializing it as a container (map or vector): first write the number of elements, then the elements one by one, and deserialize accordingly. It's a 10-minute task. If you need this solution in many places, wrap the map in your own class and define serialization for it.
If you want to make it not look clumsy, use range adaptors:
ar & (stuff | transformed(boost::bind(&map_type::value_type::first, _1)));
Or if you include the appropriate headers, I suppose you could reduce this to
ar & (stuff | transformed(&map_type::value_type::first))
Disclaimer
All of this assumes that Boost Serialization ships with serializers for Boost Range (I haven't checked).
This might not work well in a bidirectional serialize setting (you'll want to read http://www.boost.org/doc/libs/1_46_1/libs/serialization/doc/serialization.html#splitting)
I haven't brought the above into the vicinity of a compiler