Adjusting granularity in tbb parallel_pipeline - c++

The task for the pipeline is the following:
sequentially read a huge number (10-15k) of ~100-200 MB compressed files
decompress each file in parallel
deserialize each decompressed file in parallel
process the resulting deserialized objects and compute values based on all objects (mean, median, groupings, etc.)
When I get a decompressed file's memory buffer, the serialized blocks come one after another, so I'd like to pass them to the next filter in the same manner, or at least adjust this process by packing the serialized blocks into groups of some size and passing those. However (as I understand it) tbb_pipeline makes me pass a pointer to the buffer with ALL serialized blocks, because each filter has to take a pointer and return a pointer.
Using a concurrent queue to accumulate packs of serialized objects defeats the purpose of using tbb_pipeline, as I understand it. Moreover, the constness of operator() in filters doesn't allow me to keep my own intermediate "task pool" (nevertheless, if each thread had its own local copy of the storage for "tasks" and just cut the right pieces from it, that would be great).
Primary question:
Is there some way to "adjust" granularity in this situation? (i.e. some filter gets a pointer to all serialized objects and passes small packs of objects to the next filter)
Reformatting (splitting, etc.) the input files is almost impossible.
Secondary question:
When I accumulate processing results, I don't really care about any kind of order; I only need aggregate statistics. Can I use a parallel filter instead of serial_out_of_order and accumulate the processing results for each thread somewhere, and then just merge them?

However (as I understand it) tbb_pipeline makes me pass a pointer to the buffer with ALL serialized blocks, because each filter has to take a pointer and return a pointer.
First, I think it's better to use the more modern, type-safe form of the pipeline: parallel_pipeline. It does not require you to pass any specific pointer to any specific data. You just specify which data of which type is needed for the next stage to be able to process it. So it's rather a matter of how your first filter partitions the data to be processed by the following filters.
Primary question: Is there some way to "adjust" granularity in this situation? (i.e. some filter gets a pointer to all serialized objects and passes small packs of objects to the next filter)
You can safely embed one parallel algorithm into another in order to change the granularity for some stages; e.g. on the top level, the 1st pipeline goes through the file list; the 2nd pipeline reads big blocks of a file on the nested level; and finally, the innermost pipeline breaks the big blocks down into smaller ones for some of the 2nd-level stages. See a general example of nesting below.
Secondary question: Can I use a parallel filter instead of serial_out_of_order and accumulate the processing results for each thread somewhere, and then just merge them?
Yes, you can always use a parallel filter if it does not modify shared data. For example, you can use tbb::combinable in order to collect thread-specific partial sums and then combine them.
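A minimal sketch of that idea, assuming the last (parallel) filter receives some deserialized object obj and that extract_value() stands for whatever quantity you actually accumulate (both names are made up here):
#include <functional>
#include <tbb/combinable.h>

tbb::combinable<double> partial_sum;                 // one accumulator per worker thread

// inside the body of the parallel final filter:
//     partial_sum.local() += extract_value(obj);    // no locking needed

// after parallel_pipeline() has returned:
//     double total = partial_sum.combine(std::plus<double>());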
but nevertheless if each thread had its own local copy of the storage for "tasks" and just cut the right pieces from it, that would be great
Yes, they do: each thread has its own local pool of tasks.
General example of nested parallel_pipelines
// 'files' is assumed to be something like a std::queue<std::string> of input file names.
parallel_pipeline( 2/*only two files at once*/,
    make_filter<void,std::string>(
        filter::serial,
        [&](flow_control& fc)-> std::string {
            if( !files.empty() ) {
                std::string filename = files.front();
                files.pop();
                return filename;
            } else {
                fc.stop();
                return "stop";
            }
        }
    ) &
    make_filter<std::string,void>(
        filter::parallel,
        [](std::string s) {
            // a nested pipeline
            parallel_pipeline( 1024/*token limit for the nested pipeline*/,
                make_filter<void,char>(
                    filter::serial,
                    [&s](flow_control& fc)-> char {
                        if( !s.empty() ) {
                            char c = s.back();
                            s.pop_back();
                            return c;
                        } else {
                            fc.stop();
                            return 0;
                        }
                    }
                ) &
                make_filter<char,void>(
                    filter::parallel,
                    [](char c) {
                        putc(c, stdout);
                    }
                )
            );
        }
    )
);

Related

How to buffer efficiently when writing to 1000s of files in C++

I am quite inexperienced when it comes to C++ I/O operations especially when dealing with buffers etc. so please bear with me.
I have a programme that has a vector of objects (1000s - 10,000s). At each time-step the state of the objects is updated. I want to have the functionality to log a complete state time history for each of these objects.
Currently I have a function that loops through my vector of objects, updates the state, and then calls a logging function which opens the file (ASCII) for that object, writes the state to the file, and closes the file (using std::ofstream). The problem is that this significantly slows down my run time.
I've been recommended a couple things to do to help speed this up:
Buffer my output to prevent extensive I/O calls to the disk
Write to binary not ascii files
My question mainly concerns 1. Specifically, how would I actually implement this? Would each object effectively require its own buffer? Or would this be a single buffer that somehow knows which file to send each bit of data to? If the latter, what is the best way to achieve this?
Thanks!
Maybe the simplest idea first: instead of logging to separate files, why not log everything to an SQLite database?
Given the following table structure:
create table iterations (
    id integer not null,
    iteration integer not null,
    value text not null
);
At the start of the program, prepare a statement once:
sqlite3_stmt *stmt;
sqlite3_prepare_v3(db, "insert into iterations values(?,?,?)", -1, SQLITE_PREPARE_PERSISTENT, &stmt, NULL);
The question marks here are placeholders for future values.
After every iteration of your simulation, you could walk your state vector and execute the stmt a number of times to actually insert rows into the database, like so:
for (int i = 0; i < objects.size(); i++) {
    sqlite3_reset(stmt);
    // Fill in the three placeholders and execute the query.
    sqlite3_bind_int(stmt, 1, i);
    sqlite3_bind_int(stmt, 2, current_iteration); // Could be done once, but here for illustration.
    std::string state = objects[i].get_state();
    sqlite3_bind_text(stmt, 3, state.c_str(), state.size(), SQLITE_STATIC); // SQLITE_STATIC means "no need to free this"
    sqlite3_step(stmt); // Execute the query.
}
You can then easily query the history of each individual object using the SQLite command-line tool or any database manager that understands SQLite.
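One further speed-up worth noting (not shown above, and assuming a plain sqlite3* handle named db): wrap each time-step's inserts in a single transaction, so SQLite commits once per step instead of once per row. A minimal sketch:
char* errmsg = NULL;
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &errmsg);
// ... the reset/bind/step loop shown above ...
sqlite3_exec(db, "COMMIT", NULL, NULL, &errmsg);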

Seek in libarchive, how to reset header?

Is it possible to read a decompressed file once again? Let's imagine I used archive_read_next_header(a, &entry), and then read an unknown number of bytes using archive_read_data(a, ptr_to_buffer, buffer_size). Now I want to reset this and start reading again from the beginning. I am trying to override seekoff(std::streamoff off, std::ios_base::seekdir way, std::ios_base::openmode which). I understand that it might be impossible to just seek inside the decompressed data because of the inner workings of compression algorithms, and the data is not stored anywhere except for a limited number of bytes in libarchive's internal buffer.
The idea is to just reset everything and then read std::streamoff off bytes; that way I could implement a backward seek. A forward seek would be easy: just read std::streamoff off bytes. It's really inefficient, but let's hope seek won't be used much.
The whole archive structure was initialized this way:
archive_read_set_read_callback(a, read_callback);
archive_read_set_callback_data(a, container);
archive_read_set_seek_callback(a, seek_callback);
archive_read_set_skip_callback(a, skip_callback);
int r = (archive_read_open1(a));
where container contains, most importantly, a std::istream, and the callbacks are functions which manipulate that stream.
Template of what I would like to achieve:
std::streampos seek_beg(std::streamoff off) {
    if (off >= 0) {
        // read/skip 'off' bytes
    } else {
        // reset (a)
        // read/skip 'off' bytes
    }
    // return position
}
Also, my underflow() method is implemented this way:
int underflow() {
    int r = archive_read_data(ar, ptr, BUFFER_SIZE);
    if (r < 0) {
        throw std::runtime_error("ERROR");
    } else if (r == 0) {
        return std::streambuf::traits_type::eof();
    } else {
        setg(ptr, ptr, ptr + r);
    }
    return std::streambuf::traits_type::to_int_type(*ptr);
}
The libarchive documentation, or more precisely the wishlist in the libarchive wiki on GitHub, says:
A few people have asked for the ability to efficiently "re-read"
particular archive entries. This is a tricky subject. For many
formats, the performance gains from this would be very modest. For
example, with a little performance work, the seeking Zip reader could
support very fast re-reading from the beginning since it only involves
re-parsing the central directory. The cases where there would be real
gains (e.g., tar.gz) are going to be very difficult to handle. The
most likely implementation would be some form of checkpointing so that
clients can explicitly ask for a checkpoint object and then restore
back to that checkpoint. The checkpoint object could be complex if you
have a series of stacked read filters plus state in the format handler
itself.
As far as I can see, seeking in archives is not currently possible with libarchive, so the solution to my problem was to remember all the data I had read (but only when I suspected I would want to re-read it), and alternatively to push it back into the stream.
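For reference, a rough sketch of the "reset it all" fallback, reusing the callbacks from the question; everything else (the function name, error handling) is an assumption:
#include <archive.h>
#include <archive_entry.h>

// Throw away the current handle and start decompressing from the beginning again.
struct archive* reopen_archive(struct archive* a, void* container) {
    archive_read_free(a);                               // discards all decompression state
    struct archive* na = archive_read_new();
    archive_read_support_filter_all(na);
    archive_read_support_format_all(na);
    archive_read_set_read_callback(na, read_callback);  // same callbacks as in the question
    archive_read_set_callback_data(na, container);
    archive_read_set_seek_callback(na, seek_callback);
    archive_read_set_skip_callback(na, skip_callback);
    // The std::istream inside 'container' must be rewound first (clear() + seekg(0)).
    if (archive_read_open1(na) != ARCHIVE_OK)
        return nullptr;
    struct archive_entry* entry;
    archive_read_next_header(na, &entry);               // positioned at the first entry again
    return na;
}
After this, forward-skipping 'off' bytes with archive_read_data() into a throwaway buffer gives the backward seek described in the question.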

C++ Read specific parts of a file with start and endpoint

I am serializing multiple objects and want to save the given Strings to a file. The structure is the following:
A few string and long attributes, then a variable number of map<long, map<string, variant>> objects. My first idea was to create one valid JSON file, but this is very hard to do (all of the maps are very big and my temporary memory is not big enough). Since I can't serialize everything together, I have to do it piece by piece. I am planning on doing that, and I then want to save the received strings to a file. Here is how it will look:
{ "Name": "StackOverflow"}
{"map1": //map here}
{"map2": //map here}
As you can see, this is not one valid JSON object but 3 valid JSON objects in one file. Now I want to deserialize, and I need to give a valid JSON object to the deserializer. I already save tellp() every time I write a new JSON object to the file, so in this example I would have the following addresses saved: 26, endofmap1, endofmap2.
Here is what I want to do: I want to use these addresses to extract the strings from the file I wrote to. I need one string from 0 to (26-1), one from 26 to (endofmap1-1), and one from endofmap1 to (endofmap2-1). Since these strings would be valid JSON objects, I could deserialize them without problem.
How can I do this?
I would create a serialize and a deserialize class that you can use as part of a hierarchy.
So, for instance, in rough C++ pseudo-code:
class Object : public serialize, public deserialize {
public:
    int a;
    float b;
    Compound c;

    bool serialize(fstream& fs) {
        fs << a;
        fs << b;
        c.serialize(fs);   // Compound is a value, not a pointer
        fs.flush();
        return true;
    }
    // same for deserialize
};
class Compound : public serialize, public deserialize {
public:
    map<string, string> things;   // whatever the element type actually is

    bool serialize(fstream& fs) {
        for (const auto& thing : things) {
            fs << thing.first << thing.second;
        }
        fs.flush();
        return true;
    }
};
With this you can use JSON, as the file will be written as you walk the hierarchy.
Update:
To extract a specific string from a file you can use something like this:
// pass in an open stream (streams are good for unit testing!)
std::string extractString(std::fstream& fs, std::streampos location, std::size_t length) {
    std::string str;
    str.resize(length);
    fs.seekg(location);           // seekg, because we are reading
    fs.read(&str[0], length);     // fs is a reference, so use '.' rather than '->'
    return str;
}
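With the offsets from the question (26, endofmap1, endofmap2 being the saved tellp() values), a hypothetical call site could look like this (the file name is made up):
std::fstream fs("objects.json", std::ios::in | std::ios::binary);
std::string header = extractString(fs, 0, 26);                      // { "Name": "StackOverflow"}
std::string map1   = extractString(fs, 26, endofmap1 - 26);         // {"map1": ...}
std::string map2   = extractString(fs, endofmap1, endofmap2 - endofmap1);
// each string is a complete JSON object and can be handed to the deserializer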
Based on you saying "my temporary memory is not big enough", I'm going to assume two possibilities (though some kind of code example may help us help you!).
possibility one, the file is too big
The issue you would be facing here isn't a new one: a file too large for memory. This assumes your algorithm isn't buffering all the data, and that your stack can handle the recursion, of course.
On Windows you can use the MapViewOfFile function; MSDN has plenty of detail on that. This function will effectively grab a "view" of a section of a file, allowing you to load just enough of the file to modify what you need, before closing that view and opening another at a later offset.
If you are on a different platform, there will be similar functions.
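For what it's worth, here is a rough, read-only Windows sketch (untested, error handling trimmed; the offset handed to MapViewOfFile must be a multiple of the system allocation granularity):
#include <windows.h>

// Map 'length' bytes of 'path' starting at 'offset' into memory for reading.
const char* map_slice(const char* path, unsigned long long offset, SIZE_T length,
                      HANDLE* out_file, HANDLE* out_mapping) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) { CloseHandle(file); return NULL; }
    const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                                  (DWORD)(offset >> 32),
                                                  (DWORD)(offset & 0xFFFFFFFF),
                                                  length);
    *out_file = file;
    *out_mapping = mapping;
    return view;   // call UnmapViewOfFile(view), CloseHandle(mapping), CloseHandle(file) when done
}
For modifying the file in place, the same calls work with PAGE_READWRITE and FILE_MAP_WRITE.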
possibility two, you are doing too much at once
The other option is more of a "software engineering" issue. You have so much data that, when holding it all in your std::maps, you run out of heap memory.
If this is the case, you are going to need to use some clever thinking - here are some ideas!
Don't load all your data into the maps. Wherever the data is coming from, take a CRC, an index, or the filename of the data source. Store that information in the map, and leave the actual "big strings" on the hard disk. This way you can load each item of data when you need it.
This works really well for data that needs to be sorted, or correlated.
Process or load your data when you need to write it. If you don't need to sort or correlate the data, why load it into a map beforehand at all? Just load each "big string" of data in sequence, then write them to the file with an ofstream.
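A bare-bones version of that second idea (the function and file names are placeholders): produce one piece at a time and append it, so only a single piece is ever held in memory:
#include <fstream>
#include <string>

void append_piece(const std::string& path, const std::string& piece) {
    std::ofstream out(path, std::ios::binary | std::ios::app);  // reopen in append mode
    out << piece;                                               // only this piece is in memory
}   // destructor flushes and closes the file
Keeping one std::ofstream open for the whole run instead of reopening it per piece also avoids the open/close cost mentioned in the question.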

Can protobuf read partially?

I want to save my terrain data to a file and load only some parts of it, because it's just too big to store in memory as a whole. Actually, I don't even know whether protobuf is good for this purpose.
For example, I would have a structure like this (it might be grammatically invalid; I only know the simple basics):
message Quad {
    required int32 x = 1;
    required int32 z = 2;
    repeated int32 y = 3;
}
The x and z values are available in my program, and by using them I would like to find the Quad object in the file with the same x and z, in order to obtain its y values. However, I can't just parse the file with ParseFromIstream(), because (I think) it loads the whole file into memory, and in my case the file is just too big.
So, is protobuf able to load one object, hand it to me for checking, and if the object is wrong, give me the next one?
Actually... I could just ask: does ParseFromIstream() load the whole file into memory?
While some libraries do allow you to read files partially, the technique recommended by Google is to simply have the file consist of multiple messages:
https://developers.google.com/protocol-buffers/docs/techniques
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if
you are dealing in messages larger than a megabyte each, it may be time to consider an
alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data
set. Usually, large data sets are really just a collection of small pieces, where each small
piece may be a structured piece of data.
So you could just write a long sequence of Quad messages to the file, delimited by the lengths of the messages. If you need to seek randomly to specific Quads, you may want to add some kind of an index.
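In C++, a rough sketch of that length-delimited layout might look like the following (quad.pb.h, the helper names, and the file layout are assumptions; only the protobuf calls themselves are real API):
#include <fstream>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include "quad.pb.h"                                        // generated from the Quad message

using google::protobuf::io::CodedInputStream;
using google::protobuf::io::CodedOutputStream;

// Write one Quad, prefixed with its size encoded as a varint.
void WriteQuad(const Quad& q, CodedOutputStream* out) {
    out->WriteVarint32(static_cast<google::protobuf::uint32>(q.ByteSizeLong()));
    q.SerializeToCodedStream(out);
}

// Read one length-prefixed Quad; returns false at end of stream.
bool ReadQuad(Quad* q, CodedInputStream* in) {
    google::protobuf::uint32 size;
    if (!in->ReadVarint32(&size)) return false;
    CodedInputStream::Limit limit = in->PushLimit(size);    // cap the parser at this message
    bool ok = q->ParseFromCodedStream(in);
    in->PopLimit(limit);
    return ok;
}

// Scan the file for the Quad with the wanted x/z without loading the whole file.
bool FindQuad(const char* path, int wanted_x, int wanted_z, Quad* result) {
    std::ifstream file(path, std::ios::binary);
    google::protobuf::io::IstreamInputStream raw(&file);
    for (;;) {
        // A fresh CodedInputStream per message sidesteps its total-bytes limit on huge files;
        // its destructor returns any unread buffered bytes to 'raw'.
        CodedInputStream coded(&raw);
        if (!ReadQuad(result, &coded)) return false;
        if (result->x() == wanted_x && result->z() == wanted_z) return true;
    }
}
If scanning like this is too slow, write a small index (x, z, byte offset) alongside the data, as suggested above.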
This depends on which implementation you are using. Some have "read as a sequence" APIs. For example, assuming you stored it as a "repeated Quad", then with protobuf-net that would be:
int x = ..., z = ...;
var found = Serializer.DeserializeItems<Quad>(source)
                      .Where(q => q.x == x && q.z == z);
The point being: it yields a spooling (not loaded all at once) and short-circuiting sequence.
I don't know the C++ API specifically, but I would hope it has something similar; worst case, you could parse the varint headers yourself and prepare a length-capped stream.

What is the best way to return an image or video file from a function using c++?

I am writing a C++ library that fetches and returns either image data or video data from a cloud server using libcurl. I've started writing some test code, but I'm still stuck on designing the API because I'm not sure what the best way to handle these media files is. Storing the binary data in a char/string variable seems to work, but I wonder whether that would take up too much RAM if the files are too big. I'm new to this, so please suggest a solution.
You can use something like zlib to compress it in memory, and then uncompress it only when it needs to be used; however, most modern computers have quite a lot of memory, so you can handle quite a lot of images before you need to start compressing. With videos, which are effectively a LOT of images, it becomes a bit more important -- you tend to decompress as you go, and possibly even stream-from-disk as you go.
The usual way to handle this, from an API point of view, is to have something like an Image object and a Video object (classes). These objects would have functions to "get" the uncompressed image/frame. The "get" function would check to see if the data is currently compressed; if it is, it would decompress it before returning it; if it's not compressed, it can return it immediately. The way the data is actually stored (compressed/uncompressed/on disk/in memory) and the details of how to work with it are thus hidden behind the "get" function. Most importantly, this model lets you change your mind later, adding additional types of compression, adding disk-streaming support, etc., without changing how the code that calls the get() function is written.
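As an illustration only (the class, member names, and use of zlib here are assumptions, not an existing API), the "get() decompresses on demand" idea could look roughly like this:
#include <cstddef>
#include <vector>
#include <zlib.h>

class Image {
public:
    // Returns the uncompressed pixel data, decompressing on first use.
    const std::vector<unsigned char>& get() {
        if (pixels_.empty() && !compressed_.empty()) {
            pixels_.resize(uncompressed_size_);
            uLongf out_len = pixels_.size();
            // zlib one-shot decompression; error handling omitted for brevity.
            uncompress(pixels_.data(), &out_len, compressed_.data(), compressed_.size());
            pixels_.resize(out_len);
        }
        return pixels_;
    }

private:
    std::vector<unsigned char> compressed_;   // filled by the libcurl download code
    std::vector<unsigned char> pixels_;       // produced lazily by get()
    std::size_t uncompressed_size_ = 0;       // known from the image header
};
A Video object could do the same per frame, or stream from disk instead of keeping the compressed data in memory at all.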
The other challenge is how you return an Image or Video object from a function. You can do it like this:
Image getImageFromURL( const std::string &url );
But this has the interesting problem that the image may be "copied" during the return process (sometimes; it depends on how the compiler optimizes things). This way is more memory efficient:
void getImageFromURL( const std::string &url, Image &result );
This way, you pass in the image object into which you want your image loaded. No copies are made. You can also change the 'void' return value into some kind of error/status code, if you aren't using exceptions.
If you're not sure what to do, write code both for returning the data in an array and for writing the data to a file ... and pass the responsibility for choosing to the caller. Make your function something like:
/* one of dst and outfile should be NULL */
/* if dst is not NULL, dstlen specifies the size of the array */
/* if outfile is not NULL, data is written to that file */
/* the return value indicates success (0) or reason for failure */
int getdata(unsigned char *dst, size_t dstlen,
const char *outfile,
const char *resource);