Optimizing huge graph traversal with OpenMP - c++

I am trying to optimize this function, which according to the perf tool is the bottleneck preventing me from achieving close to linear scaling. Performance actually gets worse as the number of threads goes up; when I drill down into the assembly generated by perf, it shows that most of the time is spent checking which vertices have and have not been visited. I've done a ton of Google searches to improve the performance, to no avail. Is there a way to make this function faster, or a thread-safe way of implementing it? Thanks for your help in advance!
typedef uint32_t vidType;

template<typename T, typename U, typename V>
bool compare_and_swap(T &x, U old_val, V new_val) {
  return __sync_bool_compare_and_swap(&x, old_val, new_val);
}

template<bool map_vertices, bool map_edges>
VertexSet GraphT<map_vertices, map_edges>::N(vidType vid) const {
  assert(vid >= 0);
  assert(vid < n_vertices);
  eidType begin = vertices[vid], end = vertices[vid+1];
  if (begin > end || end > n_edges) {
    fprintf(stderr, "vertex %u bounds error: [%lu, %lu)\n", vid, begin, end);
    exit(1);
  }
  assert(end <= n_edges);
  return VertexSet(edges + begin, end - begin, vid);
}
void bfs_step(Graph &g, vidType *depth, SlidingQueue<vidType> &queue) {
  #pragma omp parallel
  {
    QueueBuffer<vidType> lqueue(queue);
    #pragma omp for
    for (auto q_iter = queue.begin(); q_iter < queue.end(); q_iter++) {
      auto src = *q_iter;
      for (auto dst : g.N(src)) {
        //int curr_val = parent[dst];
        auto curr_val = depth[dst];
        if (curr_val == MYINFINITY) { // not visited
          //if (compare_and_swap(parent[dst], curr_val, src)) {
          if (compare_and_swap(depth[dst], curr_val, depth[src] + 1)) {
            lqueue.push_back(dst);
          }
        }
      }
    }
    lqueue.flush();
  }
}

First of all, you're using a very traditional formulation of graph algorithms. Good for textbooks, not for computation. If you write this as a generalized matrix-vector product with the adjacency matrix you lose all those fiddly queues and the parallelism becomes quite obvious.
In your formulation, the problem is the push_back on the shared queue, which is hard to parallelize. The solution is to let each thread have its own local queue and then merge them with a reduction. This works if you define the plus operator on your queue object to effect a merge of the local queues.
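As a rough illustration of that idea (a minimal sketch, not the question's SlidingQueue/QueueBuffer types): the per-thread frontier is a plain std::vector<vidType>, an OpenMP declare reduction plays the role of the plus operator by concatenating the thread-local vectors, and a neighbors callback stands in for g.N(). Whether this beats the QueueBuffer approach depends on how much the final concatenation costs relative to the per-level work.

#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint32_t vidType;
const vidType MYINFINITY = 0xFFFFFFFF;

// Concatenating "plus" for frontiers, used as an OpenMP user-defined reduction.
#pragma omp declare reduction(merge : std::vector<vidType> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv = std::vector<vidType>())

// One BFS level: each thread appends newly discovered vertices to its private
// copy of next_frontier; OpenMP concatenates the private copies at the end.
template <class NeighborFn>
std::vector<vidType> bfs_level(const std::vector<vidType> &frontier,
                               vidType *depth, NeighborFn neighbors) {
    std::vector<vidType> next_frontier;
    #pragma omp parallel for reduction(merge : next_frontier)
    for (std::size_t i = 0; i < frontier.size(); ++i) {
        vidType src = frontier[i];
        for (vidType dst : neighbors(src)) {       // neighbors(src) ~ g.N(src)
            vidType curr = depth[dst];
            if (curr == MYINFINITY &&
                __sync_bool_compare_and_swap(&depth[dst], curr, depth[src] + 1))
                next_frontier.push_back(dst);
        }
    }
    return next_frontier;
}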


If find_if() takes too long, are there alternatives that can be used for better program performance?

I'm working on a D* Lite path planner in C++. The program maintains a priority queue of cells (U); each cell has two cost values, and a key can be calculated for a cell which determines its order in the priority queue.
using Cost = float;
using HeapKey = pair<Cost, Cost>;
using KeyCompare = std::greater<std::pair<HeapKey, unsigned int>>;
vector<pair<HeapKey, unsigned int>> U;
When a cell is added it is done so by using:
U.push_back({ k, id });
push_heap(U.begin(), U.end(), KeyCompare());
As part of the path planning algorithm cells sometimes need to be removed, and here lies the current problem as far as I can see. I recently had help on this site to speed my program up quite a bit by using push_heap instead of make_heap, but now it seems that the part of the program that removes cells is the slowest part. Cells are removed from the priority queue by:
void DstarPlanner::updateVertex(unsigned int id) {
...
...
auto it = find_if(U.begin(), U.end(), [=](auto p) { return p.second == id; });
U.erase(it);
...
...
}
From my tests this seems to take roughly 80% of the time my program uses for path planning. It was my hope coming here that a more time-saving method existed.
Thank you.
EDIT - Extra information.
void DstarPlanner::insertHeap(unsigned int id, HeapKey k) {
    U.push_back({ k, id });
    push_heap(U.begin(), U.end(), KeyCompare());
    in_U[id]++;
}

void DstarPlanner::updateVertex(unsigned int id) {
    Cell* u = graph.getCell(id);
    if (u->id != id_goal) {
        Cost mincost = infinity;
        for (auto s : u->neighbors) {
            mincost = min(mincost, graph.getEdgeCost(u->id, s->id) + s->g);
        }
        u->rhs = mincost;
    }
    if (in_U[id]) {
        auto it = find_if(U.begin(), U.end(), [=](auto p) { return p.second == id; });
        U.erase(it);
        in_U[id]--;
    }
    if (u->g != u->rhs) {
        insertHeap(id, u->calculateKey());
    }
}
vector<int> DstarPlanner::ComputeShortestPath() {
    vector<int> bestPath;
    vector<int> emptyPath;
    Cell* n = graph.getCell(id_start);
    while (U.front().first < n->calculateKey() || n->rhs != n->g) {
        auto uid = U.front().second;
        Cell* u = graph.getCell(uid);
        auto kold = U.front().first;
        pop_heap(U.begin(), U.end(), KeyCompare());
        U.pop_back();
        in_U[u->id]--;
        if (kold < u->calculateKey()) {
            insertHeap(u->id, u->calculateKey());
        } else if (u->g > u->rhs) {
            u->g = u->rhs;
            for (auto s : u->neighbors) {
                if (!occupied(s->id)) {
                    updateVertex(s->id);
                }
            }
        } else {
            u->g = infinity;
            for (auto s : u->neighbors) {
                if (!occupied(s->id)) {
                    updateVertex(s->id);
                }
            }
            updateVertex(u->id);
        }
    }
    bestPath = constructPath();
    return bestPath;
}
find_if does a linear search. It may be faster to use:
std::map/std::set -> Standard binary search tree implementations
std::unordered_map/std::unordered_set -> Standard hash table implementations
These may use a lot of memory if your elements (key-value pairs) are small integers. To avoid that you can use 3rd party alternatives like boost::unordered_flat_map.
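As a rough sketch of the ordered-container route (assuming you also remember the key each id was inserted with, e.g. in an unordered_map, so an entry can be located without a linear scan), a std::set ordered by (key, id) can play the role of the priority queue with O(log n) insert, erase and min lookup. The names here are placeholders, not the question's API:

#include <set>
#include <unordered_map>
#include <utility>

using Cost = float;
using HeapKey = std::pair<Cost, Cost>;

// Hypothetical replacement for the vector + push_heap queue: the set keeps
// (key, id) pairs ordered by key, and current_key remembers the key each id
// was inserted with so it can be erased in O(log n) instead of a linear find_if.
struct OpenList {
    std::set<std::pair<HeapKey, unsigned int>> U;
    std::unordered_map<unsigned int, HeapKey> current_key;

    void insert(unsigned int id, HeapKey k) {
        auto it = current_key.find(id);
        if (it != current_key.end()) U.erase({ it->second, id }); // re-insert = update
        U.insert({ k, id });
        current_key[id] = k;
    }

    void erase(unsigned int id) {
        auto it = current_key.find(id);
        if (it == current_key.end()) return;
        U.erase({ it->second, id });
        current_key.erase(it);
    }

    // Smallest key first, matching the min-key behaviour of the current heap.
    std::pair<HeapKey, unsigned int> top() const { return *U.begin(); }
    void pop() { erase(U.begin()->second); }
    bool empty() const { return U.empty(); }
};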
How do you re-heapify after U.erase(it)? Do you ever delete multiple nodes at once?
If deletions need to be atomic between searches, then you can
swap it with end() - 1,
erase end() - 1, and
re-heapify.
Erasing end() - 1 is O(1) while erasing it is linear in std::distance(it, end).
void DstarPlanner::updateVertex(unsigned int id) {
    ...
    // take the id by reference since this is synchronous
    auto it = find_if(U.begin(), U.end(), [&](const auto& p) { return p.second == id; });
    *it = std::move(*(U.end() - 1));
    U.erase(U.end() - 1);
    std::make_heap(U.begin(), U.end(), KeyCompare()); // expensive!!! 3*distance(begin, end)
    ...
}
If you can delete multiple nodes between searches, then you can use a combination of erase + remove_if to only perform one mass re-heapify. This is important because heapify is expensive.
it = remove_if(begin, end, [](){ lambda }
erase(it, end)
re-heapify
void DstarPlanner::updateVertex(const std::vector<unsigned int>& sorted_ids) {
    ...
    auto it = remove_if(U.begin(), U.end(), [&](const auto& p) { return std::binary_search(sorted_ids.begin(), sorted_ids.end(), p.second); });
    U.erase(it, U.end());
    std::make_heap(U.begin(), U.end(), KeyCompare()); // expensive!!! 3*distance(begin, end)
    ...
}
Doing better
You can possibly improve on this by replacing std::make_heap (which makes no assumptions about the heapiness of [begin(), end())) with a custom method that re-heapifies a former heap around "poison points": it only needs to inspect the elements around the positions that were swapped. This sounds like a pain to write and I'd only do it if the resulting program was still too slow.
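A hedged sketch of that idea: after the last element has been moved into the hole at index i, the heap is valid everywhere except along the path through i, so it can be repaired with a local sift-up/sift-down instead of a full make_heap. The helper assumes the same comparator the heap was built with (KeyCompare in the question); in updateVertex you would swap the found element with U.back(), pop_back, and call it at the old index (skipping the call when the erased element was already the last one).

#include <algorithm>
#include <cstddef>

// Repair a heap [first, first + n) that satisfies the heap property everywhere
// except possibly at index i (the "poison point" left behind after swapping the
// last element into an erased slot). comp is the comparator the heap was built with.
template <class RandomIt, class Compare>
void reheapify_at(RandomIt first, std::size_t n, std::size_t i, Compare comp) {
    // Sift up while the element outranks its parent...
    while (i > 0 && comp(first[(i - 1) / 2], first[i])) {
        std::iter_swap(first + i, first + (i - 1) / 2);
        i = (i - 1) / 2;
    }
    // ...then sift down while a child outranks it.
    for (;;) {
        std::size_t left = 2 * i + 1, right = left + 1, top = i;
        if (left  < n && comp(first[top], first[left]))  top = left;
        if (right < n && comp(first[top], first[right])) top = right;
        if (top == i) break;
        std::iter_swap(first + i, first + top);
        i = top;
    }
}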
Have you thought of...
Just not even removing elements from the heap? The fact you're using a heap tells me that the algorithm designers suggested a heap. If they suggested a heap, then they likely didn't envision random removals. This is speculation on my part. I'm otherwise not familiar with D* lite.
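For what it's worth, a common way to avoid removals entirely is lazy deletion: updateVertex never erases, it just records the key the cell should currently have, and stale heap entries are discarded when they surface at the top. A minimal sketch (names are placeholders, not the question's API):

#include <algorithm>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using Cost = float;
using HeapKey = std::pair<Cost, Cost>;
using Entry = std::pair<HeapKey, unsigned int>;
using KeyCompare = std::greater<Entry>;

std::vector<Entry> U;
std::unordered_map<unsigned int, HeapKey> live_key; // key each cell should currently have

void push_vertex(unsigned int id, HeapKey k) {
    live_key[id] = k;                         // supersedes any older entry for id
    U.push_back({ k, id });
    std::push_heap(U.begin(), U.end(), KeyCompare());
}

void remove_vertex(unsigned int id) { live_key.erase(id); } // lazy removal, O(1)

// Pops the smallest live entry; returns false when the queue is effectively empty.
bool pop_vertex(Entry& out) {
    while (!U.empty()) {
        std::pop_heap(U.begin(), U.end(), KeyCompare());
        Entry e = U.back();
        U.pop_back();
        auto it = live_key.find(e.second);
        if (it != live_key.end() && it->second == e.first) { // still current?
            live_key.erase(it);
            out = e;
            return true;
        }
        // stale entry (superseded or removed): skip it
    }
    return false;
}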

A high-performance similarity cache for metric spaces

The Problem:
First of all: this is a high-performance application, so execution time is the most important aspect. I have a back-end system which computes some expensive function:
template<typename C, typename R>
R backEndFunction (C &code){
...
}
Where code is a 1xN vector belonging to a metric space. Notice that since N is large, computing the distance between two codes is expensive (this problem is known as the curse of dimensionality).
I'm designing a "similarity cache" (following an LRU policy), which is placed between the user who submits query codes and the back-end system. So each cached element is a pair (actually a triple, as we'll see later) of type (C,R) containing the cached code and the associated result.
It works like this, given a query code by the user:
Compute the distance between the query and each cached code and find the minimum distance. Notice that if the code is the same, the distance is 0.
If the distance is lower than a given threshold, there is a cache hit, so take the hit element and put it in front of the cache and finally return the result relative to the hit element.
Otherwise, call the back end system function, insert the new pair in front of the cache, pop the last one and finally return the computed result.
Cache design:
First of all, this cache is supposed to contain, say, 10k elements. This is important, because if we had say 100k elements we would have to solve step 1 with a nearest-neighbor algorithm (e.g. LSH). Instead, with so few elements, a parallel brute-force approach is still feasible.
The parallel section has been implemented using OpenMP. Since OpenMP can't be used on non-vector-like structures, our cache can't be a simple std::queue or similar.
So, this is my solution; it involves two data structures:
std::vector<CacheElem> values, where CacheElem is a triple (code, result, listElem) where listElem is an iterator to an element of the second structure (below).
std::list<size_t> lru to implement the lru policy (you don't say?), where lru[i] is the index of the corresponding element in values.
So, if lru[i]=j then values[j].listElem is an iterator to that element of lru. So when the cache receives a query code:
Parallel compute the minimum distance between the query and all elements in values
If there is a cache hit, use the iterator listElem as reference to the corresponding element in lru and put it in front of the list.
If there is a cache miss, compute the query result (using the back end), push to the front of lru the index stored in the last element (the one which has to be removed), replace that entry of values with the query code, the result and lru.begin(), and finally pop the last element of lru.
Obviously, if the cache isn't full, the whole "pop last element" part isn't necessary.
The code (first version, didn't test it yet):
/**
* C = code type
* R = result type
* D = distance type (e.g. float for euclidean)
*/
template <typename C, typename R, typename D>
class Cache {
typedef std::shared_ptr<cc::Distance<C,D>> DistancePtr;
public:
Cache(const DistancePtr distance, const std::function<R(C)> &backEnd, const size_t size = 10000, const float treshold = 0);
R Query(const C &query);
void PrintCache();
private:
struct Compare{
Compare(D val = std::numeric_limits<D>::max(), size_t index = 0) : val(val), index(index) {}
D val;
size_t index;
};
#pragma omp declare reduction(minimum : struct Compare : omp_out = omp_in.val < omp_out.val ? omp_in : omp_out) initializer (omp_priv=Compare())
struct CacheElem{
CacheElem(const C &code, const R &result, std::list<size_t>::iterator listElem) : code(code), result(result), listElem(listElem) {}
C code;
R result;
std::list<size_t>::iterator listElem; //pointing to corresponding element in lru0
};
DistancePtr distance;
std::function<R(C)> backEnd;
std::vector<CacheElem> values;
std::list<size_t> lru;
float treshold;
size_t size;
};
template <typename C, typename R, typename D>
Cache<C,R,D>::Cache(const DistancePtr distance, const std::function<R(C)> &backEnd, const size_t size, const float treshold)
: distance(distance), backEnd(backEnd), treshold(treshold), size(size) {
values.reserve(size);
std::cout<<"CACHE SETUP: size="<<size<<" treshold="<<treshold<<std::endl;
}
template <typename C, typename R, typename D>
void Cache<C,R,D>::PrintCache() {
std::cout<<"LRU: ";
for(std::list<size_t>::iterator it=lru.begin(); it != lru.end(); ++it)
std::cout<<*it<<" ";
std::cout<<std::endl;
std::cout<<"VALUES: ";
for(size_t i=0; i<values.size(); i++)
std::cout<<"("<<values[i].code<<","<<values[i].result<<","<<*(values[i].listElem)<<")";
std::cout<<std::endl;
}
template <typename C, typename R, typename D>
R Cache<C,R,D>::Query(const C &query){
PrintCache();
Compare min;
R result;
std::cout<<"query="<<query<<std::endl;
//Find the cached element with min distance
#pragma omp parallel for reduction(minimum:min)
for(size_t i=0; i<values.size(); i++){
D d = distance->compute(query, values[i].code);
#pragma omp critical
{
std::cout<<omp_get_thread_num()<<" min="<<min.val<<" distance("<<query<<" "<<values[i].code<<")= "<<d;
if(d < min.val){
std::cout<<" NEW MIN!";
min.val = d;
min.index = i;
}
std::cout<<std::endl;
}
}
std::cout<<"min.val="<<min.val<<std::endl;
//Cache hit
if(!lru.empty() && min.val < treshold){
std::cout<<"cache hit with index="<<min.index<<" result="<<values[min.index].result<<" distance="<<min.val<<std::endl;
CacheElem hitElem = values[min.index];
//take the hit element to top of the queue
if( hitElem.listElem != lru.begin() )
lru.splice( lru.begin(), lru, hitElem.listElem, std::next( hitElem.listElem ) );
result = hitElem.result;
}
//cache miss
else {
result = backEnd(query);
std::cout<<"cache miss backend="<<result;
//Cache reached max capacity
if(lru.size() == size){
//last item (the one that must be removed) value is its corresponding index in values
size_t lastIndex = lru.back();
//remove last element
lru.pop_back();
//insert new element in the list
lru.push_front(lastIndex);
//insert new element in the value vector, replacing the old one
values[lastIndex] = CacheElem(query, result, lru.begin());
std::cout<<" index to replace="<<lastIndex;
}
else{
lru.push_front(values.size()); //since we are going to inser a new element, we don't need to do size()-1
values.push_back(CacheElem(query, result, lru.begin()));
}
std::cout<<std::endl;
}
PrintCache();
std::cout<<"-------------------------------------"<<std::endl;
return result;
}
My questions:
Knowing that for each query I have to inspect all the cached elements, do you think there is a more performant solution than mine? Especially considering the solution of using std::vector<CacheElem> values; and std::list<size_t> lru;?
In case this is the best solution: we know that std::list isn't a good choice for high-performance applications, but for the given problem I didn't find a better one. Do you know of any queue-like structure where you can move arbitrary elements to the front and only need pop_back and push_front (as in this case)?

performing vector intersection in C++

I have a vector of vectors of unsigned. I need to find the intersection of all these vectors of unsigned; to do so I wrote the following code:
int func()
{
    vector<vector<unsigned> > t;
    vector<unsigned> intersectedValues;
    bool firstIntersection=true;
    for(int i=0;i<(t).size();i++)
    {
        if(firstIntersection)
        {
            intersectedValues=t[0];
            firstIntersection=false;
        }else{
            vector<unsigned> tempIntersectedSubjects;
            set_intersection(t[i].begin(),
                             t[i].end(), intersectedValues.begin(),
                             intersectedValues.end(),
                             std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.begin()));
            intersectedValues=tempIntersectedSubjects;
        }
        if(intersectedValues.size()==0)
            break;
    }
}
Each individual vector has 9000 elements and there are many such vectors in "t". When I profiled my code I found that set_intersection takes the most time and hence makes the code slow when there are many invocations of func(). Can someone please suggest how I can make the code more efficient?
I am using: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
EDIT: Individual vectors in vector "t" are sorted.
I don't have a framework to profile the operations, but I'd certainly change the code to reuse the already allocated vector. In addition, I'd hoist the initial assignment out of the loop. Also, std::back_inserter() makes sure that elements are appended at the end rather than inserted at the beginning:
void func()
{
    vector<vector<unsigned> > t = some_initialization();
    if (t.empty()) {
        return;
    }
    vector<unsigned> intersectedValues(t[0]);
    vector<unsigned> tempIntersectedSubjects;
    for (std::vector<std::vector<unsigned>>::size_type i(1u);
         i < t.size() && !intersectedValues.empty(); ++i) {
        std::set_intersection(t[i].begin(), t[i].end(),
                              intersectedValues.begin(), intersectedValues.end(),
                              std::back_inserter(tempIntersectedSubjects));
        std::swap(intersectedValues, tempIntersectedSubjects);
        tempIntersectedSubjects.clear();
    }
}
I think this code has a fair chance of being faster. It may also be reasonable to intersect the sets differently: instead of keeping one set and intersecting with that, you could create a new intersection for each pair of adjacent sets and then repeat, intersecting the resulting sets with their respective adjacent ones:
std::vector<std::vector<unsigned>> intersections(
    std::vector<std::vector<unsigned>> const& t) {
    std::vector<std::vector<unsigned>> r;
    std::vector<std::vector<unsigned>>::size_type i(0);
    for (; i + 1 < t.size(); i += 2) {
        // intersect() is assumed to do a plain set_intersection of two sorted vectors
        r.push_back(intersect(t[i], t[i + 1]));
    }
    if (i < t.size()) {
        r.push_back(t[i]);
    }
    return r;
}

std::vector<unsigned> func(std::vector<std::vector<unsigned>> const& t) {
    if (t.empty()) { /* deal with t being empty... */ }
    std::vector<std::vector<unsigned>> r(intersections(t));
    return r.size() == 1? r[0]: func(r);
}
Of course, you wouldn't really implement it like this: you'd use Stepanov's binary counter to keep the intermediate sets. This approach assumes that the result is most likely non-empty. If the expectation is that the result will be empty that may not be an improvement.
I can't test this but maybe something like this would be faster?
int func()
{
    vector<vector<unsigned> > t;
    vector<unsigned> intersectedValues;
    // remove if() branching from loop
    if(t.empty())
        return -1;
    intersectedValues = t[0];
    // now start from 1
    for(size_t i = 1; i < t.size(); ++i)
    {
        vector<unsigned> tempIntersectedSubjects;
        tempIntersectedSubjects.reserve(intersectedValues.size()); // pre-allocate
        // insert at end() not begin()
        set_intersection(t[i].begin(),
                         t[i].end(), intersectedValues.begin(),
                         intersectedValues.end(),
                         std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
        // as these are not used again you can move them rather than copy
        intersectedValues = std::move(tempIntersectedSubjects);
        if(intersectedValues.empty())
            break;
    }
    return 0;
}
Another possibility:
Thinking about it, using swap() could optimize the exchange of data and remove the need to re-allocate. Also, the temp's constructor can then be moved out of the loop.
int func()
{
    vector<vector<unsigned> > t;
    vector<unsigned> intersectedValues;
    // remove if() branching from loop
    if(t.empty())
        return -1;
    intersectedValues = t[0];
    // no need to construct this every loop
    vector<unsigned> tempIntersectedSubjects;
    // now start from 1
    for(size_t i = 1; i < t.size(); ++i)
    {
        // should already be the correct size from previous loop
        // but just in case this should be cheap
        // (profile removing this line)
        tempIntersectedSubjects.reserve(intersectedValues.size());
        // insert at end() not begin()
        set_intersection(t[i].begin(),
                         t[i].end(), intersectedValues.begin(),
                         intersectedValues.end(),
                         std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
        // swap should leave tempIntersectedSubjects preallocated to the
        // correct size
        intersectedValues.swap(tempIntersectedSubjects);
        tempIntersectedSubjects.clear(); // will not deallocate
        if(intersectedValues.empty())
            break;
    }
    return 0;
}
You can make std::set_intersection as well as a bunch of other standard library algorithms run in parallel by defining _GLIBCXX_PARALLEL during compilation. That probably has the best work-gain ratio. For documentation see this.
Obligatory pitfall warning:
Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.
from here.
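For illustration, enabling parallel mode is purely a matter of compile flags; the calling code stays the same. A minimal sketch (whether a particular call actually runs in parallel depends on the iterator categories and input sizes):

// Build with GCC parallel mode: g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL example.cpp
#include <algorithm>
#include <vector>

std::vector<unsigned> intersect_two(const std::vector<unsigned>& a,
                                    const std::vector<unsigned>& b) {
    std::vector<unsigned> out(std::min(a.size(), b.size()));
    // Same algorithm call as before; with _GLIBCXX_PARALLEL defined, libstdc++
    // may dispatch it to its parallel-mode implementation for large inputs.
    auto last = std::set_intersection(a.begin(), a.end(),
                                      b.begin(), b.end(), out.begin());
    out.resize(last - out.begin());
    return out;
}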
Another simple, though probably insignificantly small, optimization would be reserving enough space before filling your vectors.
Also, try to find out whether inserting the values at the back instead of the front and then reversing the vector helps. (Although I even think that your code is wrong right now and your intersectedValues is sorted the wrong way. If I'm not mistaken, you should use std::back_inserter instead of std::inserter(...,begin) and then not reverse.) While shifting stuff through memory is pretty fast, not shifting should be even faster.
Copying the elements from vector to vector in a for loop with emplace_back() may save you time. And there is no need for a flag if you change the starting index of the for loop, so the loop can be simplified and the per-iteration condition check removed.
void func()
{
    vector<vector<unsigned> > t;
    vector<unsigned int> intersectedValues;
    if(t.empty())
        return;
    intersectedValues = t[0];                 // hoisted out of the loop
    for(unsigned int i = 1; i < t.size(); i++)
    {
        vector<unsigned> tempIntersectedSubjects;
        set_intersection(t[i].begin(),
                         t[i].end(), intersectedValues.begin(),
                         intersectedValues.end(),
                         std::back_inserter(tempIntersectedSubjects));
        intersectedValues.clear();            // keep only the new intersection
        for(auto &ele : tempIntersectedSubjects)
            intersectedValues.emplace_back(ele);
        if(intersectedValues.empty())
            break;
    }
}
std::set_intersection can be rather slow for large vectors. It's possible to create a similar function that uses lower_bound. Something like this:
template<typename Iterator1, typename Iterator2, typename Function>
void lower_bound_intersection(Iterator1 begin_1, Iterator1 end_1, Iterator2 begin_2, Iterator2 end_2, Function func)
{
    for (; begin_1 != end_1 && begin_2 != end_2;)
    {
        if (*begin_1 < *begin_2)
        {
            // skip ahead instead of advancing one element at a time
            begin_1 = std::lower_bound(begin_1, end_1, *begin_2);
        }
        else if (*begin_2 < *begin_1)
        {
            begin_2 = std::lower_bound(begin_2, end_2, *begin_1);
        }
        else // equivalent
        {
            func(*begin_1);
            ++begin_1;
            ++begin_2;
        }
    }
}
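A quick usage sketch for the helper above, passing a lambda that collects the matches (the function name is just for illustration):

#include <vector>

// Intersect two sorted vectors using lower_bound_intersection from above.
std::vector<unsigned> intersect_sorted(const std::vector<unsigned>& a,
                                       const std::vector<unsigned>& b) {
    std::vector<unsigned> result;
    lower_bound_intersection(a.begin(), a.end(), b.begin(), b.end(),
                             [&result](unsigned v) { result.push_back(v); });
    return result;
}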

How to efficiently clear vector<Mat> in opencv c++

I have a vector<Mat> my_vect in which each Mat is of type float and of size 90*90. I load 16000 matrices from disk into that vector. After I finish working with those matrices, I clear them. Here is my code for loading and clearing the vector:
Mat mat1(90,90,CV_32F);
load_vector_of_matrices("filename",my_vect); //this loads 16K elements
//do something
for(i = 1:16K)
correlate(mat1, my_vect.at(i));
my_vect.clear();
I'm loading the 16K elements together for the sake of efficiency.
Now my question is: reading all these matrices takes 3-4 seconds, and my_vect.clear() takes approximately the same amount of time, which is a lot.
According to this answer, it should take O(n) time, since I assume Mat doesn't have a trivial destructor.
Why does clearing take so much time? Does the matrix destructor overwrite each element of the matrix? Is there a way to decrease the time for clearing the vector?
EDIT:
I'm using Visual Studio 2010 and the level of optimization is Maximize Speed(/O2).
First, a streaming loader. Provide it with a function that, given a max, returns a vector of data (aka loader<T>). It can store internal state, but it will be copied, so store that internal state in a std::shared_ptr. I guarantee that only one copy of it will be invoked.
You are not responsible for returning all max data from your loader, but as written you must return at least 1 element. Returning more is gravy, and may reduce threading overhead.
You then call streaming_loader<T>( your_loader, count ).
It returns a std::shared_ptr< std::vector< std::future< T > > >. You can wait on these futures, but you must wait on them in order (the 2nd one is not guaranteed to be ready to be waited on until the first one has provided data).
template<class T>
using loader = std::function< std::vector<T>(size_t max) >;
template<class T>
using stream_data = std::shared_ptr< std::vector< std::future<T> > >;
namespace details {
template<class T>
T streaming_load_some( loader<T> l, size_t start, stream_data<T> data ) {
auto loaded = l(data->size()-start);
// populate the stuff after start first, so they are ready:
for( size_t i = 1; i < loaded.size(); ++i ) {
std::promise<T> promise;
promise.set_value( std::move(loaded[i]) );
(*data)[start+i] = promise.get_future();
}
if (start+loaded.size() < data->size()) {
// recurse:
std::size_t new_start = start+loaded.size();
(*data)[new_start] = std::async(
std::launch::async,
[l, new_start, data]{return streaming_load_some<T>( l, new_start, data );}
);
}
// populate the future:
return std::move(loaded.front());
}
}
template<class T>
stream_data< T >
streaming_loader( loader<T> l, size_t n ) {
auto retval = std::make_shared<std::vector< std::future<T> >>(n);
if (retval->empty()) return retval;
retval->front() = std::async(
std::launch::async,
[retval, l]()->T{return details::streaming_load_some<T>( l, 0, retval );
});
return retval;
}
For use, you take the stream_data<T> (aka a shared pointer to vector of future data), iterate over it, and .get() each in turn. Then do your processing. If you need a block of 50 of them, call .get() on each in turn until you get to 50 -- do not skip to number 50.
Here is a completely toy loader and test harness:
struct loader_state {
int index = 0;
};
struct test_loader {
std::shared_ptr<loader_state> state; // current loading state stored here
std::vector<int> operator()( std::size_t max ) const {
std::size_t amt = max/2+1;// really, really stupid way to decide how much to load
std::vector<int> retval;
retval.reserve(amt);
for (size_t i = 0; i < amt; ++i) {
retval.push_back( -(int)(state->index + i) ); // populate the return value
}
state->index += amt;
return retval;
}
// in real code, make this constructor do something:
test_loader():state(std::make_shared<loader_state>()) {}
};
int main() {
auto data = streaming_loader<int>( test_loader{}, 1024 );
std::size_t count = 0;
for( std::future<int>& x : *data ) {
++count;
int value = x.get(); // get data
// process. In this case, print out 100 in blocks of 10:
if (count * 100 / data->size() > (count-1) * 100 / data->size())
std::cout << value << ", ";
if (count * 10 / data->size() > (count-1) * 10 / data->size())
std::cout << "\n";
}
std::cout << std::endl;
// your code goes here
return 0;
}
count may or may not be worthless. The internal state of the loader above is pretty darn worthless, I just use it to demonstrate how to store some state.
You can do something similar to destroy a pile of objects without waiting for their destructors to complete. Or, you can rely on the fact that destroying your data can happen while you are working on it and waiting for the next data to load.
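A rough sketch of that idea (not part of the loader code above, and assuming C++14 move capture and that the program doesn't exit before the detached thread finishes): hand the storage to a background thread so the 16K Mat destructors run while the main thread keeps working.

#include <opencv2/core/core.hpp>
#include <thread>
#include <utility>
#include <vector>

// Destroy the matrices on a background thread; my_vect is left empty
// and can be refilled immediately by the caller.
void discard_async(std::vector<cv::Mat>& my_vect) {
    std::thread([victim = std::move(my_vect)]() mutable {
        victim.clear();   // the Mat destructors (and deallocations) run off-thread
    }).detach();
    my_vect.clear();      // moved-from vector: make its (empty) state explicit
}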
live example
In an industrial strength solution, you'd need to include ways to abort all this stuff, among other things. Exceptions might be one way. Also, feedback to the loader about how far behind the processing code is can be helpful (if it is at your heels, return smaller chunks -- if it is way behind, return larger chunks). In theory, that can be arranged via a back channel in loader<T>.
Now that I have played with the above for a bit, a probably better fit is:
#include <iostream>
#include <future>
#include <functional>
#include <vector>
#include <memory>
// if it returns empty, there is nothing more to load:
template<class T>
using loader = std::function< std::vector<T>() >;
template<class T>
struct next_data;
template<class T>
struct streamer {
std::vector<T> data;
std::unique_ptr<next_data<T>> next;
};
template<class T>
struct next_data:std::future<streamer<T>> {
using parent = std::future<streamer<T>>;
using parent::parent;
next_data( parent&& o ):parent(std::move(o)){}
};
live example. It requires some infrastructure to populate that very first streamer<T>, but the code will be simpler, and the strange requirement (of knowing how much data, and only doing a .get() from the first element) goes away.
template<class T>
streamer<T> stream_step( loader<T> l ) {
streamer<T> retval;
retval.data = l();
if (retval.data.empty())
return retval;
retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
return retval;
}
template<class T>
streamer<T> start_stream( loader<T> l ) {
streamer<T> retval;
retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
return retval;
}
A downside is that writing a ranged-based iterator becomes a bit trickier.
Here is a sample use of the second implementation:
struct counter {
std::size_t max;
std::size_t current = 0;
counter( std::size_t m ):max(m) {}
std::vector<int> operator()() {
std::vector<int> retval;
std::size_t do_at_most = 100;
while( current < max && (do_at_most-->0)) {
retval.push_back( int(current) );
++current;
}
return retval;
}
};
int main() {
streamer<int> s = start_stream<int>( counter(1024) );
while(true) {
for (int x : s.data) {
std::cout << x << ",";
}
std::cout << std::endl;
if (!s.next)
break;
s = std::move(s.next->get());
}
// your code goes here
return 0;
}
where counter is a trivial loader (an object that reads data into a std::vector<T> in whatever sized chunks it feels like). The processing of the data is in the main code, where we just print them out in get-sized chunks.
The loading happens in a different thread, and will continue asynchronously whatever the main thread does. The main thread just gets delivered std::vector<T> to do with as they will. In your case, you'd make T a Mat.
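For instance, a hypothetical loader for the question's data might look like the sketch below; the file list, the 256-matrix chunk size, and the cv::FileStorage node name are all assumptions, stand-ins for whatever load_vector_of_matrices actually does.

#include <algorithm>
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

// Hypothetical loader<cv::Mat>: returns the next chunk of matrices, and an
// empty vector once every file has been read (the second implementation's
// signal that the stream is finished).
struct mat_loader {
    std::vector<std::string> files; // one 90x90 CV_32F matrix per file (assumption)
    std::size_t next = 0;

    std::vector<cv::Mat> operator()() {
        std::vector<cv::Mat> chunk;
        std::size_t stop = std::min(files.size(), next + 256); // 256 Mats per chunk
        for (; next < stop; ++next) {
            cv::FileStorage fs(files[next], cv::FileStorage::READ);
            cv::Mat m;
            fs["mat"] >> m;  // assumed node name inside each file
            chunk.push_back(m);
        }
        return chunk;
    }
};

// Usage mirrors the counter example above:
//   mat_loader ml; ml.files = /* list of file names */;
//   streamer<cv::Mat> s = start_stream<cv::Mat>( ml );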
The Mat objects are complex objects that have internal allocations of memory. When you clear the vector, it will need to iterate through every instance of Mat contained and run its destructor, which is itself a non-trivial operation.
Also remember that free-store memory operations are non-trivial; depending on your heap implementation, the heap may decide to merge cells, etc.
If this is a problem, you should run your clear through a profiler and find out where the bottleneck is.
Be careful using optimization, it can drive the debugger crazy.
What if you were to do this in a function and simply let the vector go out of scope? Since the elements are not pointers, I think this would work.

Recursion Depth Cut Off Strategy: Parallel QuickSort

I have a parallel quicksort algorithm implemented. To avoid the overhead of excess parallel threads I had a cut-off strategy to turn the parallel algorithm into a sequential one when the vector size was smaller than a particular threshold. However, now I am trying to base the cut-off strategy on recursion depth, i.e. I want my algorithm to turn sequential when a certain recursion depth is reached. I employed the following code, but it doesn't work. I'm not sure how to proceed. Any ideas?
template <class T>
void ParallelSort::sortHelper(typename vector<T>::iterator start, typename vector<T>::iterator end, int level = 0) // THIS IS THE QUICKSORT INTERFACE
{
    static int depth = 0;
    const int insertThreshold = 20;
    const int threshold = 1000;
    if(start < end)
    {
        if(end - start < insertThreshold) // threshold for insert sort
        {
            insertSort<T>(start, end);
        }
        else if((end - start) >= insertThreshold && depth < threshold) // threshold for non-parallel quicksort
        {
            int part = partition<T>(start, end);
            depth++;
            sortHelper<T>(start, start + (part - 1), level + 1);
            depth--;
            depth++;
            sortHelper<T>(start + (part + 1), end, level + 1);
            depth--;
        }
        else
        {
            int part = partition<T>(start, end);
            #pragma omp task
            {
                depth++;
                sortHelper<T>(start, start + (part - 1), level + 1);
                depth--;
            }
            depth++;
            sortHelper<T>(start + (part + 1), end, level + 1);
            depth--;
        }
    }
}
I tried the static variable depth and also the non-static variable level, but neither of them works.
NOTE: The above snippet only depends on depth; level is included to show both of the methods tried.
The static depth being written to from two threads without synchronization is a data race: the behavior is undefined, so what those writes do cannot be relied on.
As it happens, you are passing down level, which is your recursion depth. At each level, you double the number of threads -- so a limit on level equal to 6 (say) corresponds to 2^6 threads at most. Your code is only half parallel, because the partition code occurs in the main thread, so you will probably have fewer than the theoretical maximum number of threads going at once.
template <class T>
void ParallelSort::sortHelper(typename vector<T>::iterator start, typename vector<T>::iterator end, int level = 0) // THIS IS THE QUICKSORT INTERFACE
{
    const int insertThreshold = 20;
    const int treeDepth = 6; // at most 2^6 = 64 tasks
    if(start < end)
    {
        if(end - start < insertThreshold) // threshold for insert sort
        {
            insertSort<T>(start, end);
        }
        else if(level >= treeDepth) // only 2^treeDepth threads, after which we run in sequence
        {
            int part = partition<T>(start, end);
            sortHelper<T>(start, start + (part - 1), level + 1);
            sortHelper<T>(start + (part + 1), end, level + 1);
        }
        else // launch two tasks, creating an exponential number of threads:
        {
            int part = partition<T>(start, end);
            #pragma omp task
            {
                sortHelper<T>(start, start + (part - 1), level + 1);
            }
            sortHelper<T>(start + (part + 1), end, level + 1);
        }
    }
}
Alright, I figured it out. It was a stupid mistake on my part.
The algorithm should fall back to the sequential code when the recursion depth is greater than some threshold, not smaller. Doing so solves the problem and gives me a speedup.