Recursion Depth Cut-Off Strategy: Parallel QuickSort - C++

I have a parallel quicksort algorithm implemented. To avoid the overhead of excess parallel threads I used a cut-off strategy that turns the parallel algorithm into a sequential one when the vector size is smaller than a particular threshold. However, now I am trying to base the cut-off strategy on recursion depth, i.e. I want my algorithm to turn sequential when a certain recursion depth is reached. I employed the following code, but it doesn't work. I'm not sure how to proceed. Any ideas?
template <class T>
void ParallelSort::sortHelper(typename vector<T>::iterator start, typename vector<T>::iterator end, int level = 0) // THIS IS THE QUICKSORT INTERFACE
{
    static int depth = 0;
    const int insertThreshold = 20;
    const int threshold = 1000;

    if (start < end)
    {
        if (end - start < insertThreshold) // threshold for insertion sort
        {
            insertSort<T>(start, end);
        }
        else if ((end - start) >= insertThreshold && depth < threshold) // threshold for non-parallel quicksort
        {
            int part = partition<T>(start, end);
            depth++;
            sortHelper<T>(start, start + (part - 1), level + 1);
            depth--;
            depth++;
            sortHelper<T>(start + (part + 1), end, level + 1);
            depth--;
        }
        else
        {
            int part = partition<T>(start, end);
            #pragma omp task
            {
                depth++;
                sortHelper<T>(start, start + (part - 1), level + 1);
                depth--;
            }
            depth++;
            sortHelper<T>(start + (part + 1), end, level + 1);
            depth--;
        }
    }
}
I tried the static variable depth and also the non-static variable level, but neither of them works.
NOTE: The above snippet only depends on depth; level is included to show both of the methods I tried.

A static depth written to from two threads without synchronization is a data race, which is undefined behavior: you cannot rely on what those writes end up doing.
As it happens, you are already passing down level, which is your recursion depth. At each level you double the number of tasks, so a limit of, say, 6 on level corresponds to at most 2^6 tasks. Your code is only about half parallel, because the partition step runs in the spawning thread, so you will probably have fewer than the theoretical maximum number of tasks in flight at once.
template <class T>
void ParallelSort::sortHelper(typename vector<T>::iterator start, typename vector<T>::iterator end, int level = 0) // THIS IS THE QUICKSORT INTERFACE
{
    const int insertThreshold = 20;
    const int treeDepth = 6; // at most 2^6 = 64 tasks

    if (start < end)
    {
        if (end - start < insertThreshold) // threshold for insertion sort
        {
            insertSort<T>(start, end);
        }
        else if (level >= treeDepth) // only 2^treeDepth tasks, after which we run in sequence
        {
            int part = partition<T>(start, end);
            sortHelper<T>(start, start + (part - 1), level + 1);
            sortHelper<T>(start + (part + 1), end, level + 1);
        }
        else // spawn a task for one half, creating an exponential number of tasks:
        {
            int part = partition<T>(start, end);
            #pragma omp task
            {
                sortHelper<T>(start, start + (part - 1), level + 1);
            }
            sortHelper<T>(start + (part + 1), end, level + 1);
        }
    }
}

Alright, I figured it out. It was a stupid mistake on my part.
The algorithm should fall back to the sequential code when the recursion depth is greater than the threshold, not smaller. Doing so solves the problem and gives me a speedup.

Optimizing huge graph traversal with OpenMP

I am trying to optimize this function, which according to the perf tool is the bottleneck that keeps me from achieving close to linear scaling. The performance gets worse as the number of threads goes up, and when I drill down into the assembly generated by perf it shows that most of the time is spent checking for visited and not-visited vertices. I've done a ton of Google searches to improve the performance, to no avail. Is there a way to improve the performance of this function? Or is there a thread-safe way of implementing this function? Thanks for your help in advance!
typedef uint32_t vidType;

template<typename T, typename U, typename V>
bool compare_and_swap(T &x, U old_val, V new_val) {
  return __sync_bool_compare_and_swap(&x, old_val, new_val);
}

template<bool map_vertices, bool map_edges>
VertexSet GraphT<map_vertices, map_edges>::N(vidType vid) const {
  assert(vid >= 0);
  assert(vid < n_vertices);
  eidType begin = vertices[vid], end = vertices[vid+1];
  if (begin > end || end > n_edges) {
    fprintf(stderr, "vertex %u bounds error: [%lu, %lu)\n", vid, begin, end);
    exit(1);
  }
  assert(end <= n_edges);
  return VertexSet(edges + begin, end - begin, vid);
}

void bfs_step(Graph &g, vidType *depth, SlidingQueue<vidType> &queue) {
  #pragma omp parallel
  {
    QueueBuffer<vidType> lqueue(queue);
    #pragma omp for
    for (auto q_iter = queue.begin(); q_iter < queue.end(); q_iter++) {
      auto src = *q_iter;
      for (auto dst : g.N(src)) {
        //int curr_val = parent[dst];
        auto curr_val = depth[dst];
        if (curr_val == MYINFINITY) { // not visited
          //if (compare_and_swap(parent[dst], curr_val, src)) {
          if (compare_and_swap(depth[dst], curr_val, depth[src] + 1)) {
            lqueue.push_back(dst);
          }
        }
      }
    }
    lqueue.flush();
  }
}
First of all, you're using a very traditional formulation of graph algorithms. Good for textbooks, not for computation. If you write this as a generalized matrix-vector product with the adjacency matrix you lose all those fiddly queues and the parallelism becomes quite obvious.
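To make that concrete, here is one way the matrix-vector view can look in plain OpenMP. This is a sketch, not a drop-in replacement for the code above: it assumes an undirected graph stored in CSR form (rowptr/colidx, names chosen here for illustration) and performs a "pull" step, where every unvisited vertex checks whether it has a neighbour in the current frontier. Each thread writes only depth[v] and next[v] for its own v, so no shared queue or compare-and-swap is needed.
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t vidType;   // as in the question
typedef std::uint64_t eidType;   // assumed edge-index type

// One BFS step in "pull" direction over a CSR adjacency (rowptr/colidx).
// frontier[v] is non-zero if v was discovered in the previous step.
// next must be pre-sized to n and zeroed by the caller.
void bfs_step_pull(const std::vector<eidType>& rowptr,
                   const std::vector<vidType>& colidx,
                   const std::vector<char>& frontier,
                   std::vector<char>& next,
                   std::vector<vidType>& depth,
                   vidType level,
                   vidType infinity)   // stands in for MYINFINITY
{
    const std::size_t n = frontier.size();
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::size_t v = 0; v < n; v++) {
        if (depth[v] != infinity)
            continue;                                 // already visited
        for (eidType e = rowptr[v]; e < rowptr[v + 1]; e++) {
            if (frontier[colidx[e]]) {                // a neighbour is in the frontier
                depth[v] = level + 1;                 // only this thread touches depth[v]
                next[v] = 1;
                break;
            }
        }
    }
}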
In your formulation, the problem is with the push_back function on the queue. That is hard to parallelize. The solution is to let each thread have its own queue, and then use a reduction. This works if you define the plus operator on your queue object to effect a merge of the local queues.
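If you keep the queue-based "push" formulation, that per-thread-queue idea can also be expressed with an OpenMP user-declared reduction. The sketch below uses std::vector in place of the SlidingQueue/QueueBuffer from the question and keeps the compare-and-swap so that each vertex is claimed by exactly one thread; the function and variable names are illustrative, not from the original code.
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t vidType;
const vidType kInfinity = UINT32_MAX;   // stands in for MYINFINITY

// Merge two per-thread frontiers by concatenation.
#pragma omp declare reduction(merge : std::vector<vidType> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

std::vector<vidType> bfs_step_local_queues(const std::vector<std::vector<vidType>>& adj,
                                           const std::vector<vidType>& frontier,
                                           std::vector<vidType>& depth)
{
    std::vector<vidType> next;   // each thread gets a private copy, merged at the end
    #pragma omp parallel for reduction(merge : next) schedule(dynamic, 64)
    for (std::size_t i = 0; i < frontier.size(); i++) {
        const vidType src = frontier[i];
        for (vidType dst : adj[src]) {
            if (depth[dst] == kInfinity &&
                __sync_bool_compare_and_swap(&depth[dst], kInfinity, depth[src] + 1)) {
                next.push_back(dst);   // appended to this thread's private vector
            }
        }
    }
    return next;
}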

Parallelization of the bin packing problem with OpenMP

I am learning OpenMP, and I want to parallelize the well-known bin packing problem. The problem is that whatever I try, I can't get the correct solution (the one I get with the sequential version).
So far, I have tried multiple different versions (including reduction, tasks, schedule) but didn't get anything useful.
Below is my most recent try.
int binPackingParallel(std::vector<int> weight, int n, int c)
{
    int result = 0;
    int bin_rem[n];
    #pragma omp parallel for schedule(dynamic) reduction(+:result)
    for (int i = 0; i < n; i++) {
        bool done = false;
        int j;
        for (j = 0; j < result && !done; j++) {
            int b;
            #pragma omp atomic
            b = bin_rem[j] - weight[i];
            if (b >= 0) {
                bin_rem[j] = bin_rem[j] - weight[i];
                done = true;
            }
        }
        if (!done) {
            #pragma omp critical
            bin_rem[result] = c - weight[i];
            result++;
        }
    }
    return result;
}
Edit: I modified the original problem, so now a number of bins N is given and we need to check whether all elements can be put into N bins. I did this using recursion, but my parallel version is still slower.
bool can_fit_parallel(std::vector<int> arr, std::vector<int> bins, int n) {
    // base case: if the array is empty, we can fit the elements
    if (arr.empty()) {
        return true;
    }
    bool found = false;
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < n; i++) {
        if (bins[i] >= arr[0]) {
            bins[i] -= arr[0];
            if (can_fit_parallel(std::vector<int>(arr.begin() + 1, arr.end()), bins, n)) {
                found = true;
                #pragma omp cancel for
            }
            // if the element doesn't fit or if the recursion fails,
            // restore the bin's capacity and try the next bin
            bins[i] += arr[0];
        }
    }
    // if the element doesn't fit in any of the bins, return false
    return found;
}
Any help would be great
You do not need parallelization to make your code significantly faster. You have implemented the First-Fit method (its complexity is O(n^2)), but it can be significantly faster if you use a binary search tree (O(n log n)). To do so, you just have to use the standard library (std::multiset); in this example I have implemented the Best-Fit algorithm:
int binPackingSTL(const std::vector<int>& weight, const int n, const int c)
{
    std::multiset<int> bins;  // multiset to store bins
    for (const auto x : weight) {
        const auto it = bins.lower_bound(x); // find the best bin to accommodate x
        if (it == bins.end()) {
            bins.insert(c - x); // if no suitable bin is found, open a new one
        } else {
            // suitable bin found - replace it with a smaller value
            auto value = *it;       // store its value
            bins.erase(it);         // erase the old value
            bins.insert(value - x); // insert the new value
        }
    }
    return bins.size(); // number of bins
}
In my measurements, it is about 100 times faster than your code for n = 50000.
EDIT: Both algorithms mentioned above (First-Fit and Best-Fit) are approximation algorithms for the bin packing problem. To answer your revised question, you need an algorithm that finds the exact optimal solution, not an approximation. Instead of trying to reinvent the wheel, you can consider using already available libraries such as BPPLIB – A Bin Packing Problem Library.
This is not a reduction: a reduction would give each thread its own partial result, and you want result to be global. I think that putting a critical section around those two statements (the assignment to bin_rem[result] and the increment of result) might work. The atomic statement is meaningless, since it is not applied to a shared variable.
But there is a deeper problem: each i iteration can write to result, which affects how far the search in the other iterations goes. That means the outer iteration has to be sequential. (You really need to think hard about whether iterations are independent before you slap a parallel directive on them!) Maybe you can make the inner iteration parallel: it's a search, which would be a reduction on j. However, that loop would have to be pretty dang long before you'd see a performance improvement.
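For illustration, here is a sketch of that inner-loop variant: the outer loop over items stays sequential, and the search for the first fitting bin becomes a min reduction over j. Variable names mirror the question's code where possible; the function name is mine, and as noted above, the number of open bins would have to be very large before this pays off.
#include <climits>
#include <vector>

int binPackingInnerParallel(const std::vector<int>& weight, int n, int c)
{
    std::vector<int> bin_rem(n);
    int result = 0;
    for (int i = 0; i < n; i++) {              // outer loop stays sequential
        int firstFit = INT_MAX;                // index of the first bin that fits, if any
        #pragma omp parallel for reduction(min : firstFit)
        for (int j = 0; j < result; j++) {
            if (bin_rem[j] >= weight[i] && j < firstFit)
                firstFit = j;
        }
        if (firstFit != INT_MAX) {
            bin_rem[firstFit] -= weight[i];    // place the item in the first fitting bin
        } else {
            bin_rem[result] = c - weight[i];   // open a new bin
            result++;
        }
    }
    return result;
}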
This looks to me like the sort of algorithm that you'd have to reformulate before you can make it parallel.

Efficient way to count how many times a flag was set in the last n seconds

I need to track how many times a flag is enabled in the last n seconds. Below is the example code I came up with. StateHandler maintains the value of the flag in the active array for the last n (360 here) seconds. In my case the update function is called from outside every second, so when I need to know how many times the flag was set in the last 360 seconds I call getEnabledInLast360Seconds. Is it possible to do this more efficiently, e.g. without using an array of n booleans?
#include <map>
#include <iostream>

class StateHandler
{
    bool active[360];
    int index;

public:
    StateHandler() :
        index(0),
        active()
    {
    }

    void update(bool value)
    {
        if (index >= 360)
        {
            index = 0;
        }
        active[index % 360] = value;
        index++;
    }

    int getEnabledInLast360Seconds()
    {
        int value = 0;
        for (int i = 0; i < 360; i++)
        {
            if (active[i])
            {
                value++;
            }
        }
        return value;
    }
};

int main()
{
    StateHandler handler;
    handler.update(true);
    handler.update(true);
    handler.update(true);
    std::cout << handler.getEnabledInLast360Seconds();
}
Yes. Use the fact that numberOfOccurrences(0, 360) and numberOfOccurrences(1, 361) share 359 common terms. So remember the sum, subtract the term that falls out of the window, and add the new one.
void update(bool value)
{
    if (index >= 360)
    {
        index = 0;
    }
    // invariant: count reflects t-360 ... t-1
    if (active[index]) count--;
    // invariant: count reflects t-359 ... t-1
    active[index] = value;
    if (value) count++;
    // invariant: count reflects t-359 ... t
    index++;
}
(Note that the if block resetting index removes the need for the modulo operator %, so I removed it.)
Another approach would be to use subset sums:
subsum[0] = count(0...19)
subsum[1] = count(20...39)
...
subsum[17] = count(340...359)
Now you only have to add 18 numbers each time, and you can entirely replace a subsum every 20 seconds.
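A minimal sketch of that bucketed variant follows; the class and member names are mine, and note that clearing a whole bucket on entry means the reported window is only accurate to the nearest 20 seconds.
class BucketedStateHandler
{
    static const int kBuckets = 18;     // 360 / 20
    static const int kBucketSize = 20;
    int subsum[kBuckets] = {};          // enabled-count per 20-second bucket
    int second = 0;                     // 0..359, advances once per update

public:
    void update(bool value)
    {
        const int bucket = second / kBucketSize;
        if (second % kBucketSize == 0)  // entering a bucket: throw away its old contents
            subsum[bucket] = 0;
        if (value)
            subsum[bucket]++;
        second = (second + 1) % 360;
    }

    int getEnabledInLast360Seconds() const
    {
        int total = 0;
        for (int i = 0; i < kBuckets; i++)  // add 18 numbers instead of scanning 360 flags
            total += subsum[i];
        return total;
    }
};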
Instead of a fixed-size buffer, you can simply use a std::set<timestamp> (or perhaps a std::queue). Every time you check, pop off the elements older than 360 s and count the remaining ones.
If you check rarely but update often, you might want to do the popping in the update itself, to prevent the container from growing too big.
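For example, a minimal sketch of that timestamp-based approach using a std::deque (the class and member names are mine, not from the question):
#include <chrono>
#include <deque>

class TimestampedStateHandler
{
    using clock = std::chrono::steady_clock;
    std::deque<clock::time_point> enabledTimes;  // one entry per update(true)

    void prune()  // drop entries that fell out of the 360-second window
    {
        const auto cutoff = clock::now() - std::chrono::seconds(360);
        while (!enabledTimes.empty() && enabledTimes.front() < cutoff)
            enabledTimes.pop_front();
    }

public:
    void update(bool value)
    {
        if (value)
            enabledTimes.push_back(clock::now());
        prune();  // prune on update too, so the deque never grows past ~360 entries
    }

    int getEnabledInLast360Seconds()
    {
        prune();
        return static_cast<int>(enabledTimes.size());
    }
};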

Implementation of a "hits in last [second/minute/hour]" data structure

I think this is a fairly common question, but I can't seem to find an answer by googling around (maybe there's a more precise name for the problem that I don't know?).
You need to implement a structure with a hit() method used to report a hit, and hitsInLastSecond|Minute|Hour methods. You have a timer with, say, nanosecond accuracy. How do you implement this efficiently?
My thought was something like this (in pseudo-C++):
class HitCounter {
    void hit() {
        hits_at[now()] = ++last_count;
    }

    int hitsInLastSecond() {
        auto before_count = hits_at.lower_bound(now() - 1 * second);
        if (before_count == hits_at.end()) { return last_count; }
        return last_count - before_count->second;
    }

    // etc for Minute, Hour

    map<time_point, int> hits_at;
    int last_count = 0;
};
Does this work? Is it good? Is something better?
Update: Added pruning and switched to a deque as per comments:
class HitCounter {
    void hit() {
        hits.push_back(make_pair(now(), ++last_count));
    }

    int hitsInLastSecond() {
        auto before = lower_bound(hits.begin(), hits.end(), make_pair(now() - 1 * second, -1));
        if (before == hits.end()) { return last_count; }
        return last_count - before->second;
    }

    // etc for Minute, Hour

    void prune() {
        auto old = upper_bound(hits.begin(), hits.end(), make_pair(now() - 1 * hour, -1));
        if (old != hits.end()) {
            hits.erase(hits.begin(), old);
        }
    }

    deque<pair<time_point, int>> hits;
    int last_count = 0;
};
What you are describing is called a histogram.
Using a hash, if you intend nanosecond accuracy, will eat up much of your CPU. You probably want a ring buffer for storing the data.
Use std::chrono to achieve the timing precision you require, but frankly, hits per second seems like the highest granularity you need, and if you are looking at the overall big picture, it doesn't seem like the precision will matter terribly.
This is a partial, introductory sample of how you might go about it:
#include <array>
#include <algorithm>

template<size_t RingSize>
class Histogram
{
    std::array<size_t, RingSize> m_ringBuffer;
    size_t m_total;
    size_t m_position;

public:
    Histogram() : m_total(0), m_position(0)
    {
        std::fill_n(m_ringBuffer.begin(), RingSize, 0);
    }

    void addHit()
    {
        ++m_ringBuffer[m_position];
        ++m_total;
    }

    void incrementPosition()
    {
        if (++m_position >= RingSize)
            m_position = 0;
        m_total -= m_ringBuffer[m_position];
        m_ringBuffer[m_position] = 0;
    }

    double runningAverage() const
    {
        return (double)m_total / (double)RingSize;
    }

    size_t runningTotal() const { return m_total; }
};
Histogram<60> secondsHisto;
Histogram<60> minutesHisto;
Histogram<24> hoursHisto;
Histogram<7> weeksHisto;
This is a naive implementation, which assumes you will call it every second and increment the position, and that you will transpose runningTotal from one histogram to the next every RingSize ticks (so every 60 s, add secondsHisto.runningTotal() to minutesHisto).
Hopefully it will be a useful introductory place to start from.
If you want to track a longer histogram of hits per second, you can do that with this model by increasing the ring size and adding a second total to track the last N ring-buffer entries, so that m_subTotal = sum(m_ringBuffer[m_position - N .. m_position]), similar to the way m_total works.
size_t m_10sTotal;
...

void addHit()
{
    ++m_ringBuffer[m_position];
    ++m_total;
    ++m_10sTotal;
}

void incrementPosition()
{
    // subtract data from >10 sample intervals ago.
    m_10sTotal -= m_ringBuffer[(m_position + RingSize - 10) % RingSize];
    // for the naive total, do the subtraction after we
    // advance position, since it will coincide with the
    // location of the value RingSize ago.
    if (++m_position >= RingSize)
        m_position = 0;
    m_total -= m_ringBuffer[m_position];
}
You don't have to make the histograms these sizes; this is simply a naive scraping model. There are various alternatives, such as incrementing each histogram at the same time:
secondsHisto.addHit();
minutesHisto.addHit();
hoursHisto.addHit();
weeksHisto.addHit();
Each rolls over independently, so all have current values. Size each histogram according to how far back you want data at that granularity to go.

Iterative Deepening Negamax with Alpha-Beta Pruning

I have a working negamax algorithm in my program. However, I need the program to find the best possible move within kMaxTimePerMove time. I did some research, and it seemed that using iterative deepening with my negamax algorithm would be the best way to do so. Right now, my function that starts the search looks like this:
// this is a global in the same scope as the alpha-beta functions, so they can check the elapsed time
clock_t tStart;

int IterativeDeepening(Board current_state)
{
    bool overtime = false;
    int depth = 0;
    tStart = clock();
    MoveHolder best_move(-1, kWorstEvaluation);
    while ((static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC) < kMaxTimePerMove)
    {
        MoveHolder temp_move = AlphaBetaRoot(kWorstEvaluation, -best_move.evaluation_, ++depth, current_state, overtime);
        if (!overtime)
            best_move = temp_move;
    }
    return best_move.column_;
}
I think I should also be reordering the previous best move to the front of the children list; however, I am waiting to implement that until I get the basic version working. The actual alpha-beta functions look like this:
MoveHolder AlphaBetaRoot(int alpha, int beta, int remaining_depth, Board current_state, bool &overtime)
{
    MoveHolder best(-1, -1);
    if (overtime)
        return MoveHolder(0, 0);
    std::vector<Board> current_children;
    current_state.GetBoardChildren(current_children);
    for (auto i : current_children)
    {
        best.evaluation_ = -AlphaBeta(-beta, -alpha, remaining_depth - 1, i, overtime);
        if ((static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC) > kMaxTimePerMove)
        {
            overtime = true;
            return MoveHolder(0, 0);
        }
        if (best.evaluation_ >= beta)
            return best;
        if (best.evaluation_ > alpha)
        {
            alpha = best.evaluation_;
            best.column_ = i.GetLastMoveColumn();
        }
    }
    return best;
}

int AlphaBeta(int alpha, int beta, int remaining_depth, Board current_state, bool &overtime)
{
    if (overtime)
        return 0;
    if ((static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC) > kMaxTimePerMove)
    {
        overtime = true;
        return 0;
    }
    if (remaining_depth == 0 || current_state.GetCurrentResult() != kNoResult)
    {
        return current_state.GetToMove() * current_state.GetCurrentEvaluation();
    }
    std::vector<Board> current_children;
    current_state.GetBoardChildren(current_children);
    for (auto i : current_children)
    {
        int score = -AlphaBeta(-beta, -alpha, remaining_depth - 1, i, overtime);
        if (score >= beta)
        {
            return beta;
        }
        if (score > alpha)
        {
            alpha = score;
        }
    }
    return alpha;
}
When I try to debug, everything seems like it is working as expected. However, when I have the iterative deepening version play against the regular alpha-beta implementation, it consistently loses. At times it seems like it gets "stuck", and returns a terrible move.
As an example, if this program is "forced" to make a move the next turn, or else the opponent will win, it doesn't block the win. On that move, it reported that it was searching to a depth of 38. I am finding the algorithm extremely difficult to debug, because if I break the execution, it ruins the timing.
I'm not sure if I have implemented the algorithm incorrectly, or simply have a tricky bug in here. If someone could point me in the right direction, I would really appreciate it.
You are using -best_move.evaluation_ as the beta value for the search, where best_move is the best move from the previous depth. This isn't correct: suppose a move looks good at depth 2 but turns out to be bad at greater depths. This method will continue to consider it good, and will cause beta cutoffs on other moves that should not have happened.
You should search each iteration on (-infinity, infinity) to fix this. You can also use aspiration windows to limit the alpha-beta range.
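Concretely, the driver from the question would only change in the window it passes down; this is a sketch that reuses the question's types (Board, MoveHolder, AlphaBetaRoot, tStart), and kBestEvaluation is assumed here to be the positive counterpart of kWorstEvaluation, which the posted code does not show.
int IterativeDeepening(Board current_state)
{
    bool overtime = false;
    int depth = 0;
    tStart = clock();
    MoveHolder best_move(-1, kWorstEvaluation);
    while ((static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC) < kMaxTimePerMove)
    {
        // full (-infinity, +infinity) window on every iteration,
        // instead of (kWorstEvaluation, -best_move.evaluation_)
        MoveHolder temp_move = AlphaBetaRoot(kWorstEvaluation, kBestEvaluation, ++depth, current_state, overtime);
        if (!overtime)
            best_move = temp_move;
    }
    return best_move.column_;
}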
Note that since you do not use the previous iteration to improve move ordering on the next ones, iterative deepening will perform slightly worse. Ideally you want move ordering to try the best move from a transposition table and/or the principal variation of the previous iteration first.