Parallel Dijkstra - C++

I'm using OpenMP to write a parallel version of Dijkstra's algorithm. My code consists of two parts. The first part is executed by only one thread (the master), which chooses new nodes from the list. The second part is executed by the other threads, which update the distances from the source to the remaining nodes. Unfortunately there is an error in my code: one of the threads executing the second part suddenly "disappears". There is probably a data-synchronization problem, but I can't see where. I would be grateful if someone could point out my mistake. Here is the code:
map<int, int> C;
map<int, int> S;
map<int, int> D;
int init;
int nu;
int u;
int p = 3; // omp_get_num_threads();
int d;
int n = graph->getNodesNum();
#pragma omp parallel shared(n, C, d, S, init, nu, u, D, graph, p) num_threads(p)
{
    int myId = omp_get_thread_num();
    if (myId == 0)
    {
        init = 0;
        nu = 0;
        u = to;
        while (init < p - 1)
        {
        }
        while (u != 0)
        {
            S[u] = 1;
            while (nu < p - 1)
            {
            }
            u = 0;
            d = INFINITY;
            for (int i = 1; i <= p - 1; ++i)
            {
                int j = C[i];
                if ((j != 0) && (D[j] < d))
                {
                    d = D[j];
                    u = j;
                }
            }
            nu = 0;
        }
    }
    else
    {
        for (int i = myId; i <= n; i += p - 1)
        {
            D[i] = INFINITY;
            S[i] = 0;
        }
        D[u] = 0;
        ++init;
        while (init < p - 1)
        {
        }
        while (u != 0)
        {
            C[myId] = 0;
            int d = INFINITY;
            for (int i = myId; i <= n; i += p - 1)
            {
                if (S[i] == 0)
                {
                    if (i != u)
                    {
                        int cost = graph->getCostBetween(u, i);
                        if (cost != INFINITY)
                        {
                            D[i] = min(D[i], D[u] + cost);
                        }
                    }
                    if (d > D[i])
                    {
                        d = D[i];
                        C[myId] = i;
                    }
                }
            }
            ++nu;
            while (nu != 0)
            {
            }
        }
    }
}

I don't know what information you have, but parallelising an irregular, highly synchronised algorithm with small tasks is among the toughest parallel problems there are. Research teams can dedicate themselves to such tasks and achieve only limited speedups, or get nowhere at all. Often such algorithms only work on specific architectures that are tailored for the parallelisation, where quirky overheads such as false sharing have been eliminated by designing the data structures appropriately.
An algorithm such as this needs a lot of time and effort for profiling, measurement, and careful consideration. See for example this paper:
ww2.cs.fsu.edu/~flin/ppq_report.pdf
Now, on to your direct question: since your algorithm is highly synchronised and the tasks are small, you are experiencing the side effects of data races. Removing these from your parallel algorithm is going to be very tricky, and no one here can do it for you.
So your first port of call is to look at tools which can help you detect data races, such as Valgrind and the Intel Thread Checker.
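Beyond race detectors, a cheap sanity check is to validate the parallel output against a trivially correct sequential Dijkstra. A minimal O(V²) sketch over an adjacency matrix — the function name and the INF sentinel here are illustrative, not taken from your code:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Sequential Dijkstra over an adjacency matrix; INF marks "no edge".
// Returns the distance from `src` to every vertex.
std::vector<int> dijkstraRef(const std::vector<std::vector<int>>& cost, int src) {
    const int INF = std::numeric_limits<int>::max();
    int n = (int)cost.size();
    std::vector<int> dist(n, INF);
    std::vector<bool> done(n, false);
    dist[src] = 0;
    for (int step = 0; step < n; ++step) {
        int u = -1;
        for (int v = 0; v < n; ++v)       // pick the closest unfinished vertex
            if (!done[v] && (u == -1 || dist[v] < dist[u])) u = v;
        if (dist[u] == INF) break;        // remaining vertices are unreachable
        done[u] = true;
        for (int v = 0; v < n; ++v)       // relax edges out of u
            if (cost[u][v] != INF && dist[u] + cost[u][v] < dist[v])
                dist[v] = dist[u] + cost[u][v];
    }
    return dist;
}
```

Running both versions on the same small graphs and diffing the distance arrays will at least tell you quickly whether a change made things better or worse.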

Related

How to make a thread wait for a condition to perform an operation without it using too much CPU time?

I've been trying to use threads in a matrix operation to make it faster for large matrices (1000x1000). I've had some success so far with the code below, with significant speed improvements compared to using a single thread.
void updateG(Matrix &u, Matrix &g, int n, int bgx, int tamx, int tamy)
{
    int i, j;
    for (i = bgx; i < tamx; i += n)
    {
        for (j = 0; j < tamy; j++)
        {
            g(i, j, g(i, j) + dt * 0.5 * (u(i, j) - (g(i, j) * y)));
        }
    }
}

void updateGt(Matrix &u, Matrix &g, int tam)
{
    int i;
    const int n = NT;
    std::thread array[n];
    for (int i = 0; i < n; i++)
    {
        array[i] = std::thread(updateG, std::ref(u), std::ref(g), n, i, tam, tam);
    }
    joinAll(array, n);
}
However, I need to call this operation several times in the main code, and every time that happens I must initialize the thread array again, creating new threads and wasting a lot of time (from what I've read online, thread creation is expensive).
So I developed an alternative solution: create and initialize the thread array only once, and reuse the same threads to perform the matrix operations every time the function is called, using some flag variables so a thread only performs the operation when it has to. Like in the following code:
void updateG(int bgx, int tam)
{
    while (!flaguGkill[bgx]) {
        if (flaguG[bgx]) {
            int i, j;
            for (i = bgx; i < tam; i += NT)
            {
                for (j = 0; j < tam; j++)
                {
                    g->operator()(i, j, g->operator()(i, j) + dt * 0.5 * (u->operator()(i, j) - (g->operator()(i, j) * y)));
                }
            }
            flaguG[bgx] = false;
        }
    }
}

void updateGt()
{
    for (int k = 0; k < NT; k++)
    {
        flaguG[k] = true;
    }
    for (int i = 0; i < NT; i++)
    {
        while (flaguG[i]);
    }
}
My problem is that this solution, which is supposed to be faster, is much slower than the first one, by a large margin. In my complete code I have two functions like this, updateGt and updateXt, and I'm using 4 threads for each. I believe the problem is that while a thread is supposed to be idle waiting, it is instead using a lot of CPU time just to keep checking the condition. Does anyone know if that is really the case, and if so, how I could fix it?
The problem here is called busy waiting. As mentioned in the comments, you'll want to use std::condition_variable, like this:
std::mutex mtx;
std::condition_variable cv;

while (!flaguGkill[bgx]) {
    {
        std::unique_lock<std::mutex> lock(mtx); // acquire the mutex, as required by the condition variable
        cv.wait(lock, [this]{ return flaguG[bgx]; }); // the thread suspends here (releasing the lock) until the predicate returns true
    }
    int i, j;
    for (i = bgx; i < tam; i += NT)
    {
        for (j = 0; j < tam; j++)
        {
            g->operator()(i, j, g->operator()(i, j) + dt * 0.5 * (u->operator()(i, j) - (g->operator()(i, j) * y)));
        }
    }
    flaguG[bgx] = false;
}
Note: in the section [this]{ return flaguG[bgx]; }, you may need to alter the capture parameters (the bit in the []) depending on the scope of those variables.
Where you set the flag to true, you then need to call notify:
for (int k = 0; k < NT; k++)
{
    flaguG[k] = true;
    cv.notify_one();
}
// you can then use another condition variable here
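For completeness, here is a self-contained sketch of the whole pattern: one persistent worker that sleeps on a condition variable between jobs and signals completion back. All names (`Worker`, `run`, `hasWork`) are made up for illustration; your flag arrays and matrix job would slot into the marked places:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One persistent worker thread that sleeps on a condition variable between
// jobs instead of spinning on a flag. Illustrative sketch, not the OP's types.
struct Worker {
    std::mutex m;
    std::condition_variable cv;
    bool hasWork = false, done = false, quit = false;
    std::vector<int>* data = nullptr;

    std::thread t{[this] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return hasWork || quit; }); // suspends, no busy wait
            if (quit) return;
            for (int& x : *data) x *= 2;   // the "job" goes here
            hasWork = false;
            done = true;
            cv.notify_all();               // wake the caller waiting for completion
        }
    }};

    void run(std::vector<int>& v) {        // hand a job to the worker and wait for it
        { std::lock_guard<std::mutex> lock(m); data = &v; hasWork = true; done = false; }
        cv.notify_all();
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });
    }

    ~Worker() {                            // tell the worker to exit, then join it
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
        t.join();
    }
};
```

Note that the flags are only ever read or written while holding the mutex; that is what makes the handshake race-free, in contrast to the plain bool arrays in the busy-wait version.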

Graph rerouting algorithm (HackerRank)

I tried to solve the problem Rerouting at HackerRank. I am posting here for help since the competition is over.
https://www.hackerrank.com/contests/hack-the-interview-v-asia-pacific/challenges/rerouting
I tried to solve the problem using strongly connected components, but the test cases failed. I understand that we have to remove cycles, but I'm stuck on how to approach the problem. Below is the solution I wrote. I am looking for guidance on how to move forward, so that I can apply this knowledge in the future based on the mistakes I made here. Thanks for your time and help.
int getMinConnectionChange(vector<int> connection) {
    // Idea: Get the number of strongly connected components.
    int numberOfVertices = connection.size();
    for (int idx = 0; idx < numberOfVertices; idx++) {
        cout << idx + 1 << ":" << connection[idx] << endl;
    }
    stack<int> stkVertices;
    map<int, bool> mpVertexVisited; // is vertex visited; think of this as a chalk mark for visited nodes
    int numOFSCCs = 0;
    int currTime = 1;
    for (int vertexId = 0; vertexId < numberOfVertices; vertexId++) {
        // check if node is already visited
        if (mpVertexVisited.find(vertexId + 1) == mpVertexVisited.end()) {
            numOFSCCs++;
            mpVertexVisited.insert(make_pair(vertexId + 1, true));
            stkVertices.push(vertexId + 1);
            currTime++;
            while (!stkVertices.empty()) {
                int iCurrentVertex = stkVertices.top();
                stkVertices.pop();
                // get adjacent vertices; in this exercise each node has only one neighbour, i.e. one edge
                int neighbourVertexId = connection[iCurrentVertex - 1];
                // if the vertex is already visited, don't insert it into the stack
                if (mpVertexVisited.find(neighbourVertexId) != mpVertexVisited.end()) {
                    continue;
                }
                mpVertexVisited.insert(make_pair(neighbourVertexId, true));
                stkVertices.push(neighbourVertexId);
            } // while loop
        } // if condition
    } // for loop over vertices
    return numOFSCCs - 1;
}
This is a problem that I just solved, and I would like to share the solution.
The problem can be solved with union-find. There are two main observations:
The number of edges that have to be changed is the number of components minus 1 (the components are not necessarily strongly connected), so union-find is handy here for counting the components.
The second observation is that some components may not have a terminating node; consider 1 <-> 2 — in other words, a cycle exists. A terminating node is a node that points to itself, i.e. has no outgoing edge to a different node.
If all components contain a cycle, we need to change one edge in every component instead of (number of components - 1). This makes the graph end up with a terminating point.
Code:
struct UF {
    vector<int> p, rank, size;
    int cnt;
    UF(int N) {
        p = rank = size = vector<int>(N, 1);
        for (int i = 0; i < N; i++) p[i] = i;
        cnt = N;
    }
    int find(int i) {
        return p[i] == i ? i : p[i] = find(p[i]);
    }
    bool connected(int i, int j) {
        return find(i) == find(j);
    }
    void join(int i, int j) {
        if (connected(i, j)) return;
        int x = find(i), y = find(j);
        cnt--;
        if (rank[x] > rank[y]) {
            p[y] = x;
            size[x] += size[y];
        } else {
            p[x] = y;
            size[y] += size[x];
            if (rank[x] == rank[y]) rank[y]++;
        }
    }
};

int getMinConnectionChange(vector<int> connection) {
    int nonCycle = 0;
    int n = connection.size();
    UF uf(n);
    for (int i = 0; i < n; i++) {
        int to = connection[i] - 1;
        if (to == i) nonCycle++;
        else uf.join(i, to);
    }
    int components = uf.cnt;
    int countCycle = uf.cnt - nonCycle;
    int res = components - 1;
    if (countCycle == components) res++; // all components have a cycle
    return res;
}
TL;DR: you can view this as a minimum spanning arborescence problem.
More precisely, add a node for each server, and another one called "Terminate".
Make a complete graph (each node is linked to every other one) and set the cost to 0 for the edges corresponding to your input, and 1 for all other edges.
You can then use, for example, Edmonds' algorithm to solve this.

If statement branch prediction issue

I am implementing a pattern-search algorithm that has a vital if statement that seems to be unpredictable in its outcome. Random files are searched, so sometimes the branch predictions are okay, and sometimes they are terrible if the file is completely random. My goal is to eliminate the if statement, but my attempts so far, like preallocating a vector, have yielded slow results: the number of possible patterns can be very large, so preallocation takes a lot of time. I therefore use a dynamic vector where I initialize all entries to NULL up front and then check with the if statement whether a pattern is present. The if seems to be killing me, specifically the cmp assembly instruction: bad branch predictions are flushing the pipeline a lot and causing huge slowdowns. Any ideas for eliminating the if statement marked in the code below? I'm stuck in a rut.
for (PListType i = 0; i < prevLocalPListArray->size(); i++)
{
    vector<vector<PListType>*> newPList(256, NULL);
    vector<PListType>* pList = (*prevLocalPListArray)[i];
    PListType pListLength = (*prevLocalPListArray)[i]->size();
    PListType earlyApproximation = ceil(pListLength / 256);
    for (PListType k = 0; k < pListLength; k++)
    {
        // If the pattern is past the end of the string stream, stop counting this pattern
        if ((*pList)[k] < file->fileStringSize)
        {
            uint8_t indexer = ((uint8_t)file->fileString[(*pList)[k]]);
            if (newPList[indexer] != NULL) // Problem if statement!!!
            {
                newPList[indexer]->push_back(++(*pList)[k]);
            }
            else
            {
                newPList[indexer] = new vector<PListType>(1, ++(*pList)[k]);
                newPList[indexer]->reserve(earlyApproximation);
            }
        }
    }
    // Deallocate, or stuff patterns into the global list
    for (int z = 0; z < newPList.size(); z++)
    {
        if (newPList[z] != NULL)
        {
            if (newPList[z]->size() >= minOccurrence)
            {
                globalLocalPListArray->push_back(newPList[z]);
            }
            else
            {
                delete newPList[z];
            }
        }
    }
    delete (*prevLocalPListArray)[i];
}
Here is the code without indirection, with the proposed changes...
vector<vector<PListType>> newPList(256);
for (PListType i = 0; i < prevLocalPListArray.size(); i++)
{
    const vector<PListType>& pList = prevLocalPListArray[i];
    PListType pListLength = prevLocalPListArray[i].size();
    for (PListType k = 0; k < pListLength; k++)
    {
        // If the pattern is past the end of the string stream, stop counting this pattern
        if (pList[k] < file->fileStringSize)
        {
            uint8_t indexer = ((uint8_t)file->fileString[pList[k]]);
            newPList[indexer].push_back((pList[k] + 1));
        }
        else
        {
            totalTallyRemovedPatterns++;
        }
    }
    for (int z = 0; z < 256; z++)
    {
        if (newPList[z].size() >= minOccurrence/* || (Forest::outlierScans && pList->size() == 1)*/)
        {
            globalLocalPListArray.push_back(newPList[z]);
        }
        else
        {
            totalTallyRemovedPatterns++;
        }
        newPList[z].clear();
    }
    vector<PListType> temp;
    temp.swap(prevLocalPListArray[i]);
}
Here is the most up-to-date program, which manages not to use three times the memory and does not require an if statement. The only bottleneck seems to be the newPList[indexIntoFile].push_back(++index); statement. This bottleneck could be a cache-coherency issue when indexing the array, because the patterns are random. When I search a binary file of just 1s and 0s, there is no latency in the push_back's indexing. That is why I believe it is cache thrashing. Do you see any remaining room for optimization in this code? You have been a great help so far. @bogdan @harold
vector<PListType> newPList[256];
PListType prevPListSize = prevLocalPListArray->size();
PListType indexes[256] = {0};
PListType indexesToPush[256] = {0};
for (PListType i = 0; i < prevPListSize; i++)
{
    vector<PListType>* pList = (*prevLocalPListArray)[i];
    PListType pListLength = (*prevLocalPListArray)[i]->size();
    if (pListLength > 1)
    {
        for (PListType k = 0; k < pListLength; k++)
        {
            // If the pattern is past the end of the string stream, stop counting this pattern
            PListType index = (*pList)[k];
            if (index < file->fileStringSize)
            {
                uint_fast8_t indexIntoFile = (uint8_t)file->fileString[index];
                newPList[indexIntoFile].push_back(++index);
                indexes[indexIntoFile]++;
            }
            else
            {
                totalTallyRemovedPatterns++;
            }
        }
        int listLength = 0;
        for (PListType k = 0; k < 256; k++)
        {
            if (indexes[k])
            {
                indexesToPush[listLength++] = k;
            }
        }
        for (PListType k = 0; k < listLength; k++)
        {
            int insert = indexes[indexesToPush[k]];
            if (insert >= minOccurrence)
            {
                int index = globalLocalPListArray->size();
                globalLocalPListArray->push_back(new vector<PListType>());
                (*globalLocalPListArray)[index]->insert((*globalLocalPListArray)[index]->end(), newPList[indexesToPush[k]].begin(), newPList[indexesToPush[k]].end());
                indexes[indexesToPush[k]] = 0;
                newPList[indexesToPush[k]].clear();
            }
            else if (insert == 1)
            {
                totalTallyRemovedPatterns++;
                indexes[indexesToPush[k]] = 0;
                newPList[indexesToPush[k]].clear();
            }
        }
    }
    else
    {
        totalTallyRemovedPatterns++;
    }
    delete (*prevLocalPListArray)[i];
}
Here are the benchmarks. I didn't think they would be readable in the comments, so I am placing them in an answer. The percentages on the left show what share of the time is spent on each line of code.

Dijkstra's algorithm openmp strange behavior

I'm trying to run an OpenMP implementation of Dijkstra's algorithm which I downloaded here: heather.cs.ucdavis.edu/~matloff/OpenMP/Dijkstra.c
If I add, for example, one more vertex (going from 5 to 6), so that the path from vertex 0 to it goes through two other vertices, my program fails to give a correct result, saying that the distance between vertex 0 and the new vertex is infinite :^(
What can be the reason?
#define LARGEINT 2<<30-1 // "infinity"
#define NV 6

// global variables, all shared by all threads by default
int ohd[NV][NV], // 1-hop distances between vertices
    mind[NV],    // min distances found so far
    notdone[NV], // vertices not checked yet
    nth,         // number of threads
    chunk,       // number of vertices handled by each thread
    md,          // current min over all threads
    mv;          // vertex which achieves that min

void init(int ac, char **av)
{
    int i, j;
    for (i = 0; i < NV; i++)
        for (j = 0; j < NV; j++) {
            if (j == i) ohd[i][i] = 0;
            else ohd[i][j] = LARGEINT;
        }
    ohd[0][1] = ohd[1][0] = 40;
    ohd[0][2] = ohd[2][0] = 15;
    ohd[1][2] = ohd[2][1] = 20;
    ohd[1][3] = ohd[3][1] = 10;
    ohd[1][4] = ohd[4][1] = 25;
    ohd[2][3] = ohd[3][2] = 100;
    ohd[1][5] = ohd[5][1] = 6;
    ohd[4][5] = ohd[5][4] = 8;
    for (i = 1; i < NV; i++) {
        notdone[i] = 1;
        mind[i] = ohd[0][i];
    }
}

// finds closest to 0 among notdone, among s through e
void findmymin(int s, int e, int *d, int *v)
{
    int i;
    *d = LARGEINT;
    for (i = s; i <= e; i++)
        if (notdone[i] && mind[i] < *d) {
            *d = ohd[0][i];
            *v = i;
        }
}

// for each i in [s,e], ask whether a shorter path to i exists, through mv
void updateohd(int s, int e)
{
    int i;
    for (i = s; i <= e; i++)
        if (mind[mv] + ohd[mv][i] < mind[i])
            mind[i] = mind[mv] + ohd[mv][i];
}

void dowork()
{
    #pragma omp parallel // Note 1
    {
        int startv, endv, // start, end vertices for this thread
            step,         // whole procedure goes NV steps
            mymd,         // min value found by this thread
            mymv,         // vertex which attains that value
            me = omp_get_thread_num(); // my thread number
        #pragma omp single // Note 2
        {
            nth = omp_get_num_threads();
            chunk = NV / nth;
            printf("there are %d threads\n", nth);
        }
        // Note 3
        startv = me * chunk;
        endv = startv + chunk - 1;
        for (step = 0; step < NV; step++) {
            // find closest vertex to 0 among notdone; each thread finds
            // closest in its group, then we find overall closest
            #pragma omp single
            { md = LARGEINT; mv = 0; }
            findmymin(startv, endv, &mymd, &mymv);
            // update overall min if mine is smaller
            #pragma omp critical // Note 4
            {
                if (mymd < md)
                { md = mymd; mv = mymv; }
            }
            // mark new vertex as done
            #pragma omp single
            { notdone[mv] = 0; }
            // now update my section of ohd
            updateohd(startv, endv);
            #pragma omp barrier
        }
    }
}

int main(int argc, char **argv)
{
    int i;
    init(argc, argv);
    dowork();
    // back to single thread now
    printf("minimum distances:\n");
    for (i = 1; i < NV; i++)
        printf("%d\n", mind[i]);
}
There are two problems here:
If the number of threads doesn't evenly divide the number of values, then this division of work
startv = me * chunk;
endv = startv + chunk - 1;
is going to leave the last (NV - nth*(NV/nth)) elements undone, which means their distances are left at LARGEINT. This can be fixed in any number of ways; the easiest for now is to give all the remaining work to the last thread:
if (me == (nth-1)) endv = NV-1;
(This leads to more load imbalance than is necessary, but is a reasonable start to get the code working.)
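A small helper makes the intended ranges easy to test in isolation (the function name is mine, not from the original code):

```cpp
#include <cassert>
#include <utility>

// Inclusive [start, end] range for thread `me` of `nth` over `nv` vertices.
// The last thread absorbs the remainder when nth does not divide nv evenly.
std::pair<int, int> threadRange(int me, int nth, int nv) {
    int chunk = nv / nth;
    int startv = me * chunk;
    int endv = (me == nth - 1) ? nv - 1 : startv + chunk - 1;
    return {startv, endv};
}
```

With NV = 6 and 4 threads, chunk is 1, so the original code leaves vertices 4 and 5 to nobody; here thread 3 gets the range [3, 5] instead of [3, 3].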
The other issue is that a barrier has been left out before setting notdone[]
#pragma omp barrier
#pragma omp single
{ notdone[mv] = 0; }
This makes sure notdone is updated and updateohd() is started only after everyone has finished their findmymin() and updated md and mv.
Note that it's very easy to introduce errors into the original code you started with; the global variables it uses make it very difficult to reason about. John Burkardt has a nicer version of this same algorithm for teaching on his website, which is almost excessively well commented and easier to trace through.

How to derive the execution time growth rate(Big O)?

So I have the following code and I need to derive its execution-time growth rate, but I have no idea where to start. My question is: how do I go about doing this? Any help would be appreciated.
Thank you.
// function to merge two sorted arrays
int merge(int smax, char sArray[], int tmax, char tArray[], char target[])
{
    int m, s, t;
    for (m = s = t = 0; s < smax && t < tmax; m++)
    {
        if (sArray[s] <= tArray[t])
        {
            target[m] = sArray[s];
            s++;
        }
        else
        {
            target[m] = tArray[t];
            t++;
        }
    }
    int compCount = m;
    for (; s < smax; m++)
    {
        target[m] = sArray[s++];
    }
    for (; t < tmax; m++)
    {
        target[m] = tArray[t++];
    }
    return compCount;
}
It's actually very simple.
Look: the first for loop increases either s or t at each iteration, so it's O(smax + tmax). The second loop is obviously O(smax), the third O(tmax). Altogether we get O(smax + tmax).
(There are some cleverer ways to prove this, but I've intentionally left them out.)
All loops are bounded in their number of iterations by (smax + tmax). So you could say the algorithm is O(max(smax, tmax)), or equivalently O(smax + tmax).
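You can also check the bound empirically: every iteration of every loop consumes exactly one element of sArray or tArray, so the total iteration count is exactly smax + tmax. An instrumented copy of the merge (hypothetical name, same logic) that counts iterations:

```cpp
#include <cassert>

// Instrumented copy of the merge above: returns the total number of loop
// iterations, which is always exactly smax + tmax.
int mergeIterations(int smax, const char s[], int tmax, const char t[], char out[]) {
    int m = 0, i = 0, j = 0, iters = 0;
    while (i < smax && j < tmax) { out[m++] = (s[i] <= t[j]) ? s[i++] : t[j++]; ++iters; }
    while (i < smax) { out[m++] = s[i++]; ++iters; } // copy leftover s elements
    while (j < tmax) { out[m++] = t[j++]; ++iters; } // copy leftover t elements
    return iters;
}
```

Whatever inputs you feed it, the count comes back as smax + tmax, which is the linear growth rate the analysis above predicts.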