I am currently trying to determine whether a clique of size k exists in an undirected graph, using OpenMP to make the code run faster.
This is the code I am trying to parallelize:
bool Graf::doesCliqueSizeKExistParallel(int k) {
    if (k > n) return false;
    clique_parallel.resize(k);
    bool foundClique = false;
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        if (degree[i] >= k - 1) {
            clique_parallel[0] = i;
            #pragma omp task
            doesCliqueSizeKExistParallelRecursive(i + 1, 1, k, foundClique);
        }
    }
    return foundClique;
}
void Graf::doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool &foundClique) {
    for (int j(node); j < n; j++) {
        if (degree[j] >= k - 1) {
            clique_parallel[currentLength - 1] = j;
            bool isClique = true;
            for (int i(0); i < currentLength; i++) {
                for (int l(i + 1); l < currentLength; l++) {
                    if (!neighbors[clique_parallel[i]][clique_parallel[l]]) { isClique = false; break; }
                }
                if (!isClique) break;
            }
            if (isClique) {
                if (currentLength < k)
                    doesCliqueSizeKExistParallelRecursive(j + 1, currentLength + 1, k, foundClique);
                else {
                    foundClique = true;
                    return;
                }
            }
        }
    }
}
The problem, I suppose, could be that the variables degree, neighbors, and clique_parallel are all shared: while one thread is writing to one of them, another thread comes along and writes to the same variable. The only solution I tried was to pass these three variables to the function by copy, so that each thread gets its own, but that didn't work. I am trying not to use #pragma omp taskwait, because that would just reproduce the sequential algorithm and there would be no speedup. Right now I am lost: I don't know how to fix this issue (if it even is an issue), what else to try, or how to avoid sharing these variables between threads.
Here is the class Graf:
class Graf {
    int n;                          // number of nodes
    vector<vector<int>> neighbors;  // adjacency matrix
    vector<int> degree;             // number of nodes each node is adjacent to
    vector<int> clique_parallel;
    bool directGraph;
    void doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool &foundClique);
public:
    Graf(int n, bool directGraph = false);
    void addEdge(int i, int j);
    void printGraf();
    bool doesCliqueSizeKExistParallel(int k);
};
So my question is: is the problem in this code that the threads are fighting over the shared variables, or could it be something else? Any help is useful, and if you have any questions regarding the code, I'll answer.
Your observation that omp taskwait turns this into the sequential algorithm is sort of correct. It's in fact worse: it turns your depth-first search into effectively a breadth-first search, which would traverse the whole search space.
Ok. First of all, use a taskgroup, which implicitly waits at the end of the region for all tasks generated inside it, for instance by a task-generating for loop.
Next, let your tasks return either false, or the values of a clique found.
Now for the big trick: once one task has found a solution, call omp cancel taskgroup, which makes you leave the for loop while keeping the values you found. This cancel kills all other tasks (and their recursively spawned tasks) at that level. And then the magic of recursion kicks in: all taskgroups higher up the tree get cancelled too.
I once figured this out for another recursive search problem, but I'm sure you can translate it to your problem: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#Treetraversal
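To make the pattern concrete, here is a minimal sketch of the taskgroup-plus-cancellation idea applied to a generic recursive search. The names and structure (search, found, the target value 100) are illustrative, not your clique code, and cancellation only takes effect when the program runs with OMP_CANCELLATION=true in the environment:
#include <cstdio>
#include <omp.h>

bool found = false;   // flag only ever flips false -> true

void search(int depth, int value) {
    if (value == 100) {                 // stand-in for "clique found"
        #pragma omp atomic write
        found = true;
        return;
    }
    if (depth == 0) return;
    #pragma omp taskgroup               // implicit wait at the closing brace
    {
        for (int c = 0; c < 4; c++) {
            #pragma omp task firstprivate(c)
            {
                search(depth - 1, 4 * value + c);
                bool f;
                #pragma omp atomic read
                f = found;
                if (f) {
                    #pragma omp cancel taskgroup          // kill sibling tasks
                }
                #pragma omp cancellation point taskgroup  // let this task exit early
            }
        }
    }   // every task of this group has finished or been cancelled here
}

int main() {
    #pragma omp parallel
    #pragma omp single
    search(5, 1);
    std::printf("found = %d\n", (int)found);
}
When a task sets found and cancels its taskgroup, the taskgroup at that level finishes; the parent then sees found, cancels its own taskgroup, and so on up the recursion, which is exactly the cascade described above.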
Related
I am trying to figure out how to parallelize a depth-first traversal with OpenMP.
This is the algorithm:
void dfs(int v) {
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            dfs(g[v][i]);
        }
    }
}
Here is what I tried:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <queue>
#include <sstream>
#include <omp.h>
#include <fstream>
#include <vector>
using namespace std;
vector<int> output;
vector<bool> visited;
vector<vector<int>> g;
void dfs(int v)
{
    printf(" thread %i", omp_get_thread_num());
    visited[v] = true;
    output.push_back(v);
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            #pragma omp task shared(visited)
            {
                #pragma omp critical
                {
                    dfs(g[v][i]);
                }
            }
        }
    }
}
int main() {
    omp_set_num_threads(5);
    int length = 1000;
    int e = 4;
    for (int i = 0; i < length; i++) {
        visited.push_back(false);
    }
    g.resize(length);
    // generate a random graph: e distinct out-edges per node
    for (int x = 0; x < g.size(); x++) {
        int p = 0;
        while (p < e) {
            int new_e = rand() % length;
            if (new_e != x) {
                bool check = false;
                for (int c = 0; c < g[x].size(); c++) {
                    if (g[x][c] == new_e) {
                        check = true;
                    }
                }
                if (check == false) {
                    g[x].push_back(new_e);
                    p++;
                }
            }
        }
    }
    ofstream fin("input.txt");
    for (int i = 0; i < g.size(); i++)
    {
        for (int j = 0; j < g[i].size(); j++)
        {
            fin << g[i][j] << " ";
        }
        fin << endl;
    }
    fin.close();
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            dfs(0);
        }
    }
    double end = omp_get_wtime();
    cout << endl;
    printf("Work took %f seconds\n", end - start);
    ofstream fout("output.txt");
    for (int i = 0; i < output.size(); i++) {
        fout << output[i] << " ";
    }
    fout.close();
}
The graph g is generated and written to the file input.txt; the result of the program is written to the file output.txt.
But this does not work with any number of threads, and it is much slower than the sequential version.
I tried to use taskwait, but in that case only one thread works.
A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks could make that call simultaneously. Moreover, since dfs() is recursive, any top-level task will have to wait for the entire recursion to finish before it could exit the critical section and allow a task in another thread to execute.
You need to synchronise where it will not interfere with the recursive call, and only protect the updates to shared data that does not provide its own internal synchronisation. This is the original code:
void dfs(int v) {
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            dfs(g[v][i]);
        }
    }
}
A naive but still parallel version would be:
void dfs(int v) {
    #pragma omp critical
    {
        visited[v] = true;
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}
Here, the code leaves the critical section as soon as the tasks are created. The problem here is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls in parallel, they will execute one after another sequentially and not in parallel. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.
One important note is that OpenMP critical sections, just as regular OpenMP locks, are not re-entrant, so a thread could easily deadlock itself due to encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.
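For illustration, a sketch of such a re-entrant critical section built on an OpenMP nested lock might look like this (reusing the question's globals visited and g; the lock must be initialised once, before the traversal starts, and this still has the "one big lock" drawback discussed above):
#include <omp.h>

omp_nest_lock_t dfs_lock;   // initialise once with omp_init_nest_lock(&dfs_lock)

void dfs(int v)
{
    omp_set_nest_lock(&dfs_lock);     // same thread may re-acquire without deadlocking
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            #pragma omp task
            dfs(g[v][i]);             // safe even if the task is executed inline (undeferred)
        }
    }
    omp_unset_nest_lock(&dfs_lock);
}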
The reason for that code being slower than sequential is that it does nothing else except traversing the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between updating visited and the loop over the unvisited neighbours:
void dfs(int v) {
    #pragma omp critical
    visited[v] = true;

    // DO SOME WORK

    #pragma omp critical
    {
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}
The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.
There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks using simple modulo arithmetic, as described here. It is also possible to stop creating tasks at a certain level of recursion and call a sequential version of dfs() instead.
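A sketch of that lock-pool idea (illustrative names; note that visited should be a vector of char rather than the bit-packed vector<bool>, otherwise neighbouring elements share bytes and per-index locking is not enough):
#include <omp.h>
#include <vector>

constexpr int NLOCKS = 64;
omp_lock_t locks[NLOCKS];       // initialise each with omp_init_lock()

std::vector<char> visited;      // char, not bool: one byte per element

// Atomically test-and-set visited[v]; returns true if this call claimed it first.
bool claim(int v)
{
    omp_lock_t* l = &locks[v % NLOCKS];   // map the index onto a small lock pool
    omp_set_lock(l);
    bool first = !visited[v];
    visited[v] = true;
    omp_unset_lock(l);
    return first;
}
The recursive function would then call claim(v) and only descend (and spawn a task) when it returns true, so two nodes contend only when their indices map to the same lock.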
void p_dfs(int v)
{
    #pragma omp critical
    visited[v] = true;
    #pragma omp parallel for
    for (int i = 0; i < graph[v].size(); ++i)
    {
        #pragma omp critical
        if (!visited[graph[v][i]])
        {
            #pragma omp task
            p_dfs(graph[v][i]);
        }
    }
}
OpenMP is good for data-parallel code, where the amount of work is known in advance. It doesn't work well for graph algorithms like this one.
If the only thing you do is what's in your code (push elements into a vector), parallelism is going to make it slower. Even if you have many gigabytes of graph data, the bottleneck is memory, not compute, so multiple CPU cores won't help. Also, if all threads push results to the same vector, you'll need synchronization. Moreover, reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.
If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use the CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and OS X, see Grand Central Dispatch. On Linux, see cp_thread_pool_create(3), but unlike the other two I don't have any hands-on experience with it; I just found the docs.
Regardless of which thread pool implementation you end up using, you'll then be able to post work to the pool dynamically as you traverse the graph. OpenMP also has a thread pool under the hood, but its API is not flexible enough for dynamic parallelism.
First of all, I should say that I am new to multithreading and know very little about it. I was writing some programs in C++ using threads and ran into a problem that I will try to explain now:
I wanted to use several threads to fill an array, here is my code:
static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
    {
        A[n] = tid;
        n++;
    }
}
//------------------------------------------------------------
int main()
{
    thread t[num_threads];
    n = 0;
    for (int i = 0; i < num_threads; i++)
    {
        t[i] = thread(ThreadFunc, i);
    }
    for (int i = 0; i < num_threads; i++)
    {
        t[i].join();
    }
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand it, the second thread starts writing elements to the array only after the first thread has finished writing all of its elements.
The question is: why don't the threads work concurrently? I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.
Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
std::mutex mtx;
int next_n() {
    std::unique_lock<std::mutex> lock(mtx);
    return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so not appropriate here. In more complex situations it will be the right solution.
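Putting the atomic fix into the original program, a complete version might look like this (a sketch keeping the same 5 threads and 5 writes per thread as in the question):
#include <atomic>
#include <iostream>
#include <thread>
using namespace std;

static const int num_threads = 5;
int A[50];
atomic<int> n{0};

void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
        A[n++] = tid;   // fetch-and-increment happens as one atomic step
}

int main()
{
    thread t[num_threads];
    for (int i = 0; i < num_threads; i++)
        t[i] = thread(ThreadFunc, i);
    for (int i = 0; i < num_threads; i++)
        t[i].join();
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
Each thread now claims a distinct slot of A before writing, so no element is lost or overwritten, though the interleaving of tids still varies from run to run.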
The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
for (int i = first; i < last; i++)
A[i] = tid;
}
Inside main, modify the thread-creation loop:
for (int first = 0, i = 0; i < num_threads; i++) {
    // num_threads may not evenly divide the array size
    int last = (i != num_threads-1) ? std::size(A)/num_threads*(i+1) : std::size(A);
    t[i] = thread(ThreadFunc, i, first, last);
    first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.
As stated above, I have been trying to craft a simple parallel loop, but it behaves inconsistently for different numbers of threads. Here is my code (testable!):
#include <iostream>
#include <stdio.h>
#include <vector>
#include <utility>
#include <string>
using namespace std;
int row = 5, col = 5;
int token = 1;
int ar[20][20] = {0};
int main(void)
{
    unsigned short j_end = 1, k = 1;
    unsigned short mask;
    for (unsigned short i = 1; i <= (row + col - 1); i++)
    {
        #pragma omp parallel default(none) shared(ar) firstprivate(k, row, col, i, j_end, token) private(mask)
        {
            if (i > row) {
                mask = row;
            }
            else {
                mask = i;
            }
            #pragma omp for schedule(static, 2)
            for (unsigned short j = k; j <= j_end; j++)
            {
                ar[mask][j] = token;
                if (mask > 1) {
                    #pragma omp critical
                    {
                        mask--;
                    }
                }
            } // inner loop - barrier
        } // end parallel
        token++;
        if (j_end == col) {
            k++;
            j_end = col;
        }
        else {
            j_end++;
        }
    } // outer loop

    // print the array
    for (int i = 0; i < row + 2; i++)
    {
        for (int j = 0; j < col + 2; j++)
        {
            cout << ar[i][j] << " ";
        }
        cout << endl;
    }
    return 0;
} // main
I believe most of the code is self-explanatory, but to sum it up: I have two loops, and the inner one iterates through the inverse diagonals of the square matrix ar[row][col] (the row and col variables can be used to change the total size of ar).
Visual aid: desired output for 5x5 ar (serial version)
(Note: This does happen when OMP_NUM_THREADS=1 too.)
But when OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4, the output is different (screenshot of the incorrect output omitted).
The serial (and for 1 thread) code is consistent so I don't think the implementation is problematic. Also, given the output of the serial code, there shouldn't be any dependencies in the inner loop.
I have also tried:
Vectorizing
threadprivate counters for the inner loop
But nothing seems to work so far...
Is there a fault in my approach, or did I miss something API-wise that led to this behavior?
Thanks for your time in advance.
Analyzing the algorithm
As you noted, the algorithm itself has no dependencies in the inner or outer loop. An easy way to show this is to move the parallelism "up" to the outer loop so that you can iterate across all the different inverse diagonals simultaneously.
Right now, the main problem with the algorithm you've written is that it's presented as a serial algorithm in both the inner and outer loop. If you're going to parallelize across the inner loop, then mask needs to be handled specially. If you're going to parallelize across the outer loop, then j_end, token, and k need to be handled specially. By "handled specially," I mean they need to be computed independently of the other threads. If you try adding critical regions into your code, you will kill all performance benefits of adding OpenMP in the first place.
Fixing the problem
In the following code, I parallelize over the outer loop. i corresponds to what you call token. That is, it is both the value to be added to the inverse diagonal and the assumed starting length of this diagonal. Note that for this to parallelize correctly, length, startRow, and startCol must be calculated as a function of i independently from other iterations.
Finally note that once the algorithm is re-written this way, the actual OpenMP pragma is incredibly simple. Every variable is assumed to be shared by default because they're all read-only. The only exception is ar in which we are careful never to overwrite another thread's value of the array. All variables that must be private are only created inside the parallel loop and thus are thread-private by definition. Lastly, I've changed the schedule to dynamic to showcase that this algorithm exhibits load-imbalance. In your example if you had 9 threads (the worst case scenario), you can see how the thread assigned to i=5 has to do much more work than the thread assigned to i=1 or i=9.
Example code
#include <iostream>
#include <omp.h>
int row = 5;
int col = 5;
#define MAXSIZE 20
int ar[MAXSIZE][MAXSIZE] = {0};
int main(void)
{
    // What an easy pragma!
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (unsigned short i = 1; i < (row + col); i++)
    {
        // Calculates the length of the current diagonal to consider
        // INDEPENDENTLY from other i iterations!
        unsigned short length = i;
        if (i > row) {
            length -= (i-row);
        }
        if (i > col) {
            length -= (i-col);
        }
        // Calculates the starting coordinate to start at
        // INDEPENDENTLY from other i iterations!
        unsigned short startRow = i;
        unsigned short startCol = 1;
        if (startRow > row) {
            startCol += (startRow-row);
            startRow = row;
        }
        for (unsigned short offset = 0; offset < length; offset++) {
            ar[startRow-offset][startCol+offset] = i;
        }
    } // outer loop

    // print the array
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= col; j++)
        {
            std::cout << ar[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
} // main
Final points
I want to leave you with a few final points.
If you are only adding parallelism on a small array (row,col < 1e6), you will most likely not get any benefits from OpenMP. On a small array, the algorithm itself will take microseconds, while setting up the threads could take milliseconds... slowing down execution time considerably from your original serial code!
While I did rewrite this algorithm and change around variable names, I tried to keep the spirit of your implementation as best as I could. Thus, the inverse-diagonal scanning and nested loop pattern remains.
There is a better way to parallelize this algorithm that avoids the load imbalance, though. If instead you give each thread a row and have it iterate its token value (i.e., row/thread 2 places the numbers 2->6), then each thread works on exactly the same amount of numbers and you can change the pragma to schedule(static), as sketched below.
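A sketch of that balanced variant (my reading of the suggestion, using the observation that cell (r, c) lies on inverse diagonal r + c - 1, so every row can be filled independently):
#pragma omp parallel for schedule(static)
for (int r = 1; r <= row; r++) {
    for (int c = 1; c <= col; c++) {
        ar[r][c] = r + c - 1;   // the token value of the diagonal through (r, c)
    }
}
Every row writes exactly col values, so a static schedule splits the work evenly across threads.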
As I mentioned in the comments above, don't use firstprivate when you mean shared. A good rule of thumb is that all read-only variables should be shared.
It is erroneous to assume that getting correct output when running parallel code on 1 thread implies the implementation is correct. In fact, barring disastrous use of OpenMP, you are incredibly unlikely to get the wrong output with only 1 thread. Testing with multiple threads reveals that your previous implementation was not correct.
Hope this helps.
EDIT: The output I get is the same as yours for a 5x5 matrix.
#include <stdio.h>
#include <array>
#include <vector>
#include <omp.h>

std::vector<int> pNum;
std::array<int, 4> arr;
int pGen(int);

int main()
{
    pNum.push_back(2);
    pNum.push_back(3);
    pGen(10);
    for (int i = 0; i < pNum.size(); i++)
    {
        printf("%d \n", pNum[i]);
    }
    printf("total: %zu", pNum.size());
    getchar();
}
int pGen(int ChunkSize)
{
    if (pNum.size() == 50) return 0;
    int i, k, n, id;
    int state = 0;
    #pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
    for (i = 1; i < pNum.back() * pNum.back(); i++)
    {
        id = omp_get_thread_num();
        n = pNum.back() + i * 2;
        for (k = 1; k < pNum.size(); k++)
        {
            if (n % pNum[k] == 0) break;
            if (n / pNum[k] <= pNum[k])
            {
                #pragma omp critical
                {
                    if (state == 0)
                    {
                        state = 1;
                        pNum.push_back(n);
                        printf("id: %d; number: %d \n", id, n);
                        pGen(ChunkSize);
                        break;
                    }
                }
            }
        }
        if (state == 1) break;
    }
    return 0;
}
This is my code above. I am trying to find the first 50 prime numbers with each of the OpenMP scheduling kinds: dynamic, static, and guided. I started with dynamic. At some point I realized I have to use a recursive function, since I can't use a do-while loop in parallel constructs.
When I run the code above, the console opens up and closes down immediately; I can only see "id: 0; number: 5" and an error message.
The strange thing is that I never get to getchar() or to printing the vector I use to store the prime numbers. I think this has to do with the recursive function. Any other theories?
Edit: I happened to catch the error (screenshot linked in the original post).
I don't know if this is significant for your algorithm, but since you add numbers to your pNum vector during the main loop, pNum.back() will change across iterations. Therefore, the bounds of the parallelised loop change during the loop itself: for (i = 1; i < pNum.back() * pNum.back(); i++)
This isn't supported by OpenMP. Loops can only be parallelised with OpenMP if they are in Canonical Loop Form. The link explains it in detail, but for you it boils down to the bounds having to be known and fixed prior to entering the loop:
lb and b: Loop invariant expressions of a type compatible with the type of var
Therefore, your code has Undefined Behaviour. It may or may not compile, may or may not run, and can give whatever result, if any (or just reformat your hard drive).
If it is not important that pNum.back() evolves over the iterations, then you can simply evaluate it prior to the loop and use that value as the upper bound in the for statement. But if it is important, then you'll have to find another method to parallelise your loop.
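For instance, a sketch of that fix, keeping the rest of the loop body as in the question:
int last = pNum.back();   // snapshot taken once, before the loop
int ub = last * last;     // now a loop-invariant upper bound
#pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
for (i = 1; i < ub; i++)
{
    n = last + i * 2;
    // ... rest of the loop body unchanged ...
}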
Finally, a side note: this algorithm uses nested parallelism, but you didn't explicitly enable it, and since nested parallelism is disabled by default, only the outermost call to pGen() will generate OpenMP threads.
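For reference, nested parallelism has to be enabled explicitly, e.g. like this (omp_set_nested() is deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels()):
omp_set_nested(1);              // classic way; or set OMP_NESTED=true in the environment
omp_set_max_active_levels(2);   // OpenMP 5.0 way: allow two nested parallel levels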
I implemented the following code, which uses the data points in dat to calculate the distance matrix between each point and all the other points (dist). I then use this distance matrix to find the k closest points to each point in the data (smallest), and from that compute the sum of the k nearest neighbour distances (sumKnn).
The following algorithm is a parallel algorithm using OpenMP, and it's working fine. I just need suggestions to make it run faster. Any suggestion is highly appreciated.
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
size_t p, j;
ptrdiff_t i;
double* sumKnn = new double[dat.size()];
vector<vector<int> > smallest(dat.size(), vector<int>(k));
#pragma omp parallel for private(p, j, i) default(shared)
for (p = 0; p < dat.size(); ++p)
{
    int mycont = 0;
    for (j = 0; j < dat.size(); ++j)
    {
        double ecl = 0.0;
        for (i = 0; i < c; ++i)
        {
            ecl += (dat[p][i] - dat[j][i]) * (dat[p][i] - dat[j][i]);
        }
        ecl = sqrt(ecl);
        dist[p][j] = ecl;
        //dist[j][p] = ecl;
        if (mycont < k && j != p)
        {
            smallest[p][mycont] = j;
            mycont++;
        }
        else if (j != p)
        {
            double max = 0.0;
            int index = 0;
            for (int i = 0; i < smallest[p].size(); i++)
            {
                if (max < dist[p][smallest[p][i]])
                {
                    index = i;
                    max = dist[p][smallest[p][i]];
                }
            }
            if (max > dist[p][j])
            {
                smallest[p].erase(smallest[p].begin() + index);
                smallest[p].push_back(j);
            }
        }
    }
    double sum = 0.0;
    for (int r = 0; r < k; r++)
        sum += dist[p][smallest[p][r]];
    sumKnn[p] = sum;
}
This is more of a comment than an answer, but the comment box is too small, ...
One of the useful aspects of OpenMP is that you can parallelise a serial program in steps. So your first step should be to write a serial program which solves your problem. When you've done that, you could post again and ask for help on parallelising it.
To parallelise your program, find the outermost loop statement and think about how distributing the loop iterations across threads will affect the calculations. I suspect that you'll want to create a shared vector of close points as the loops go round, then sort it at the end on one thread only. Or perhaps not.
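One concrete suggestion for the code as posted: the inner rescan of smallest[p] for the current maximum makes each candidate update O(k). Keeping the k best distances in a max-heap makes each update O(log k), and the full distance matrix can be skipped entirely if only sumKnn is needed. Below is a sketch of the per-point work (sumKnnForPoint is a hypothetical helper, not the poster's interface):
#include <cmath>
#include <queue>
#include <vector>
using namespace std;

// Sum of distances from point p to its k nearest neighbours,
// where c is the number of coordinates per point.
double sumKnnForPoint(const vector<vector<double>>& dat, size_t p, size_t c, size_t k)
{
    priority_queue<double> heap;   // max-heap holding the k best distances so far
    for (size_t j = 0; j < dat.size(); ++j) {
        if (j == p) continue;
        double ecl = 0.0;
        for (size_t i = 0; i < c; ++i) {
            double d = dat[p][i] - dat[j][i];
            ecl += d * d;
        }
        ecl = sqrt(ecl);
        if (heap.size() < k)
            heap.push(ecl);
        else if (ecl < heap.top()) {   // better than the current worst candidate
            heap.pop();
            heap.push(ecl);
        }
    }
    double sum = 0.0;
    while (!heap.empty()) { sum += heap.top(); heap.pop(); }
    return sum;
}
The outer #pragma omp parallel for over p stays exactly as in the question; each call touches only its own heap, so no extra synchronisation is needed.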