I am trying to parallelize a depth-first traversal with OpenMP. This is the sequential algorithm:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
This is what I tried:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <queue>
#include <sstream>
#include <omp.h>
#include <fstream>
#include <vector>
using namespace std;
vector<int> output;
vector<bool> visited;
vector < vector <int> >g;
int global = 0;
void dfs(int v)
{
printf(" potoki %i",omp_get_thread_num());
//cout<<endl;
visited[v] = true;
/*for(int i =0;i<visited.size();i++){
cout <<visited[i]<< " ";
}*/
//cout<<endl;
//global++;
output.push_back(v);
int i;
//printf(" potoki %i",omp_get_num_threads());
//cout<<endl;
for (i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task shared(visited)
{
#pragma omp critical
{
dfs(g[v][i]);
}
}
}
}
}
int main(){
omp_set_num_threads(5);
int length = 1000;
int e = 4;
for (int i = 0; i < length; i++) {
visited.push_back(false);
}
int limit = (length / 2) - 1;
g.resize(length);
for (int x = 0; x < g.size(); x++) {
int p=0;
while(p<e){
int new_e = rand() % length ;
if(new_e!=x){
bool check=false;
for(int c=0;c<g[x].size();c++){
if(g[x][c]==new_e){
check=true;
}
}
if(check==false){
g[x].push_back(new_e);
p++;
}
}
}
}
ofstream fin("input.txt");
for (int i = 0; i < g.size(); i++)
{
for (int j = 0; j < g[i].size(); j++)
{
fin << g[i][j] << " ";
}
fin << endl;
}
fin.close();
/*for (int x = 0; x < g.size(); x++) {
for(int j=0;j<g[x].size();j++){
printf(" %i ", g[x][j]);
}
printf(" \n ");
}*/
double start;
double end;
start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
dfs(0);
}
}
end = omp_get_wtime();
cout << endl;
printf("Work took %f seconds\n", end - start);
cout<<global;
ofstream fout("output.txt");
for(int i=0;i<output.size();i++){
fout<<output[i]<<" ";
}
fout.close();
}
Graph "g" is generated and written to the file input.txt. The result of the program is written to the file output.txt.
But this does not work on any number of threads and is much slower.
I tried to use taskwait but in that case, only one thread works.
A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks can make that call simultaneously. Moreover, since dfs() is recursive, any top-level task has to wait for the entire recursion to finish before it can exit the critical section and allow a task in another thread to run.
You need to synchronise where it will not interfere with the recursive call, and only protect updates to shared data that does not provide its own internal synchronisation. This is the original code:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
A naive but still parallel version would be:
void dfs(int v){
#pragma omp critical
{
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
Here, the code leaves the critical section as soon as the tasks are created. The problem here is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls in parallel, they will execute one after another sequentially and not in parallel. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.
One important note is that OpenMP critical sections, just as regular OpenMP locks, are not re-entrant, so a thread could easily deadlock itself due to encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.
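For illustration, a minimal sketch of that idea, replacing the critical section with an omp_nest_lock_t (the lock name and its initialisation are mine, not part of the original code; visited and g are the globals from the question):

#include <omp.h>

omp_nest_lock_t dfs_lock;          // initialise once with omp_init_nest_lock(&dfs_lock)

void dfs(int v) {
    omp_set_nest_lock(&dfs_lock);  // re-entrant: the owning thread may acquire it again
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            #pragma omp task
            dfs(g[v][i]);
        }
    }
    omp_unset_nest_lock(&dfs_lock);  // one unset per matching set
}

If a task happens to be executed immediately by the same thread, before the enclosing lock is released, the nest lock simply increments its nesting count instead of deadlocking.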
The reason for that code being slower than sequential is that it does nothing else except traversing the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between updating visited and the loop over the unvisited neighbours:
void dfs(int v){
#pragma omp critical
visited[v] = true;
// DO SOME WORK
#pragma omp critical
{
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.
There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks, e.g., using simple modulo arithmetic as described here. It is also possible to stop creating tasks at a certain level of recursion and call a sequential version of dfs() instead.
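For example, here is a sketch of the lock-striping idea (NUM_LOCKS, the locks array and the re-check on entry are my additions, not part of the original code; visited is a vector<char> here on purpose, because std::vector<bool> packs bits and neighbouring indices would still share a word):

#include <omp.h>
#include <vector>

std::vector<std::vector<int>> g;   // the graph, as in the question
std::vector<char> visited;         // one byte per vertex, initialised to 0

const int NUM_LOCKS = 64;          // stripe count, chosen arbitrarily
omp_lock_t locks[NUM_LOCKS];       // call omp_init_lock(&locks[i]) on each before use

void dfs(int v) {
    omp_lock_t *lk = &locks[v % NUM_LOCKS];  // only vertices in the same stripe contend
    omp_set_lock(lk);
    bool first_visit = !visited[v];
    visited[v] = 1;
    omp_unset_lock(lk);
    if (!first_visit) return;      // another task already claimed this vertex

    for (int i = 0; i < g[v].size(); ++i) {
        #pragma omp task           // the locked check on entry filters repeat visits
        dfs(g[v][i]);
    }
}

Spawning unconditionally keeps the sketch free of unsynchronised reads of visited; a cheap pre-check before creating the task is possible, but then visited should be read atomically. The same function could also take a depth argument and fall back to the plain sequential dfs() below some threshold, so that the deep, cheap parts of the traversal do not pay the tasking overhead.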
void p_dfs(int v)
{
#pragma omp critical
visited[v] = true;
#pragma omp parallel for
for (int i = 0; i < graph[v].size(); ++i)
{
#pragma omp critical
if (!visited[graph[v][i]])
{
#pragma omp task
p_dfs(graph[v][i]);
}
}
}
OpenMP is good for data-parallel code, where the amount of work is known in advance. It doesn't work well for graph algorithms like this one.
If the only thing you do is what's in your code (push elements into a vector), parallelism is going to make it slower. Even if you have many gigabytes of graph data, the bottleneck is memory, not compute, so multiple CPU cores won't help. Also, if all threads push results into the same vector, you'll need synchronization. And reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.
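One common way around that synchronization, for what it's worth, is to give each thread its own output buffer and merge them once at the end; a hedged sketch (per_thread, record_visit and merge_visit_order are illustration names, and the merged order is by thread, not by visit order):

#include <omp.h>
#include <vector>

std::vector<std::vector<int>> per_thread;   // one buffer per thread; resize to
                                            // omp_get_max_threads() before the parallel region

void record_visit(int v) {
    per_thread[omp_get_thread_num()].push_back(v);   // no locking needed
}

std::vector<int> merge_visit_order() {
    std::vector<int> out;
    for (const std::vector<int> &buf : per_thread)   // single-threaded merge at the end
        out.insert(out.end(), buf.begin(), buf.end());
    return out;
}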
If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use the CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and macOS, see Grand Central Dispatch. On Linux, see cp_thread_pool_create(3), but unlike the other two I don't have any hands-on experience with it; I just found the docs.
Regardless of the thread pool implementation you end up using, you'll be able to post work to it dynamically as you traverse the graph. OpenMP also has a thread pool under the hood, but its API is not flexible enough for this kind of dynamic parallelism.
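For reference, a bare-bones sketch of the Windows API mentioned above (the callback and context here are made up for illustration; error handling is omitted):

#include <windows.h>
#include <cstdio>

static VOID CALLBACK do_work(PTP_CALLBACK_INSTANCE, PVOID context, PTP_WORK) {
    int node = *static_cast<int *>(context);
    std::printf("processing node %d on thread %lu\n", node, GetCurrentThreadId());
}

int main() {
    int node = 0;
    PTP_WORK work = CreateThreadpoolWork(do_work, &node, nullptr);
    SubmitThreadpoolWork(work);                   // can be resubmitted to run the callback again
    WaitForThreadpoolWorkCallbacks(work, FALSE);  // block until outstanding callbacks finish
    CloseThreadpoolWork(work);
    return 0;
}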
I am trying to parallelise these recursive functions with OpenMP tasks. When I compile with gcc, the program runs on only one thread; when I compile it with clang, it runs on multiple threads.
The second function calls the first one, which doesn't generate new tasks, to stop wasting time.
gcc does work when there is only one function that calls itself.
Why is this?
Am I doing something wrong in the code?
Then why does it work with clang?
I am using gcc 9.3 on Windows with MSYS2.
The code was compiled with -O3 -fopenmp
//the program compiled by gcc only runs on one thread
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void excalibur(int current_node, int current_depth) {
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
if(current_depth > 0){
int new_depth = current_depth - 1;
#pragma omp simd
for(int i = current_node;i <= vec[current_node];i++){
++steps;
excalibur(i + 1,new_depth);
}
}
}
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(tim) / CLOCKS_PER_SEC;
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
While this works with gcc:
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
mario(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(tim) / CLOCKS_PER_SEC;
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry in mario, current_node is 9 and vec is all 8s, so this loop in the first and only task never executes:
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
Hence, no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.
The second code runs in parallel because of the recursive call in the loop after the task region. But it also isn't a correct OpenMP program - the specification forbids nesting task regions inside a simd construct (see under Restrictions here):
The only OpenMP constructs that can be encountered during execution of a simd region are the atomic construct, the loop construct, the simd construct and the ordered construct with the simd clause.
Neither of the two compilers catches that problem, though, when the nesting is in the dynamic rather than the lexical scope of the simd construct.
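For what it's worth, a minimal way to make the second version conforming is to drop the simd construct from the loop that recurses, since that loop ends up creating tasks (a sketch; the rest of the program stays exactly as posted):

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node, current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    // plain loop, no simd: the recursive call below creates tasks,
    // which must not be encountered inside a simd region
    for (int i = 0; i < current_node; i++) {
        ++steps;
        mario(i, current_depth);
    }
}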
Edit: I looked into it a bit closer and have a suspicion about what might have caused your confusion. I guess you determine whether your program runs in parallel by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main() function spawns a team of four threads, one of them goes off executing mario() and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to eventually be assigned to them. They never get any, but keep spinning anyway, and that is what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY to ACTIVE and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that the CPU time is spent inside your code in one thread only.
My question pertains to nested parallelism and OpenMP. Let's start with the following single threaded code snippet:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Now let's say we want to make our calls to performAnotherTask in parallel utilizing OpenMP.
So we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
My understanding is that the calls to performAnotherTask will be performed in parallel, and by default OpenMP will try to use all available threads on your machine (perhaps this assumption is incorrect).
Let's say we now also want to parallelize the calls to performTask such that we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
How will this work? Will both the for loops still be multithreaded? Can we say anything on the number of threads each loop will use? Is there a way to enforce the inner for loop (within performTask) to only utilize a single thread while the outer for loop uses all available threads?
In your last example, the execution behavior depends on a few environment settings.
First, OpenMP does indeed support such patterns, but by default it disables parallel execution in nested parallel regions. To enable it, you must set OMP_NESTED=true or call omp_set_nested(1) in your code; then nested parallel execution is supported.
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Second, when OpenMP reaches the outer parallel region, it might grab all the available cores and assume that it can execute a thread on each of them, so you might want to reduce the number of threads at the outer level so that some cores remain available for the nested regions. Say you have 32 cores; then you could do this:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for num_threads(8)
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for num_threads(4)
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
The outer parallel region will execute using 4 threads, each of which will execute the inner region with 8 threads. Note that each of the 4 outer threads will be the master thread of one of the four concurrently executing nested parallel regions. If you want to be more flexible, you can inject the number of threads to use at each level through the environment variable OMP_NUM_THREADS. If you set OMP_NUM_THREADS=4,8, you get the same behavior as the code snippet above.
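If you want to verify which team a thread ends up in, OpenMP's level-query functions can help; here is a small probe that could, for example, be called from performAnotherTask() (the function name report_position is mine):

#include <omp.h>
#include <cstdio>

void report_position() {
    // level 1 = outer parallel region, level 2 = nested region
    std::printf("outer thread %d, inner thread %d of %d, nesting level %d\n",
                omp_get_ancestor_thread_num(1),
                omp_get_thread_num(),
                omp_get_num_threads(),
                omp_get_level());
}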
The problem with this coding pattern is that you need to carefully balance each level so as not to oversubscribe the system or create load imbalances between the nested parallel regions. An alternative solution is to use OpenMP tasks instead:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Here each of the taskloop constructs generates OpenMP tasks that are scheduled to execute on the threads created by the single parallel region in the code. The caveat is that tasks are inherently dynamic in their behavior, so you might lose locality, as you do not know where exactly in the system the tasks will execute.
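If the default task granularity turns out to be an issue, the taskloop construct also accepts a grainsize (or num_tasks) clause; for example, the inner loop could be written as follows (the value 10 is an arbitrary choice):

void performTask() {
    // Do other stuff here
    // each generated task covers at least 10 consecutive iterations
    #pragma omp taskloop grainsize(10)
    for (size_t i = 0; i < 100; ++i) {
        performAnotherTask();
    }
}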
Hey,
I'm trying to multithread the function below. I fail to get the counter properly shared among the OpenMP threads: I tried an atomic and a plain int, but neither seems to work. I'm lost; how can I solve this?
std::vector<myStruct> _myData(100);
int counter;
counter = 0;
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s); // processDataA(myStruct &data)
processDataB(s);
_myData[counter++] = s; // each thread should have unique int not going over 100 of initially allocated items in _myData
}
}
Edit: updated bad syntax / missing parts.
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s);
processDataB(s);
int temp;
#pragma omp critical
temp = counter++;
assert(temp < _myData.size());
_myData[temp] = s;
}
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. There should be some form of atomics supported by your compiler.
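For reference, the atomic capture form (available since OpenMP 3.1) would replace the critical section above like this:

int temp;
#pragma omp atomic capture
temp = counter++;        // atomically reads the old value and increments the counter

assert(temp < _myData.size());
_myData[temp] = s;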
Note that your solution is somewhat fragile, since it only works if the condition inside the loop evaluates to true fewer than 101 times. That's why I added an assertion to the code. A better solution might be:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
if (data[index].type == "xx") size++;
_myData.resize(size);
...
Then you don't need to worry about the vector size, and you don't waste memory either.
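A possible second pass to go with that counting step, still using the names from the question and an atomic capture to hand out unique slots:

std::vector<myStruct> _myData;
size_t size = 0;

// pass 1: count the matching elements
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
    if (data[index].type == "xx") size++;

_myData.resize(size);

// pass 2: process and store; every thread gets a unique slot
int pos = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < data.size(); ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s);
        processDataB(s);
        int slot;
        #pragma omp atomic capture
        slot = pos++;
        _myData[slot] = s;
    }
}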
I have a piece of code that I want to parallelize, and the OpenMP version is much slower than the serial one. What is wrong with my implementation? This is the code of the program:
#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"
using namespace std;
int main(){
int const N=1000;
int timeSteps=100;
double delta=0.0001;
double qq[N];
Chain ch(N);
ch.initCond();
for (int t=0; t<timeSteps; t++){
ch.changeQ(delta*t);
ch.calMag_i();
ch.calForce001();
}
ch.printSomething();
}
The Chain.h is
class Chain{
public:
int N;
double *q;
double *mx;
double *my;
double *force;
Chain(int const Np);
void initCond();
void changeQ(double delta);
void calMag_i();
void calForce001();
};
And the Chain.cpp is
Chain::Chain(int const Np){
this->N = Np;
this->q = new double[Np];
this->mx = new double[Np];
this->my = new double[Np];
this->force = new double[Np];
}
void Chain::initCond(){
for (int i=0; i<N; i++){
q[i] = 0.0;
force[i] = 0.0;
}
}
void Chain::changeQ(double delta){
int i=0;
#pragma omp parallel
{
#pragma omp for
for (int i=0; i<N; i++){
q[i] = q[i] + delta*i + 1.0*i/N;
}
}
}
void Chain::calMag_i(){
int i =0;
#pragma omp parallel
{
#pragma omp for
for (i=0; i<N; i++){
mx[i] = cos(q[i]);
my[i] = sin(q[i]);
}
}
}
void Chain::calForce001(){
int i;
int j;
double fij =0.0;
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp for private(j, fij)
for (i=0; i<N; i++){
force[i] = 0.0;
for (j=0; j<i; j++){
fij = my[i]*mx[j] - mx[i]*my[j];
#pragma omp critical
{
force[i] += fij;
force[j] += -fij;
}
}
}
}
double time = omp_get_wtime() - start_time;
cout <<"time = " << time <<endl;
}
So the methods changeQ() and calMag_i() are in fact faster than the serial code, but my problem is calForce001(). The execution times are:
with OpenMP: 3.939 s
without OpenMP: 0.217 s
Clearly I'm doing something wrong, or this code can't be parallelized. Any help would be useful.
Thanks in advance.
Carlos
Edit:
To clarify the question, I added calls to omp_get_wtime() to measure the execution time of calForce001(). The times for one execution are:
with OpenMP: 0.0376656 s
without OpenMP: 0.00196766 s
So the OpenMP version is about 20 times slower. I also measured the calMag_i() method:
with OpenMP: 3.3845e-05 s
without OpenMP: 9.9516e-05 s
For this method OpenMP is about 3 times faster. I hope this confirms that the latency problem is in the calForce001() method.
There are three reasons why you don't benefit from any speedup.
You have #pragma omp parallel all over your code. What this pragma does is start a "team of threads"; at the end of the block the team is disbanded, which is quite costly. Removing those and using #pragma omp parallel for instead of #pragma omp for will start the team upon first encounter and put it to sleep after each block. This made the application 4x faster for me.
You use #pragma omp critical. On most platforms this forces the use of a mutex, which is heavily contended because all threads want to write to that variable at the same time. So don't use a critical section here. You could use atomic updates, but in this case that won't make much of a difference; see the third item. Just removing the critical section improved the speed by another 3x.
Parallelism only makes sense when you have an actual workload. All of your code is too small to benefit from parallelism; there is simply too little work to win back the time lost on starting/waking/destroying the threads. If your workload were ten times this, some of the parallel for statements would make sense. But especially Chain::calForce001() will never be worth it if you have to do atomic updates.
With respect to programming style: you're programming in C++, so please use local-scope variables wherever you can. In Chain::calForce001(), for example, declare a local double fij inside the inner loop; that saves you from having to write private clauses, and compilers are smart enough to optimize it. Correct scoping allows for better optimizations.
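Putting those points together, one possible restructuring of Chain::calForce001() without any critical section could look like this, assuming a compiler with OpenMP 4.5 support for reductions over array sections (the local aliases f and n are mine):

void Chain::calForce001() {
    double start_time = omp_get_wtime();
    double *f = force;                        // local alias for the member array
    const int n = N;
    for (int i = 0; i < n; i++) f[i] = 0.0;   // forces are rebuilt from scratch each call
    #pragma omp parallel for schedule(dynamic) reduction(+ : f[:n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            double fij = my[i]*mx[j] - mx[i]*my[j];
            f[i] += fij;                      // accumulates into a per-thread private copy of f
            f[j] -= fij;
        }
    }
    double time = omp_get_wtime() - start_time;
    cout << "time = " << time << endl;
}

Each thread accumulates into its own private copy of the force array and the copies are summed once at the end, so no locking is needed inside the inner loop; schedule(dynamic) helps with the triangular load imbalance.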
I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is okay if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}
Modifying the loop counter works for most implementations of OpenMP worksharing constructs, but the program will no longer be conforming to OpenMP and there is no guarantee that the program works with other compilers.
Since the OP is OK with some extra iterations, OpenMP cancellation will be the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose. It will request termination of the worksharing construct and teleport the threads to the end of it.
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that there might be a price to pay in terms of performance, but you may want to accept this if the overall performance is better when the loop can be aborted early. Note also that cancellation has to be activated at run time by setting the OMP_CANCELLATION environment variable to true, otherwise the cancel construct is ignored.
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution would be to have an if statement that approaches the regular end of the loop as quickly as possible without executing the actual loop body:
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
Hope that helps!
Cheers,
-michael
You can't modify max_its as the standard says it must be a loop invariant expression.
What you can do, though, is using a boolean shared variable as a flag:
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel for schedule(dynamic, 1) shared(found)
for(int t = 0; t < max_its; ++t)
{
if( ! found ) {
...
}
if(some condition) {
#pragma omp atomic write
found = true; // valid to make threads exit the for?
}
}
}
Logic of this kind can also be implemented with tasks instead of a work-sharing construct. A sketch of the code would look like the following:
void algorithm(int t, bool& found) {
#pragma omp task shared(found)
{
if( !found ) {
// Do work
if ( /* condition */ ) {
#pragma omp atomic write
found = true;
}
}
} // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be picked up by a waiting thread. If one of the tasks finds a valid solution, all the others will be informed through the shared variable found.
If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, loop ends"
Otherwise (for example, if some_condition is the result of some calculation inside the loop and it's complicated to move it into the for-loop condition), then using break is clearly the right thing to do.