optimize computing time - c++

I was experimenting with a parallel scalar product of two vectors and measuring the elapsed time.
I compared a sequential and a parallel scalar product:
seq: double scalar(int n, double x[], double y[])
for (int i=0; i<n; i++)
{
sum += x[i]*y[i];
}
parallel: double scalar_shm(int n, double x[], double y[])
#pragma omp parallel for private(i) shared(x,y) reduction(+:sum)
for (i=0; i<n; i++)
{
sum += x[i]*y[i];
}
I called these one after the other:
//sequential loop
for (int n=0; n<loops; n++)
{ scalar(vlength,x,y); }
//measure sequential time
t1 = omp_get_wtime() - tstart;
//parallel loop
for (int n=0; n<loops; n++)
{ scalar_shm(vlength,x,y); }
//measure parallel time
t2 = omp_get_wtime() - t1 - tstart;
//print the times elapsed
cout<< "total time (sequential): " <<t1 <<" sec" <<endl;
cout<< "total time (parallel ): " <<t2 <<" sec" <<endl;
In every cycle I filled the vectors with random doubles; I removed that part because I consider it irrelevant.
The output for this was:
total time (sequential): 15.3439 sec
total time (parallel ): 24.5755 sec
My question is why is the parallel one slower? What is it good for if it's slower? I expected it to be way faster, because I kind of thought that computations like this were the point of it.
note: I ran this on an Intel Core i7-740QM

You are creating and destroying a new parallel region on every call to scalar_shm, that is, once per iteration of the outer loop. This operation is slow. You could try to create the parallel region once, outside the outer loop:
//parallel loop
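// sum must be a double and must be initialized; n is declared in the loop below, so it needs no private clause
double sum = 0.0;
#pragma omp parallel reduction(+:sum)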
{
for (int n=0; n<loops; n++)
{
scalar_shm(vlength,x,y, sum);
}
}
Inside the scalar_shm function, which now receives sum by reference, the OpenMP pragma would be:
#pragma omp for private(i)
for (i=0; i<n; i++)
{
sum += x[i]*y[i];
}
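To make the idea concrete, here is a minimal self-contained sketch of the "one parallel region, many worksharing loops" pattern. The names repeated_scalar and the std::vector-based signature are my own assumptions for illustration, not the asker's actual code. Without -fopenmp the pragmas are ignored and the code simply runs serially.

```cpp
#include <vector>

// Orphaned worksharing loop: it is meant to be called by every thread of an
// enclosing parallel region. `sum` refers to the caller's per-thread
// reduction copy. (Signature is an assumption made for this sketch.)
void scalar_shm(const std::vector<double>& x, const std::vector<double>& y,
                double& sum) {
    const long n = static_cast<long>(x.size());
    #pragma omp for
    for (long i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
}

// Hypothetical driver: the thread team is created once and reused for all
// `loops` repetitions. Returns the accumulated sum of all repetitions.
double repeated_scalar(const std::vector<double>& x,
                       const std::vector<double>& y, int loops) {
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        // Every thread executes this outer loop; the inner "omp for"
        // splits the vector elements among the threads each time.
        for (int r = 0; r < loops; ++r) {
            scalar_shm(x, y, sum);
        }
    }
    return sum;
}
```

Whether this actually beats the sequential version still depends on the vector length: for short vectors the synchronization at the end of each worksharing loop can dominate.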

Related

How to parallelize two serial for loops so that the work of both loops is distributed over the threads

I have written the below code to parallelize two 'for' loops.
#include <iostream>
#include <omp.h>
#define SIZE 100
int main()
{
int arr[SIZE];
int sum = 0;
int i, tid, numt, prod;
double t1, t2;
for (i = 0; i < SIZE; i++)
arr[i] = 0;
t1 = omp_get_wtime();
#pragma omp parallel private(tid, prod)
{
tid = omp_get_thread_num();
numt = omp_get_num_threads();
std::cout << "Tid: " << tid << " Thread: " << numt << std::endl;
#pragma omp for reduction(+: sum)
for (i = 0; i < 50; i++) {
prod = arr[i]+1;
sum += prod;
}
#pragma omp for reduction(+: sum)
for (i = 50; i < SIZE; i++) {
prod = arr[i]+1;
sum += prod;
}
}
t2 = omp_get_wtime();
std::cout << "Time taken: " << (t2 - t1) << ", Parallel sum: " << sum << std::endl;
return 0;
}
In this case the execution of 1st 'for' loop is done in parallel by all the threads and the result is accumulated in sum variable. After the execution of the 1st 'for' loop is done, threads start executing the 2nd 'for' loop in parallel and the result is accumulated in sum variable. In this case clearly the execution of the 2nd 'for' loop waits for the execution of the 1st 'for' loop to get over.
I want to process the two 'for' loops simultaneously across the threads. How can I do that? Is there any other way to write this code more efficiently? Ignore the dummy work that I am doing inside the 'for' loops.
You can declare the loops nowait and move the reduction to the end of the parallel section. Something like this:
# pragma omp parallel private(tid, prod) reduction(+: sum)
{
# pragma omp for nowait
for (i = 0; i < 50; i++) {
prod = arr[i]+1;
sum += prod;
}
# pragma omp for nowait
for (i = 50; i < SIZE; i++) {
prod = arr[i]+1;
sum += prod;
}
}
If you use #pragma omp for nowait, all threads are still assigned to the first loop; a thread moves on to the second loop only once it has finished its share of the first. Unfortunately, there is no way to tell the omp for construct to use, e.g., only half of the threads.
Fortunately, there is a way to run the two loops in parallel by using tasks. The following code uses the taskloop construct with the num_tasks clause to create n/2 tasks per loop, so that roughly half of the threads can work on each loop. Note the nogroup clause on the first taskloop: without it, the implicit taskgroup around a taskloop makes the generating thread wait for the first loop's tasks before it creates the tasks of the second loop. You have to test which solution is faster in your case.
#pragma omp parallel
#pragma omp single
{
// assumes at least 2 threads, so that n/2 >= 1
int n=omp_get_num_threads();
#pragma omp taskloop num_tasks(n/2) nogroup
for (int i = 0; i < 50; i++) {
//do something
}
#pragma omp taskloop num_tasks(n/2)
for (int i = 50; i < SIZE; i++) {
//do something
}
}
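As a runnable illustration of the task-based approach, here is a small sketch. The function name sum_two_loops and the intermediate out vector are assumptions made for this example; each loop writes to its own half of out, so no reduction is needed inside the tasks. It requires a compiler with OpenMP 4.5 taskloop support; without -fopenmp the pragmas are ignored and it runs serially.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: runs the two halves of the work as two independent
// taskloops and returns the sum of (arr[i] + 1) over all elements.
int sum_two_loops(const std::vector<int>& arr) {
    const int size = static_cast<int>(arr.size());
    const int half = size / 2;
    std::vector<int> out(arr.size(), 0);

    #pragma omp parallel
    #pragma omp single
    {
        // nogroup: do not wait for these tasks before creating the next ones.
        #pragma omp taskloop nogroup
        for (int i = 0; i < half; ++i) {
            out[i] = arr[i] + 1;
        }
        #pragma omp taskloop nogroup
        for (int i = half; i < size; ++i) {
            out[i] = arr[i] + 1;
        }
    }   // the implicit barrier here waits for all generated tasks

    // Serial final sum, done after all tasks have completed.
    return std::accumulate(out.begin(), out.end(), 0);
}
```

The design choice here is to avoid a shared accumulator inside the tasks altogether; whether that or a reduction is faster is something to measure.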
UPDATE: The first paragraph is not entirely correct: by changing the chunk_size you have some control over how many threads are used in the first loop. This can be done with, e.g., the schedule(static, chunk_size) clause. So I thought setting the chunk_size would do the trick:
#pragma omp parallel
{
int n=omp_get_num_threads();
#pragma omp single
printf("num_threads=%d\n",n);
#pragma omp for schedule(static,2) nowait
for (int i = 0; i < 4; i++) {
printf("thread %d running 1st loop\n", omp_get_thread_num());
}
#pragma omp for schedule(static,2)
for (int i = 4; i < SIZE; i++) {
printf("thread %d running 2nd loop\n", omp_get_thread_num());
}
}
BUT at first the result seems surprising:
num_threads=4
thread 0 running 1st loop
thread 0 running 1st loop
thread 0 running 2nd loop
thread 0 running 2nd loop
thread 1 running 1st loop
thread 1 running 1st loop
thread 1 running 2nd loop
thread 1 running 2nd loop
What is going on? Why are threads 2 and 3 not used? The OpenMP runtime guarantees that if two separate loops have the same number of iterations and are executed by the same number of threads with static scheduling, then each thread receives exactly the same iteration ranges in both loops.
On the other hand, the result of using the schedule(dynamic,2) clause was quite surprising: only one thread was used (CodeExplorer link is here).

Why does gcc's implementation of openMP fail to parallelise a recursive function inside another recursive function

I am trying to parallelise these recursive functions with OpenMP tasks.
When I compile with gcc, the program runs on only 1 thread. When I compile it with clang, it runs on multiple threads.
The second function calls the first one, which doesn't generate new tasks, to avoid wasting time.
gcc does work when there is only one function that calls itself.
Why is this?
Am I doing something wrong in the code?
Then why does it work with clang?
I am using gcc 9.3 on windows with Msys2.
The code was compiled with -O3 -fopenmp
//the program compiled by gcc only runs on one thread
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void excalibur(int current_node, int current_depth) {
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
if(current_depth > 0){
int new_depth = current_depth - 1;
#pragma omp simd
for(int i = current_node;i <= vec[current_node];i++){
++steps;
excalibur(i + 1,new_depth);
}
}
}
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(clock() - tim) / CLOCKS_PER_SEC; // elapsed CPU time, not the start timestamp
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
while this works with gcc
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
mario(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(clock() - tim) / CLOCKS_PER_SEC; // elapsed CPU time, not the start timestamp
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry in mario, current_node is 9 and vec is all 8s, so this loop in the first and only task never executes:
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
Hence, no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.
The second code runs in parallel because of the recursive call in the loop after the task region. But it isn't a correct OpenMP program either: the specification forbids nesting task regions inside a simd construct (see the Restrictions of the simd construct):
The only OpenMP constructs that can be encountered during execution of a simd region are the atomic construct, the loop construct, the simd construct and the ordered construct with the simd clause.
Neither of the two compilers catches that problem when the nesting is in the dynamic rather than the lexical scope of the simd construct, though.
Edit: I looked into it a bit more closely and I have a suspicion about what might have caused your confusion. I guess you determine whether your program runs in parallel by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main() function spawns a team of four threads, one of them goes on to execute mario() and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to be assigned to them. They never get one, but keep on spinning anyway, and that's what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY to ACTIVE and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that CPU time is spent inside your code in one thread only.
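For contrast, here is a minimal sketch of a recursion that does generate tasks at every level and keeps them outside of any simd construct. The tree shape (every inner node has the same number of children) and the names visit and count_nodes are assumptions made up for this example; they are not a rewrite of your mario/excalibur functions.

```cpp
// Hypothetical recursion: visits a complete tree in which every node at
// depth > 0 has `branching` children, counting each visited node.
void visit(int depth, int branching, long& steps) {
    #pragma omp atomic
    steps += 1;                       // count this node

    if (depth == 0) return;
    for (int c = 0; c < branching; ++c) {
        // One task per child; the counter is shared among all tasks.
        #pragma omp task shared(steps) firstprivate(depth, branching)
        visit(depth - 1, branching, steps);
    }
    // Wait for the child tasks created by this call before returning.
    #pragma omp taskwait
}

long count_nodes(int depth, int branching) {
    long steps = 0;
    #pragma omp parallel
    #pragma omp single
    visit(depth, branching, steps);   // one thread starts the recursion
    return steps;
}
```

In a real program you would probably stop creating tasks below some cutoff depth, since one task per node is a lot of overhead for so little work per task.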

Parallelizing two for loops with OpenMP in C++ does not give better performance

I have an issue with parallelizing two for loops with OpenMP in C++. I have a member function CallFunction(i,j) which, for every i and j, sets independent member variables to specific values and returns a weighted sum of these values. Because these calls are independent for different combinations of i and j, I want to parallelize this process. I tried it in the following way:
double optimal_value = 0;
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
if(i == j) continue;
optimal_value += CallFunction(i,j);
}
}
The above code does not have a significant effect on my runtime. I achieve almost the same runtime with and without "#pragma omp parallel for". Would it be better to write the nested loop as one loop and parallelize it? I have no idea how to make it work. Do I need further commands or settings apart from enabling OpenMP?
My system is running with a dual core cpu.
Would you please help me how I have to do it right?
Many thanks in advance!
Here is the parallelization of two loops
double optimal_value = 0;
double begin = omp_get_wtime();
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
num_tr = omp_get_num_threads();
double optimal_value_in = 0.0;
#pragma omp parallel for reduction(+:optimal_value_in)
for (int j = 0; j < n; j++)
{
if((i == j)) continue;
optimal_value_in += CallFunction(i,j);
}
optimal_value += optimal_value_in;
}
double end = omp_get_wtime();
double elapsed_secs = double(end - begin);
cout<<"############# "<<"Using #Threads "<<num_tr<<endl;
cout<<"############# "<<optimal_value<<" Time For Parallel Execution :: "<<elapsed_secs<<endl;
The thing is (as others also mentioned in the comments), I am not sure you will see any speedup with just n=25 and a CallFunction body such as
double CallFunction(int i, int j){
return static_cast<double>(i)*j; // multiply as double: int i*j overflows for large n
}
With n=250000 and 8 threads I got a speedup of 4.43, so it will strongly depend on what is done in CallFunction.
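A note on the nested version: in many OpenMP implementations the inner parallel for runs on a single thread unless nested parallelism is explicitly enabled, so it may add overhead without adding parallelism. An alternative sketch is to merge both loops into one iteration space with collapse(2). The names call_function and weighted_sum below are stand-ins I made up; the real CallFunction is a member function that is not shown in the question, and it must be safe to call concurrently for this to be valid.

```cpp
// Hypothetical stand-in for the question's CallFunction(i, j).
double call_function(int i, int j) {
    return static_cast<double>(i) * j;
}

// Sums call_function(i, j) over all pairs with i != j.
double weighted_sum(int n) {
    double optimal_value = 0.0;
    // collapse(2) distributes all n*n (i, j) pairs over the threads,
    // instead of only the n iterations of the outer loop.
    #pragma omp parallel for collapse(2) reduction(+:optimal_value)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i != j) {
                optimal_value += call_function(i, j);
            }
        }
    }
    return optimal_value;
}
```

On a dual-core machine the best case is still a speedup of about 2, and only if each call does enough work to outweigh the threading overhead.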

Parallel program using openMP

I am trying to calculate the integral of 4/(1+x^2) from 0 to 1 in c++ with multi-threading using openMP.
I took a serial program (which is correct) and changed it.
My idea is:
Assume that X is the number of threads.
Divide the area beneath the function into X parts, first from 0 to 1/X, 1/X to 2/X...
Each thread will calculate its area, and I will sum it all up.
This is how I implemented it:
//N.o. of threads to do the task
cout<<"Enter num of threads"<<endl;
int num_threads;
cin>>num_threads;
int i; double x,pi,sum=0.0;
step=1.0/(double)num_steps;
int steps_for_thread=num_steps/num_threads;
cout<<"Steps for thread : "<<steps_for_thread<<endl;
//Split to threads
omp_set_num_threads(num_threads);
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
thread_id++;
if (thread_id == 1)
{
double sum1=0.0;
double x1;
for(i=0;i<num_steps/num_threads;i++)
{
x1=(i+0.5)*step;
sum1 = sum1+4.0/(1.0+x1*x1);
}
sum+=sum1;
}
else
{
double sum2=0.0;
double x2;
for(i=num_steps/thread_id;i<num_steps/(num_threads-thread_id+1);i++)
{
x2=(i+0.5)*step;
sum2 = sum2+4.0/(1.0+x2*x2);
}
sum+=sum2;
}
}
Explanation:
The i'th thread will calculate the area between i/n to (i+1)/n and add it to the sum.
The problem is that not only is the output wrong, but I also get a different output each time I run the program.
Any help will be welcomed
Thanks
You're making this problem much harder than it needs to be. One of OpenMP's goals is to not have to change your serial code. You usually only need to add some pragma statements. So you should write the serial method first.
#include <stdio.h>
double pi(int n) {
int i;
double dx, sum = 0.0, x; // sum must start at zero for the reduction
dx = 1.0/n;
#pragma omp parallel for reduction(+:sum) private(x)
for(i=0; i<n; i++) {
x = i*dx;
sum += 1.0/(1+x*x);
}
sum *= 4.0/n;
return sum;
}
int main(void) {
printf("%f\n",pi(100000000));
}
Output: 3.141593
Notice that in the function pi the only difference between the serial code and the parallel version is the statement
#pragma omp parallel for reduction(+:sum) private(x)
You should also not normally worry about setting the number of threads.
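If you do want to split the interval by hand as in the question, the sketch below shows one way it could be done without the problems of the original attempt (a shared loop counter i, unsynchronized updates of sum, and sub-ranges that do not tile [0, num_steps)). The function name pi_manual and the begin/end formulas are my own choices for illustration. Without -fopenmp it falls back to a single "thread".

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Midpoint-rule integration of 4/(1+x^2) over [0,1], with each thread
// handling one contiguous block of steps.
double pi_manual(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;

    #pragma omp parallel
    {
        int nthreads = 1, tid = 0;
#ifdef _OPENMP
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
#endif
        // Thread tid handles steps [begin, end); the blocks tile the
        // whole range without gaps or overlap.
        const long begin = num_steps * tid / nthreads;
        const long end = num_steps * (tid + 1) / nthreads;

        double local = 0.0;               // private partial sum
        for (long i = begin; i < end; ++i) {
            const double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }

        #pragma omp atomic                // one synchronized update per thread
        sum += local;
    }
    return sum * step;
}
```

This is essentially what the reduction clause does for you, which is why the one-pragma version is usually preferable.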

OpenMP/C++: number of elements in for-loop

I am doing some very simple tests with OpenMP in C++ and I encounter a problem that is probably silly, but I can't find out what's wrong. In the following MWE:
#include <iostream>
#include <ctime>
#include <vector>
#include <omp.h>
int main()
{
int nthreads=1, threadid=0;
clock_t tstart, tend;
const int nx=10, ny=10, nz=10;
int i, j, k;
std::vector<std::vector<std::vector<long long int> > > arr_par;
arr_par.resize(nx);
for (i=0; i<nx; i++) {
arr_par[i].resize(ny);
for (j = 0; j<ny; j++) {
arr_par[i][j].resize(nz);
}
}
tstart = clock();
#pragma omp parallel default(shared) private(threadid)
{
#ifdef _OPENMP
nthreads = omp_get_num_threads();
threadid = omp_get_thread_num();
#endif
#pragma omp master
std::cout<<"OpenMP execution with "<<nthreads<<" threads"<<std::endl;
#pragma omp end master
#pragma omp barrier
#pragma omp critical
{
std::cout<<"Thread id: "<<threadid<<std::endl;
}
#pragma omp for
for (i=0; i<nx; i++) {
for (j=0; j<ny; j++) {
for (k=0; k<nz; k++) {
arr_par[i][j][k] = i*j + k;
}
}
}
}
tend = clock();
std::cout<<"Elapsed time: "<<(tend - tstart)/double(CLOCKS_PER_SEC)<<" s"<<std::endl;
return 0;
}
If nx, ny and nz are equal to 10, the code runs smoothly. If I increase these numbers to 20, I get a segfault. It runs without problems sequentially or with OMP_NUM_THREADS=1, whatever the number of elements.
I compiled the damn thing with
g++ -std=c++0x -fopenmp -gstabs+ -O0 test.cpp -o test
using GCC 4.6.3.
Any thought would be appreciated!
You have a data race in your loop counters:
#pragma omp for
for (i=0; i<nx; i++) {
for (j=0; j<ny; j++) { // <--- data race
for (k=0; k<nz; k++) { // <--- data race
arr_par[i][j][k] = i*j + k;
}
}
}
Since neither j nor k is given the private data-sharing attribute, both are shared, and their values might exceed the corresponding limits when several threads increment them at once, resulting in out-of-bounds access to arr_par. The chance of several threads incrementing j or k at the same time grows with the number of iterations.
The best way to treat those cases is to simply declare the loop variables inside the for statements themselves:
#pragma omp for
for (int i=0; i<nx; i++) {
for (int j=0; j<ny; j++) {
for (int k=0; k<nz; k++) {
arr_par[i][j][k] = i*j + k;
}
}
}
The other way is to add the private(j,k) clause to the head of the parallel region:
#pragma omp parallel default(shared) private(threadid) private(j,k)
It is not strictly necessary to make i private in your case since the loop variables of parallel loops are implicitly made private. Still, if i is used somewhere else in the code, it might make sense to make it private to prevent other data races.
Also, don't use clock() to measure the time for parallel applications since on most Unix OSes it returns the total CPU time for all threads. Use omp_get_wtime() instead.
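Putting both points together, here is a small sketch of a timed fill with loop-local counters and wall-clock timing. Note the assumptions: it uses a flattened std::vector instead of your nested vectors, the names wall_seconds and timed_fill are made up for this example, and it falls back to std::chrono::steady_clock when compiled without OpenMP.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock time in seconds: omp_get_wtime() when OpenMP is enabled,
// otherwise std::chrono::steady_clock.
double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using steady = std::chrono::steady_clock;
    return std::chrono::duration<double>(steady::now().time_since_epoch()).count();
#endif
}

// Fills a flattened nx*ny*nz array with i*j + k and returns the elapsed
// wall-clock time in seconds.
double timed_fill(int nx, int ny, int nz, std::vector<long long>& arr) {
    arr.assign(static_cast<std::size_t>(nx) * ny * nz, 0);
    const double t0 = wall_seconds();
    #pragma omp parallel for
    for (int i = 0; i < nx; ++i) {
        // j and k are declared inside the loops, so each thread has its own.
        for (int j = 0; j < ny; ++j) {
            for (int k = 0; k < nz; ++k) {
                const std::size_t idx = (static_cast<std::size_t>(i) * ny + j) * nz + k;
                arr[idx] = static_cast<long long>(i) * j + k;
            }
        }
    }
    return wall_seconds() - t0;
}
```

For an array this small the measured time will be dominated by thread start-up, so use much larger sizes when comparing serial and parallel runs.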