How to add two arrays using OpenMP? - C++

I am trying to parallelize this: c[i] = a[i] + b[i]
With a plain C program I get:
Elapsed time = 1667417 nanoseconds
With OpenMP I get:
Elapsed time = 8673966 nanoseconds
I don't clearly understand why this is happening or what needs to be done to parallelize this code. I assume the addition is so simple that the parallelism isn't being exploited, but I would like to know the actual reason and any other way to parallelize this addition effectively. I also tried dynamic and guided scheduling with various chunk sizes, but the results are more or less the same.
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

#define N 100
#define BILLION 1000000000L

int main (int argc, char *argv[])
{
    int i;
    float a[N], b[N], c[N];
    uint64_t diff;              /* elapsed time */
    struct timespec start, end;

    /* Some initializations */
    #pragma omp parallel for schedule(static,10) num_threads(4)
    for (i = 0; i < N; i++) {
        a[i] = b[i] = i * 1.0;
    }

    /* Add two arrays */
    clock_gettime(CLOCK_MONOTONIC, &start);  /* mark start time */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread number:%d, c[%d] = %f\n", omp_get_thread_num(), i, c[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);    /* mark end time */

    diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("\nElapsed time = %llu nanoseconds\n", (long long unsigned int) diff);
    return 0;
}
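For reference, here is a minimal sketch that applies the usual fixes discussed in the answers to the related questions below: a much larger N (100 elements cannot amortize the cost of starting threads), no printf inside the timed region, omp_get_wtime() for wall-clock timing, and a use of the result so the compiler cannot optimize the addition away. The value of N is an arbitrary choice for illustration.

#include <stdio.h>
#include <omp.h>

#define N 10000000                       /* much larger than 100 */

int main(void)
{
    static float a[N], b[N], c[N];       /* static: too large for the stack */

    #pragma omp parallel for schedule(static) num_threads(4)
    for (int i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0f;

    double start = omp_get_wtime();      /* wall-clock timer */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];              /* no printf inside the timed region */
    double end = omp_get_wtime();

    printf("c[1] = %f\n", c[1]);         /* observable effect */
    printf("Elapsed time = %f seconds\n", end - start);
    return 0;
}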

Related

Why does gcc's implementation of openMP fail to parallelise a recursive function inside another recursive function

I am trying to parallelise these recursive functions with OpenMP tasks. When I compile with gcc, it runs on only one thread; when I compile with clang, it runs on multiple threads.
The second function calls the first one, which doesn't generate new tasks, to avoid wasting time.
gcc does work when there is only one function that calls itself.
Why is this? Am I doing something wrong in the code? Then why does it work with clang?
I am using gcc 9.3 on Windows with MSYS2. The code was compiled with -O3 -fopenmp.
//the program compiled by gcc only runs on one thread
#include <vector>
#include <omp.h>
#include <iostream>
#include <ctime>
using namespace std;

vector<int> vec;
thread_local double steps;

void excalibur(int current_node, int current_depth) {
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        excalibur(i, current_depth);
    }
    if (current_depth > 0) {
        int new_depth = current_depth - 1;
        #pragma omp simd
        for (int i = current_node; i <= vec[current_node]; i++) {
            ++steps;
            excalibur(i + 1, new_depth);
        }
    }
}

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node, current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        excalibur(i, current_depth);
    }
}

int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign(nodes, nodes - 2);
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
        total += steps;
    }
    double time_taken = (double)(clock() - tim) / CLOCKS_PER_SEC;  // elapsed time, not the raw start timestamp
    cout << fixed << total << " steps, " << fixed << time_taken << " seconds" << endl;
    return 0;
}
While this works with gcc:
#include <vector>
#include <omp.h>
#include <iostream>
#include <ctime>
using namespace std;

vector<int> vec;
thread_local double steps;

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node, current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        mario(i, current_depth);
    }
}

int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign(nodes, nodes - 2);
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
        total += steps;
    }
    double time_taken = (double)(clock() - tim) / CLOCKS_PER_SEC;  // elapsed time, not the raw start timestamp
    cout << fixed << total << " steps, " << fixed << time_taken << " seconds" << endl;
    return 0;
}
Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry into mario, current_node is 9 and vec is all 8s, so the condition i <= vec[current_node] is 9 <= 8 and this loop in the first and only task never executes:

    for (int i = current_node; i <= vec[current_node]; i++) {
        ++steps;
        mario(i + 1, new_depth);
    }

Hence, there is no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.
The second code runs in parallel because of the recursive call in the loop after the task region. But it also isn't a correct OpenMP program - the specification forbids nesting task regions inside a simd construct (see under Restrictions here):
The only OpenMP constructs that can be encountered during execution of a simd region are the atomic construct, the loop construct, the simd construct and the ordered construct with the simd clause.
Neither compiler catches that problem, though, when the nesting is in the dynamic rather than the lexical scope of the simd construct.
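For illustration, a conforming variant of mario (a sketch of mine, not code from the question) could simply drop the simd construct from the loop whose recursive calls end up creating tasks:

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node, current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    // A plain loop: the task regions created by the recursive calls
    // are no longer nested inside a simd construct.
    for (int i = 0; i < current_node; i++) {
        ++steps;
        mario(i, current_depth);
    }
}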
Edit: I actually looked a bit closer into it and I have a suspicion about what might have caused your confusion. I guess you determine whether your program works in parallel by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main() function spawns a team of four threads, one of them goes off executing mario() and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to eventually be assigned to them. They never get one, but they keep spinning anyway, and that's what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY to ACTIVE and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that CPU time is spent inside your code in one thread only.

Why is serial execution taking less time than parallel? [duplicate]

This question already has answers here:
OpenMP time and clock() give two different results
(3 answers)
Closed 3 years ago.
I have to add two vectors and compare serial performance against parallel performance.
However, my parallel code seems to take longer to execute than the serial code.
Could you please suggest changes to make the parallel code faster?
#include <iostream>
#include <time.h>
#include "omp.h"
#define ull unsigned long long
using namespace std;

void parallelAddition(ull N, const double *A, const double *B, double *C)
{
    ull i;
    #pragma omp parallel for shared(A,B,C,N) private(i) schedule(static)
    for (i = 0; i < N; ++i)
    {
        C[i] = A[i] + B[i];
    }
}

int main(){
    ull n = 100000000;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[n];
    double time_spent = 0.0;

    for (ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }

    //PARALLEL
    clock_t begin = clock();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout << "time elapsed in parallel : " << time_spent << endl;

    //SERIAL
    time_spent = 0.0;
    for (ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }
    begin = clock();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout << "time elapsed in serial : " << time_spent;
    return 0;
}
These are the results:
time elapsed in parallel : 0.824808
time elapsed in serial : 0.351246
I've read in another thread that there are factors like the spawning of threads and the allocation of resources, but I don't know what to do to get the expected result.
EDIT:
Thanks! @zulan's and @Daniel Langr's answers actually helped!
I used omp_get_wtime() instead of clock().
It turns out that clock() measures the cumulative CPU time of all threads, whereas omp_get_wtime() measures the wall-clock time elapsed between two arbitrary points.
This answer too answers this query pretty well: https://stackoverflow.com/a/10874371/4305675
Here's the fixed code:
void parallelAddition(ull N, const double *A, const double *B, double *C)
{
    ....
}

int main(){
    ....

    //PARALLEL
    double begin = omp_get_wtime();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    double end = omp_get_wtime();
    time_spent += end - begin;
    cout << "time elapsed in parallel : " << time_spent << endl;

    ....

    //SERIAL
    begin = omp_get_wtime();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = omp_get_wtime();
    time_spent += end - begin;
    cout << "time elapsed in serial : " << time_spent;
    return 0;
}
RESULT AFTER CHANGES:
time elapsed in parallel : 0.204763
time elapsed in serial : 0.351711
There are multiple factors that influence your measurements:
- Use omp_get_wtime() as @zulan suggested; otherwise, you may actually calculate combined CPU time instead of wall time.
- Threading has some overhead and typically does not pay off for short calculations. You may want to use a higher n.
- "Touch" the data in the C array before running parallelAddition. Otherwise, the memory pages are actually allocated from the OS inside parallelAddition. An easy fix since C++11: double* C = new double[n]{};. (A sketch of an alternative, parallel first-touch initialization follows below.)
I tried your program with n being 1G, and the last change reduced the runtime of parallelAddition from 1.54 to 0.94 [s] for 2 threads. The serial version took 1.83 [s]; therefore, the speedup with 2 threads was 1.95, which is pretty close to ideal.
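A minimal sketch of such a parallel first-touch initialization (my addition, not part of the original answer): each thread faults in the pages of C that it will later write, which also places them near the right threads on NUMA systems.

    // Hypothetical first-touch initialization, using the same static
    // schedule as parallelAddition so each thread touches the pages
    // it will later write.
    #pragma omp parallel for schedule(static)
    for (ull i = 0; i < n; ++i)
        C[i] = 0.0;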
Other considerations:
- Generally, if you profile something, make sure the program has some observable effect; otherwise, a compiler may optimize a lot of the code away. Your array addition has no observable effect.
- Add some form of the restrict keyword to the C parameter (see the sketch after this list). Without it, a compiler might not be able to apply vectorization.
- If you are on a multi-socket system, take care about thread affinity and NUMA effects. On my dual-socket system, the parallel version with 2 threads took 0.94 [s] (as mentioned above) when restricting the threads to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], thus 1.44 times more.
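For the restrict point, here is a sketch using the non-standard but widely supported __restrict__ extension (the exact spelling varies by compiler; ull is the macro from the question):

// __restrict__ promises the compiler that A, B and C never alias,
// which can enable vectorization of the loop body.
void parallelAddition(ull N, const double* __restrict__ A,
                      const double* __restrict__ B,
                      double* __restrict__ C)
{
    #pragma omp parallel for schedule(static)
    for (ull i = 0; i < N; ++i)
    {
        C[i] = A[i] + B[i];
    }
}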

OpenMP parallel for inside do-while

I am trying to refactor an OpenMP-based program and have encountered a terrible scalability issue. The following (obviously not very meaningful) OpenMP program seems to reproduce the problem. Of course, the tiny sample code can be rewritten as a nested for loop, and with collapse(2) almost perfect scalability can be achieved (a sketch of that rewrite appears after the timings below). However, the original program I am working on does not allow that.
Therefore, I am looking for a fix that keeps the do-while structure. From my understanding, OpenMP should be smart enough to keep the threads alive between iterations, so I expected good scalability. Why is this not the case?
#include <stdio.h>
#include <float.h>

#define MAX(a, b) (((a) > (b)) ? (a) : (b))

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;
    int iter = 0;
    do {
        #pragma omp parallel for reduction(max:max) schedule(static)
        for (int i = 1; i < N; ++i) {
            max = MAX(max, 3.3 * i);
        }
        ++iter;
    } while (iter < MAX_ITER);
    printf("max=%f\n", max);
}
I have measured the following runtimes with the Cray compiler, version 8.3.4.
OMP_NUM_THREADS=1 : 0m21.535s
OMP_NUM_THREADS=2 : 0m12.191s
OMP_NUM_THREADS=4 : 0m9.610s
OMP_NUM_THREADS=8 : 0m9.767s
OMP_NUM_THREADS=16: 0m13.571s
This seems to be similar to this question. Thanks in advance. Help is appreciated! :)
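For reference, the collapse(2) rewrite mentioned at the top of the question might look like the following sketch (my reconstruction, not code from the question):

#include <stdio.h>
#include <float.h>

#define MAX(a, b) (((a) > (b)) ? (a) : (b))

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;
    // The two loops are perfectly nested, so collapse(2) merges them
    // into a single parallel iteration space that OpenMP can divide
    // evenly among the threads.
    #pragma omp parallel for collapse(2) reduction(max:max) schedule(static)
    for (int iter = 0; iter < MAX_ITER; ++iter) {
        for (int i = 1; i < N; ++i) {
            max = MAX(max, 3.3 * i);
        }
    }
    printf("max=%f\n", max);
}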
You could go for something like this:
#include <stdio.h>
#include <float.h>
#include <omp.h>

#define MAX(a, b) (((a) > (b)) ? (a) : (b))

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;
    #pragma omp parallel reduction(max:max)
    {
        int iter = 0;
        int nbth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int myMaxIter = MAX_ITER / nbth;
        if (tid < MAX_ITER % nbth) myMaxIter++;
        int chunk = N / nbth;
        do {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (int i = 1; i < N; ++i) {
                max = MAX(max, 3.3 * i);
            }
            ++iter;
        } while (iter < myMaxIter);
    }
    printf("max=%f\n", max);
}
I'm pretty sure scalability should improve noticeably.
NB: I had to come back to this a few times, since I realised that, because the number of iterations of the outer (do-while) loop can differ between threads, it was crucial that the scheduling of the omp for loop not be static; otherwise, there was a potential deadlock at the last iteration.
I did a few tests, and I think the proposed solution is both safe and effective.

OpenMP measure time correct

I want to measure the performance of a parallel program implemented in C++ (OpenMP). The recommended way to measure time with this technology is:
double start = omp_get_wtime();
// some code here
double end = omp_get_wtime();
printf_s("Time = %.16g", end - start);
But I get a time near 0, despite waiting about 8 seconds for the program. All other methods of getting the execution time also return 0.0.
I also tried these code examples:
DWORD st = GetTickCount();
time_t time_start = time(NULL);
clock_t start = clock();
auto t1 = Clock::now();
time_t time_finish = time(NULL);
DWORD fn = GetTickCount();
clock_t finish = clock();
auto t2 = Clock::now();
All without success. The program spends a lot of time running, but the results are always zero (in both debug and release mode). If I debug step by step, the results differ from zero.
Here is my parallel #pragma directive:
#pragma omp parallel default(none) private(i) shared(nSum, nTheads, nMaxThreads, nStart, nEnd, data, modulo)
{
    #pragma omp master
    nTheads = omp_get_num_threads();
    nMaxThreads = omp_get_max_threads();

    #pragma omp for
    for (int i = nStart; i < nEnd; ++i)
    {
        #pragma omp atomic
        nSum += (power(data[i], i) * i) % modulo;
    }
}
Where is my error? Please help me; I have spent a lot of time on this problem.

optimize computing time

I was experimenting with computing the scalar product of two vectors in parallel and measuring the elapsed time. I compared the sequential and the parallel scalar product.
Sequential:

double scalar(int n, double x[], double y[])
{
    double sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += x[i]*y[i];
    }
    return sum;
}

Parallel:

double scalar_shm(int n, double x[], double y[])
{
    double sum = 0;
    int i;
    #pragma omp parallel for private(i) shared(x,y) reduction(+:sum)
    for (i = 0; i < n; i++)
    {
        sum += x[i]*y[i];
    }
    return sum;
}
I called these one after the other:
//sequential loop
for (int n=0; n<loops; n++)
{ scalar(vlength,x,y); }
//measure sequential time
t1 = omp_get_wtime() - tstart;
//parallel loop
for (int n=0; n<loops; n++)
{ scalar_shm(vlength,x,y); }
//measure parallel time
t2 = omp_get_wtime() - t1 - tstart;
//print the times elapsed
cout<< "total time (sequential): " <<t1 <<" sec" <<endl;
cout<< "total time (parallel ): " <<t2 <<" sec" <<endl;
Every cycle I filled the vectors with random doubles; I removed that part because I consider it irrelevant.
The output for this was:
total time (sequential): 15.3439 sec
total time (parallel ): 24.5755 sec
My question is: why is the parallel one slower? What is it good for if it's slower? I expected it to be much faster, because I thought computations like this were the whole point of it.
note: I ran this on an Intel Core i7-740QM
You are creating and destroying a new parallel region for each iteration, and this operation is very slow. You could try creating the parallel region outside the internal loop:
//parallel loop
double sum = 0;  // double, not int: it accumulates a dot product of doubles
#pragma omp parallel reduction(+:sum)
{
    // n is declared inside the loop, so it is already private to each thread
    for (int n = 0; n < loops; n++)
    {
        scalar_shm(vlength, x, y, sum);  // sum is now passed by reference
    }
}
Inside the scalar_shm function, the OpenMP pragma becomes a plain work-sharing loop:

#pragma omp for
for (int i = 0; i < n; i++)
{
    sum += x[i]*y[i];
}
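Putting it together, the modified scalar_shm might look like the following sketch (the by-reference sum parameter is my assumption, made to match the four-argument call above):

void scalar_shm(int n, double x[], double y[], double &sum)
{
    // Orphaned work-sharing loop: the enclosing parallel region (and the
    // reduction on sum) is created once by the caller, so each call only
    // distributes the iterations among the already-running threads.
    #pragma omp for
    for (int i = 0; i < n; i++)
    {
        sum += x[i]*y[i];
    }
}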