OpenMP implementation increasingly slow with thread count increase

I have been trying to learn to use OpenMP. However, my code seems to run more quickly in serial than in parallel.
Indeed, the more threads I use, the longer the computation takes.
To illustrate this I ran an experiment. I am trying to do the following operation:
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<N; j++) {
D[i] = pm(C[i]);
}
}
where the function pm is simply
int pm(int val) {
val++;
val--;
return val;
}
I implemented the inner loop in parallel and compared the run times as a function of the number of iterations on the inner loop (N) and the number of threads used. The code for the experiment is below.
#include <stdio.h>
#include <iostream>
#include <time.h>
#include "omp.h"
#include <fstream>
#include <cstdlib>
#include <cmath>
static long num = 1000;
using namespace std;
int pm(int val) {
val++;
val--;
return val;
}
int main() {
int i, j, k, l;
int iter = 8;
int iterT = 4;
long inum[iter];
for (i=0; i<iter; i++) inum[i] = pow(10, i);
double serial[iter][iterT], parallel[iter][iterT];
ofstream outdata;
outdata.open("output.dat");
if (!outdata) {
std::cerr << "Could not open file." << std::endl;
exit(1);
}
// Experiment start
for (l=1; l<iterT+1; l++) {
for (k=0; k<iter; k++) {
clock_t start = clock();
long int A[num], B[num];
omp_set_num_threads(l);
for (i=0; i<num; i++) A[i] = i;
for (i=0; i<num; i++){
#pragma omp parallel for schedule(static)
for (j=0; j<inum[k]; j++) {
B[i] = pm(A[i]);
}
}
clock_t finish = clock();
parallel[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
start = clock();
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<inum[k]; j++) {
D[i] = pm(C[i]);
}
}
finish = clock();
serial[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
}
}
// Experiment end
for (j=0; j<iterT; j++) {
for (i=0; i<iter; i++) {
outdata << inum[i] << " " << j + 1 << " " << serial[i][j]\
<< " " << parallel[i][j]<< std::endl;
}
}
outdata.close();
return 0;
}
The plot below shows log(T) against log(N) for each thread count.
[Figure: run times for varying number of threads and magnitude of computational task.]
(I just noticed that the legend labels for serial and parallel are the wrong way around).
As you can see, using more than one thread increases the time greatly, and the time taken grows linearly with the number of threads.
Can anyone tell me what's going on?
Thanks!

Freakish above was correct that the pm() function does nothing, and the compiler was getting confused by it.
It also turns out that the rand() function does not play well within OpenMP for loops.
After adding the function sqrt(i) (i being the loop index), I achieved the expected speedup in my code.

Related

Parallel execution taking more time than serial

I am writing code to count the pairs whose sum is even (among all pairs from 1 to 100000). I wrote one version with pthreads and one without, but the pthreads version takes more time than the serial one. Here is my serial code:
#include<bits/stdc++.h>
using namespace std;
int main()
{
long long sum = 0, count = 0, n = 100000;
auto start = chrono::high_resolution_clock::now();
for(int i = 1; i <= n; i++)
for(int j = i-1; j >= 0; j--)
{
sum = i + j;
if(sum%2 == 0)
count++;
}
cout<<"count is "<<count<<endl;
auto end = chrono::high_resolution_clock::now();
double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
return 0;
}
and here is my parallel code
#include<bits/stdc++.h>
using namespace std;
#define MAX_THREAD 3
long long cnt[5] = {0};
long long n = 100000;
int work_per_thread;
int start[] = {1, 60001, 83001, 100001};
void *count_array(void* arg)
{
int t = *((int*)arg);
long long sum = 0;
for(int i = start[t]; i < start[t+1]; i++)
for(int j = i-1; j >=0; j--)
{
sum = i + j;
if(sum%2 == 0)
cnt[t]++;
}
cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
return NULL;
}
int main()
{
pthread_t threads[MAX_THREAD];
int arr[] = {0,1,2};
long long total_count = 0;
work_per_thread = n/MAX_THREAD;
auto start = chrono::high_resolution_clock::now();
for(int i = 0; i < MAX_THREAD; i++)
pthread_create(&threads[i], NULL, count_array, &arr[i]);
for(int i = 0; i < MAX_THREAD; i++)
pthread_join(threads[i], NULL);
for(int i = 0; i < MAX_THREAD; i++)
total_count += cnt[i];
cout << "count is " << total_count << endl;
auto end = chrono::high_resolution_clock::now();
double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
return 0;
}
In the parallel code I am creating three threads and 1st thread will be doing its computation from 1 to 60000, 2nd thread from 60001 to 83000 and so on. I have chosen these numbers so that each thread gets to do approximately similar number of computations. The parallel execution takes 10.3 secs whereas serial one takes 7.7 secs. I have 6 cores and 2 threads per core. I also used htop command to check if the required number of threads are running or not and it seems to be working fine. I don't understand where the problem is.

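All the cores in the threaded version compete for cnt[]. The per-thread counters sit next to each other in memory, most likely on the same cache line, so every increment by one thread forces the other cores to reload that line.
Use a local counter inside the loop and copy the result into cnt[t] once the loop has finished.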

Why am I getting a Seg Fault when creating threads?

I am confused as to why I am getting a segmentation fault when creating and firing off threads here. It happens in the t[j] = thread(getMax, A); line and I am very confused as to why this is happening. threadMax[] is the max of each thread. getMax() returns the maximum value of an array.
#include <iostream>
#include <stdlib.h>
#include <sys/time.h>
#include <thread>
#define size 10
#define numThreads 10
using namespace std;
int threadMax[numThreads] = {0};
int num =0;
void getMax(double *A){
num += 1;
double max = A[0];
double min = A[0];
for (int i =0; i<size; i++){
if(A[i] > max){
max = A[i];
}
}
threadMax[num] = max;
}
int main(){
int max =0;
double S,E;
double *A = new double[size];
srand(time(NULL));
thread t[numThreads];
//Assign random values to array
for(int i = 0; i<size; i++){
A[i] = (double(rand()%100));
}
//create Threads
for(int j =0; j <numThreads; j++){
cout << A[j] << " " << j << "\n";
t[j] = thread(getMax, A);
}
//join threads
for(int i =0; i< numThreads; i++){
t[i].join();
}
//Find Max from all threads
for(int i =0; i < numThreads; i++){
if(threadMax[i] > max){
max = threadMax[i];
}
}
cout <<max;
delete [] A;
return 0;
}

The behavior of this code is undefined:
void getMax(double *A){
num += 1;
double max = A[0];
double min = A[0];
for (int i =0; i<size; i++){
if(A[i] > max){
max = A[i];
}
}
threadMax[num] = max;
}
The num += 1 can allow multiple threads to attempt to modify num at the same time. Worse, when num is read in the threadMax[num] = max;, threads may see values of num modified by other threads while they were running.
You need to assign each thread a number in some safe way.
Here are three ways it can fail:
Two threads do num += 1; at exactly the same time and as a result, num only increments once.
Every thread does num += 1; before any thread does threadMax[num] = max;. All threads overwrite the same entry in the array. (Which, actually, is out of bounds!)
The code crashes because its behavior is undefined.

As others have stated, your num variable is not protected from race conditions inside of getMax(), which can lead to it being corrupted, thus causing getMax() to access the threadMax[] array out of bounds.
You can avoid that by simply getting rid of that num variable altogether and pass the array index as an input parameter to std::thread instead.
Try something more like this:
#include <iostream>
#include <vector>
#include <array>
#include <thread>
#include <algorithm>
#include <cstdlib>
#include <ctime>
using namespace std;
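// named arraySize so it cannot clash with std::size under "using namespace std"
const size_t arraySize = 10;
const size_t numThreads = 10;
// std::array rather than a raw array, so threadMax.begin()/end() below compile
array<double, numThreads> threadMax = {};
void getMax(int idx, double *A){
threadMax[idx] = *max_element(A, A + arraySize);
}
int main(){
srand(time(nullptr));
vector<double> A(arraySize);
array<thread, numThreads> t;
//Assign random values to array
generate_n(A.begin(), arraySize, [](){ return double(rand() % 100); });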
/* or:
for(double &d : A){
d = double(rand() % 100);
}
*/
//create Threads
for(int j = 0; j < numThreads; ++j){
cout << A[j] << " " << j << "\n";
t[j] = thread(getMax, j, A.data());
}
//join threads
for(thread &thd : t){
thd.join();
}
//Find Max from all threads
double max = *max_element(threadMax.begin(), threadMax.end());
cout << max;
return 0;
}

Openmp - for's inside for's

I'm trying to parallelize a "for" with openmp.
However the result, parallel code vs nonparallel, is different. I believe that it is related with the definition of the sum variable outside of the loop, but I don't know how to solve the problem.
What I want is to parallelize the first "for" loop.
Edit: 1
Here is the simplest example I could find.
//g++ -o test2 test2.cpp -fopenmp
//
//
#include <cmath>
#include <iostream>
using namespace std;
double f(double i, double j)
{
return i + j;
}
int main()
{
const int size = 256;
double sum = 0;
//will use openmp
#pragma omp parallel for
for(int i = 0; i < size; i = i + 1)
{
for(int j = 0; j < size; j=j+1)
{
if(i != j)
{
sum = sum + f(i,j);
}
}
}
cout << "sum = " << sum << endl;
//not using openmp
sum = 0;
for(int i = 0; i < size; i = i + 1)
{
for(int j = 0; j < size; j=j+1)
{
if(i != j)
{
sum = sum + f(i,j);
}
}
}
cout << "sum = " << sum << endl;
}

Your problem is that sum is accessed by several threads at once. That is, when the first thread reaches
sum=sum+f(i,j);
it reads sum, does the calculation, and writes the result back to sum. If another thread arrives at that line in the meantime, it reads the old value of sum and then writes its own result, overwriting the first thread's result.
A solution would be to set
double increment=f(i,j);
#pragma omp critical
sum+=increment;
Also note that the results of your original code are not predictable and can change from run to run.

Thank you for your answer, it finally works.
The following is working code using Christoph's solution.
//g++ -o test2 test2.cpp -fopenmp
#include <cmath>
#include <iostream>
using namespace std;
double f(double i, double j)
{
return i + j;
}
int main()
{
const int size = 256;
double sum = 0;
//will use openmp
#pragma omp parallel for
for(int i = 0; i < size; i = i + 1)
{
for(int j = 0; j < size; j=j+1)
{
if(i != j)
{
double increment = f(i,j);
#pragma omp critical
sum = sum + increment;
}
}
}
cout << "sum = " << sum << endl;
//not using openmp
sum = 0;
for(int i = 0; i < size; i = i + 1)
{
for(int j = 0; j < size; j=j+1)
{
if(i != j)
{
sum = sum + f(i,j);
}
}
}
cout << "sum = " << sum << endl;
}

OpenMP for matrix multiplication

I am new to OpenMP and am trying desperately to learn. I have tried to write an example code in C++ in visual studio 2012 to implement matrix multiplication. I was hoping someone with OpenMP experience could take a look at this code and help me to obtain the ultimate speed / parallelization for this:
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <random>
using namespace std;
#define NUM_THREADS 4
// Program Variables
double** A;
double** B;
double** C;
double t_Start;
double t_Stop;
int Am;
int An;
int Bm;
int Bn;
// Program Functions
void Get_Matrix();
void Mat_Mult_Serial();
void Mat_Mult_Parallel();
void Delete_Matrix();
int main()
{
printf("Matrix Multiplication Program\n\n");
cout << "Enter Size of Matrix A: ";
cin >> Am >> An;
cout << "Enter Size of Matrix B: ";
cin >> Bm >> Bn;
Get_Matrix();
Mat_Mult_Serial();
Mat_Mult_Parallel();
system("pause");
return 0;
}
void Get_Matrix()
{
A = new double*[Am];
B = new double*[Bm];
C = new double*[Am];
for ( int i=0; i<Am; i++ ){A[i] = new double[An];}
for ( int i=0; i<Bm; i++ ){B[i] = new double[Bn];}
for ( int i=0; i<Am; i++ ){C[i] = new double[Bn]; }
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<An; j++ )
{
A[i][j]= rand() % 10 + 1;
}
}
for ( int i=0; i<Bm; i++ )
{
for ( int j=0; j<Bn; j++ )
{
B[i][j]= rand() % 10 + 1;
}
}
printf("Matrix Create Complete.\n");
}
void Mat_Mult_Serial()
{
t_Start = omp_get_wtime();
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<Bn; j++ )
{
double temp = 0;
for ( int k=0; k<An; k++ )
{
temp += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Serial Multiplication Time: " << t_Stop << " seconds" << endl;
}
void Mat_Mult_Parallel()
{
int i,j,k;
t_Start = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(i,j,k) schedule(dynamic)
for ( i=0; i<Am; i++ )
{
for ( j=0; j<Bn; j++ )
{
//double temp = 0;
for ( k=0; k<An; k++ )
{
C[i][j] += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Parallel Multiplication Time: " << t_Stop << " seconds." << endl;
}
void Delete_Matrix()
{
for ( int i=0; i<Am; i++ ){ delete [] A[i]; }
for ( int i=0; i<Bm; i++ ){ delete [] B[i]; }
for ( int i=0; i<Am; i++ ){ delete [] C[i]; }
delete [] A;
delete [] B;
delete [] C;
}

My examples are based on a matrix class I created for parallel teaching. If you are interested feel free to contact me.
There are several ways to speed up your matrix multiplication:
Storage
Use a one-dimensional array in row-major order so that elements are accessed faster.
You can access A(i,j) with A[i * An + j].
Use loop invariant optimization
for (int i = 0; i < m; i ++)
for (int j = 0; j < p; j ++)
{
Scalar sigma = C(i, j);
for (int k = 0; k < n; k ++)
sigma += (*this)(i, k) * B(k, j);
C(i, j) = sigma;
}
This avoids reading and writing C(i,j) on every iteration of the innermost loop.
Change loop order "for k <-> for j"
for (int i = 0; i < m; i ++)
for (int k = 0; k < n; k ++)
{
Scalar Aik = (*this)(i, k);
for (int j = 0; j < p; j ++)
C(i, j) += Aik * B(k, j);
}
This improves spatial data locality, because the innermost loop now walks along rows of both B and C.
Use loop blocking/tiling
for(int ii = 0; ii < m; ii += block_size)
for(int jj = 0; jj < p; jj += block_size)
for(int kk = 0; kk < n; kk += block_size)
#pragma omp parallel for // I think this is the best place for this case
for(int i = ii; i < ii + block_size; i ++)
for(int k = kk; k < kk + block_size; k ++)
{
Scalar Aik = (*this)(i, k);
for(int j = jj; j < jj + block_size; j ++)
C(i, j) += Aik * B(k, j);
}
This can use better temporal data locality. The optimal block_size depends on your architecture and matrix size.
Then parallelize!
Generally, the #pragma omp parallel for should be placed on the outermost loop. Using two parallel loops on the two outermost loops may give better results. It depends on the architecture you use, the matrix size... You have to test!
Since matrix multiplication has a static workload, I would use a static schedule.
Moar optimization !
You can do loop nest optimization.
You can vectorize your code.
You can take a look at how BLAS does it.

I am very new to OpenMP and this code is very instructive. However I found an error in the serial version that gives it an unfair speed advantage over the parallel version.
Instead of writing C[i][j] += A[i][k]*B[k][j]; as you do in the parallel version, you have written temp += A[i][k]*B[k][j]; in the serial version. This is much faster (but doesn't help you compute the C matrix). So you're not comparing apples to apples, which makes the parallel code seem slower by comparison. When I fixed this line and ran it on my laptop (which allows 2 threads), the parallel version was almost twice as fast. Not bad!

Viterbi algorithm with OpenMP

I am trying to implement the Viterbi algorithm with the help of OpenMP. So far, my test shows that the execution time of the parallel program is approximately 4 times the execution time of the sequential program. Here is my code:
#include <omp.h>
#include <stdio.h>
#include <time.h>
#define K 39 // num states
#define T 1500 // num obs sequence
int states[K];
double transition[K][K];
double emission[K][K];
double init_prob[K];
int observation[T];
using namespace std;
void generateValues()
{
srand(time(NULL));
for(int i=0; i<T; i++)
{
observation[i] = rand() % K;
}
for(int i=0; i<K; i++)
{
states[i] = i;
init_prob[i] = (double)rand() / (double)RAND_MAX;
for(int j=0;j<K;j++)
{
transition[i][j] = (double)rand() / (double)RAND_MAX;
srand(time(NULL));
emission[i][j] = (double)rand() / (double)RAND_MAX;
}
}
}
int* viterbi(int *S, double *initp, int *Y, double A[][K], double B[][K])
{
double T1[K][T];
int T2[K][T];
#pragma omp parallel for
for(int i=0; i<K; ++i)
{
T1[i][0] = initp[i];
T2[i][0] = 0;
}
for(int i=1; i<T; ++i)
{
double max, temp;
int argmax;
#pragma omp parallel for private (max, temp, argmax)
for(int j=0; j<K; ++j)
{
max = -1;
#pragma omp parallel for
for(int k=0; k<K; ++k)
{
temp = T1[k][i-1] * A[k][j] * B[k][Y[i-1]];
if(temp > max)
{
max = temp;
argmax = k;
}
}
T1[j][i] = max;
T2[j][i] = argmax;
}
}
int Z[T];
int X[T];
double max = -1, temp;
#pragma omp parallel for
for(int k=0; k<K; ++k)
{
temp = T1[k][T-1];
if(temp > max)
{
max = temp;
Z[T-1] = k;
}
}
X[T-1] = S[Z[T-1]];
for(int i=T-1; i>0; --i)
{
Z[i-1] = T2[Z[i]][i];
X[i-1] = S[Z[i-1]];
}
return X;
}
int* viterbiNoOmp(int *S, double *initp, int *Y, double A[][K], double B[][K]) // the same as before, minus the #pragma omp
int main()
{
clock_t tStart;
int *path;
generateValues();
double sumOmp = 0;
for(int i=0;i<6;i++)
{
double start = omp_get_wtime();
path = viterbi(states, init_prob, observation, transition, emission);
double end = omp_get_wtime();
sumOmp += end - start;
}
double sumNoOmp = 0;
for(int i=0;i<6;i++)
{
tStart = clock();
path = viterbiNoOmp(states, init_prob, observation, transition, emission);
sumNoOmp += ((double)(clock() - tStart)/CLOCKS_PER_SEC);
}
for (int i=0;i<T;i++)
{
printf("%d, ", path[i]);
}
printf("\n\ntime With Omp: %f\ntime without Omp: %f", sumOmp/6, sumNoOmp/6);
return 0;
}
What am I doing wrong?

First of all, you used the omp_get_wtime() function for your first measurement and clock() for your second. The two are not comparable: omp_get_wtime() returns wall-clock time, while clock() returns processor time.
Use omp_get_wtime() for both and you'll see a little improvement.
Secondly, instead of using sumNoOmp += ((double)(clock() - tStart)/CLOCKS_PER_SEC);
just use sumNoOmp = ((double)(clock() - tStart)/CLOCKS_PER_SEC); (in that case the result holds a single run, so it should no longer be divided by 6).
Now let's move on to your code:
Trying to parallelize nested loops is a little tricky.
Try using #pragma omp parallel for only on the outer loop and see what the result is.