I have a C++ program that multiplies 2 matrices. I have to use OpenMP. This is what I have so far. https://pastebin.com/wn0AXFBG
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
int n = 1;
int Matrix1[1000][100];
int Matrix2[100][2];
int Matrix3[1000][2];
int sum = 0;
ofstream fr("rez.txt");
double t1 = omp_get_wtime();
omp_set_num_threads(n);
#pragma omp parallel for collapse(2) num_threads(n)
for ( int i = 0; i < 10; i++) {
for ( int j = 0; j < 10; j++) {
Matrix1[i][j] = i * j;
}
}
#pragma omp simd
for (int i = 0; i < 100; i++) {
for (int j = 0; j < 2; j++) {
int t = rand() % 100;
if (t < 50) Matrix2[i][j] = -1;
if (t >= 50) Matrix2[i][j] = 1;
}
}
#pragma omp parallel for collapse(3) num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
for (int cj = 0; cj < 2; cj++) {
for (int i = 0; i < 100; i++) {
if(i==0) Matrix3[ci][cj] = 0;
Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
}
}
}
double t2 = omp_get_wtime();
double time = t2 - t1;
fr << time;
return 0;
}
The problem is that I get the same execution times whether I use 1 thread or 8 (pictures of the timings are attached).
I have to show that the time is reduced by a factor of close to 8. I am using the Intel C++ compiler with OpenMP enabled. Please advise.
First of all, I think there is a small bug in your program: when you initialize Matrix1 with Matrix1[i][j] = i * j, the loop indices i and j only run up to 10, not up to 1000 and 100 respectively, so most of Matrix1 is left uninitialized.
Also, I am not sure whether your computer actually has 8 logical cores.
If it does not, the runtime will still create 8 threads, but some logical cores will have to context-switch between several threads, which hurts performance and increases execution time. So check how many logical cores are actually available and pass a value less than or equal to that number to num_threads().
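For reference, here is a minimal standalone sketch (not part of your program) that prints how many logical cores the OpenMP runtime sees and how many threads a parallel region actually gets; omp_get_num_procs() and omp_get_num_threads() are standard OpenMP routines:
#include <omp.h>
#include <iostream>
int main() {
    // Number of logical processors available to the OpenMP runtime.
    std::cout << "logical cores: " << omp_get_num_procs() << "\n";
    #pragma omp parallel
    {
        // Let only one thread report the team size.
        #pragma omp single
        std::cout << "threads in parallel region: " << omp_get_num_threads() << "\n";
    }
    return 0;
}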
Now, coming to the question: the collapse clause fuses the listed loops into a single loop and distributes that fused iteration space among the threads. With collapse(3), iterations that share the same (ci, cj) but have different i can land on different threads, so the concurrent updates of Matrix3[ci][cj] are a race condition (the same is true if you parallelize only the innermost loop); some synchronization mechanism, such as atomic or a reduction clause, would be needed to ensure correctness.
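For illustration only, here is a sketch of how the innermost loop could be parallelized safely, using a scalar reduction on a local sum (in practice, parallelizing the outer loop as suggested below is the better choice for these sizes):
// Sketch: parallelize only the dot-product loop; the reduction removes the race.
for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        int dot = 0;
        #pragma omp parallel for reduction(+:dot) num_threads(n)
        for (int i = 0; i < 100; i++) {
            dot += Matrix1[ci][i] * Matrix2[i][cj];
        }
        Matrix3[ci][cj] = dot;
    }
}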
I am pretty sure that you can parallelize the outer loop without any race condition and get a speedup close to the number of threads you employ (again, as long as the number of threads is less than or equal to the number of logical cores), and I would suggest changing that segment of your code as below.
// You can also use this function to set the number of threads:
// omp_set_num_threads(n);
#pragma omp parallel for num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        Matrix3[ci][cj] = 0;  // initialize the accumulator once instead of testing i == 0
        for (int i = 0; i < 100; i++) {
            Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
        }
    }
}
Related
I aim to compute a simple N-body program in C++ and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
S[i] += K*Weight[j];
S[j] -= K*Weight[i];
}
}
The issue here is that I do not obtain exactly the same result when removing the #pragma ... I am guessing it has to do with the fact that the second loop depends on the integer i, but I don't see how to get past that issue.
The problem is that there is a data race when updating S[i] and S[j]: different threads may read from and write to the same element of the array at the same time. The updates therefore need to be atomic operations (add #pragma omp atomic) to avoid the data race and ensure memory consistency:
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
#pragma omp atomic
S[i] += K*Weight[j];
#pragma omp atomic
S[j] -= K*Weight[i];
}
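If your compiler supports OpenMP 4.5 array-section reductions, an alternative sketch (assuming N is the number of elements in S) gives every thread a private copy of S that is summed at the end, which avoids the per-update atomics at the cost of extra memory per thread:
// Sketch: each thread accumulates into a private copy of S; the copies are
// combined element-wise when the loop finishes (requires OpenMP 4.5 or newer).
#pragma omp parallel for reduction(+ : S[0:N])
for (int i = 0; i < N; ++i)
{
    for (int j = 0; j < i; ++j)
    {
        double K = Coordinate[i] - Coordinate[j];
        S[i] += K * Weight[j];
        S[j] -= K * Weight[i];
    }
}
Whether this beats the atomic version depends on N and the number of threads, since every thread has to initialize and later merge a full copy of S.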
I want to compute the minimum value of every row of a matrix in parallel using OpenMP in C++ as follows:
// matrix Distf (float) of size n by n is declared before.
vector<float> minRows;
#pragma omp parallel for
for (i=0; i < n; ++i){
float minValue = Distf[i][0];
#pragma omp parallel for reduction(min : minValue)
for (j=1; j < n; ++j){
if (Distf[i][j] < minValue){
minValue = Distf[i][j];
}
}
minRows.push_back(minValue);
}
So far the compiler does not raise any error, but I wonder if this would give the correct answer as expected? Thanks.
Here is what we talked about in the comments, written up as an answer. Since I had to write some boilerplate anyway, I used int as the element type and avoided thinking about floating-point issues at all:
#include <vector>
#include <iostream>
using namespace std;
int main(){
constexpr size_t n = 3;
// dummy Distf (int) declared in lieu of matrix Distf
int Distf[n][n] = {{1,2,3},{6,5,4},{7,8,8}};
//could be an array<int,n> instead
vector<int> minRows(n);
#pragma omp parallel for
for (size_t i = 0; i < n; ++i){
int minValue = Distf[i][0];
// Alain Merigot argues this is a performance drag
//#pragma omp parallel for reduction(min : minValue)
for (size_t j = 1; j < n; ++j){
if (Distf[i][j] < minValue){
minValue = Distf[i][j];
}
}
//minRows.push_back(minValue) is a race condition!
minRows[i] = minValue;
}
int k = 0;
for(auto el: minRows){
cout << "row " << k++ << ": " << el << '\n';
}
cout << '\n';
}
The inner loop normally doesn't need to be parallelized. I don't know how many cores you can use, but unless you're on a massively parallel system (think GPU-level parallelism), the outer loop should either already utilize all available cores, or the problem just isn't big enough to matter. Starting more threads in either situation is a pessimization.
I have an issue with parallelizing two for loops with OpenMP in C++. I have a member function CallFunction(i,j) which, for every i and j, sets independent member variables to specific values and returns a weighted sum of these values. Because these computations are independent for different combinations of i and j, I want to parallelize this process. I tried it in the following way:
double optimal_value = 0;
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
if(i == j) continue;
optimal_value += CallFunction(i,j);
}
}
The above code does not have a significant effect on my runtime; I achieve almost the same runtime with and without "#pragma omp parallel for". Would it be better to write the nested loops as one loop and parallelize that? I have no idea how to make it work. Do I need further directives or settings besides enabling OpenMP?
My system has a dual-core CPU.
Could you please help me get this right?
Many thanks in advance!
Here is a parallelization of both loops:
double optimal_value = 0;
int num_tr = 0;  // number of threads in the outer parallel region
double begin = omp_get_wtime();
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    num_tr = omp_get_num_threads();
    double optimal_value_in = 0.0;
    // Unless nested parallelism is enabled, this inner region runs with a
    // single thread and only performs the reduction into optimal_value_in.
    #pragma omp parallel for reduction(+:optimal_value_in)
    for (int j = 0; j < n; j++)
    {
        if (i == j) continue;
        optimal_value_in += CallFunction(i, j);
    }
    optimal_value += optimal_value_in;
}
double end = omp_get_wtime();
double elapsed_secs = end - begin;
cout << "############# " << "Using #Threads " << num_tr << endl;
cout << "############# " << optimal_value << " Time For Parallel Execution :: " << elapsed_secs << endl;
The thing here is (as others also mentioned in the comments) that I am not sure you will see any speedup with just n = 25 and a CallFunction body as trivial as
double CallFunction(int i, int j){
return i*j;
}
With n = 250000 and 8 threads, I got a speedup of 4.43, so it will strongly depend on what CallFunction actually does.
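If nested parallel regions feel awkward, a simpler sketch of the same idea is to let OpenMP fuse the two loops with collapse(2) and keep a single reduction (the diagonal test is written as an if so the loops stay perfectly nested):
double optimal_value = 0;
#pragma omp parallel for collapse(2) reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
    {
        // Skip the diagonal; everything else is accumulated into the reduction.
        if (i != j)
            optimal_value += CallFunction(i, j);
    }
}
With collapse(2) the whole n*n iteration space is distributed at once, which helps mainly when n is small compared to the number of threads.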
I am working with OpenMP in order to obtain an algorithm with a near-linear speedup.
Unfortunately I noticed that I could not get the desired speedup.
So, in order to understand the error in my code, I wrote another code, an easy one, just to double-check that the speedup was in principle obtainable on my hardware.
This is the toy example I wrote:
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"
int main () {
int number_of_threads = 1;
int n = 600;
int m = 50;
int N = n/number_of_threads;
int time_limit = 600;
double total_clock = omp_get_wtime();
int time_flag = 0;
#pragma omp parallel num_threads(number_of_threads)
{
int thread_id = omp_get_thread_num();
int iteration_number_local = 0;
double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);
while (time_flag == 0){
for (int i = 0; i < N; i++)
for(int z = 0; z < m; z++)
for(int x = 0; x < n; x++)
for(int c = 0; c < n; c++){
CD[c] = C[z]*D[x];
C[z] = CD[c] + D[x];
}
iteration_number_local++;
if ((omp_get_wtime() - total_clock) >= time_limit)
time_flag = 1;
}
#pragma omp critical
std::cout<<"I am "<<thread_id<<" and I got" <<iteration_number_local<<"iterations."<<std::endl;
}
}
I want to highlight again that this code is only a toy example to try to observe the speedup: the first for loop becomes shorter as the number of parallel threads increases (since N decreases).
However, while the number of iterations doubles as expected when I go from 1 to 2-4 threads, this is not the case when I use 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is wrong, especially for a small number of iterations.
1 thread: 3 iterations.
3 reported iterations actually means that 2 iterations finished in less than 120 s. The third one took longer. The time of 1 iteration is between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s. The time of 1 iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s. The time of 1 iteration is between 2.9 and 3 s.
As you can see your results actually do not contradict linear speedup.
It would be much simpler and more accurate to execute and time a single, fixed-size outer loop; you would then likely see an almost perfect linear speedup.
Some (non-exhaustive) reasons why you don't see linear speedup are:
Memory-bound performance. Not the case in your toy example (n = 600). More generally: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example I would say that your approach to OpenMP can be improved by making better use of the high-level work-sharing constructs, e.g. #pragma omp for.
More general advice would be too broad for this format and would require more specific information about the non-toy example.
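As a rough sketch of that suggestion (reusing n, m and the C, D, CD arrays from the toy example; the structure below is an assumption, not a drop-in replacement), time a fixed amount of work and compare the wall-clock times directly:
double t_start = omp_get_wtime();
#pragma omp parallel num_threads(number_of_threads)
{
    // Per-thread working arrays, as in the original toy example.
    double *C  = new double[n]; std::fill(C,  C + n, 3.0);
    double *D  = new double[n]; std::fill(D,  D + n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD + n, 0.0);
    #pragma omp for
    for (int i = 0; i < n; i++)      // fixed total work, shared among the threads
        for (int z = 0; z < m; z++)
            for (int x = 0; x < n; x++)
                for (int c = 0; c < n; c++) {
                    CD[c] = C[z] * D[x];
                    C[z]  = CD[c] + D[x];
                }
    delete[] C; delete[] D; delete[] CD;
}
double t_elapsed = omp_get_wtime() - t_start;
std::cout << "elapsed: " << t_elapsed << " s" << std::endl;
The serial time divided by the parallel time then gives the speedup directly, with no need to reason about partially finished iterations inside a time window.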
You make some declarations inside the parallel region, which means the memory will be allocated and filled number_of_threads times. Instead I recommend:
double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);
// Note: firstprivate copies only the pointer values, so all threads still
// share the same underlying arrays and must not write to them concurrently.
#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}
Your hardware can run only a limited number of threads at the same time, which depends on the number of cores in your processor. You may have 2 or 4 cores.
A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or
#pragma omp parallel
{
#pragma omp for
{
}
}
This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster.
Be careful:
#pragma omp parallel
{
for
{
}
}
will make every thread execute the entire for loop, which will not speed up your program.
You should try
int iteration_number = 0;  // shared result, written by thread 0 below
#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C = new double[n]; std::fill(C, C+n, 3.0);
    double *D = new double[n]; std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);
    while (time_flag == 0){
        #pragma omp for
        for (int i = 0; i < N; i++)
            for(int z = 0; z < m; z++)
                for(int x = 0; x < n; x++)
                    for(int c = 0; c < n; c++)
                        CD[c] = C[z]*D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }
    if(thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout<<"Iterations= "<<iteration_number<<std::endl;
}
I have a school task about parallel programming and I'm having a lot of problems with it.
My task is to create a parallel version of the given matrix multiplication code and test its performance (and yes, it has to be in KIJ order):
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
This is what I came up with so far:
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
#pragma omp parallel
{
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
And that's where I found something confusing. This parallel version of the code runs around 50% slower than the non-parallel one. The difference in speed varies only a little bit with the matrix size (I tested SIZE = 128, 256, 512, 1024, 2048, and various schedule variants - dynamic, static, without it at all, etc. so far).
Can someone help me understand what I am doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster with OpenMP?
EDIT:
I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is an Intel Core i5-2520M @ 2.50 GHz (yes, it's a laptop, but I'm getting the same results on my home i7 PC).
I'm using global arrays:
float matrix_a[SIZE][SIZE];
float matrix_b[SIZE][SIZE];
float matrix_r[SIZE][SIZE];
I'm assigning random float values to matrices a and b; matrix r is filled with 0s.
I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048, etc.); some of them are intentionally too large to fit in cache.
My current version of code looks like this:
void multiply_matrices_KIJ()
{
#pragma omp parallel
{
for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
}
And just to be clear: I know that with a different ordering of the loops I could get better results, but that is the thing - I HAVE TO use the KIJ order. My task is to parallelize the KIJ loops and check the performance increase. My problem is that I expected at least somewhat faster execution (what I get now is between 5-10% faster at most), even though it's the I loop that runs in parallel (I can't do that with the K loop, because I would get incorrect results, since it's matrix_r[i][j] being accumulated).
These are the results I'm getting with the code shown above (I run the calculation hundreds of times and average the time):
SIZE = 128
Serial version: 0.000608 s
Parallel I, schedule(dynamic, 16): 0.000683 s
Parallel I, schedule(static, 16): 0.000647 s
Parallel J, no schedule: 0.001978 s (this is where I expected way slower execution)

SIZE = 256
Serial version: 0.005787 s
Parallel I, schedule(dynamic, 16): 0.005125 s
Parallel I, schedule(static, 16): 0.004938 s
Parallel J, no schedule: 0.013916 s

SIZE = 1024
Serial version: 0.930250 s
Parallel I, schedule(dynamic, 16): 0.865750 s
Parallel I, schedule(static, 16): 0.823750 s
Parallel J, no schedule: 1.137000 s
Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it, because I consider that order suboptimal for several reasons. Instead, I'll try to give some advice on how to improve the order (and parallelize it).
Loop order
OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.
You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the r_matrix indices as outer loop index in order to avoid race conditions when writing to the result matrix.
The next thing is that you want to traverse the matrices in memory storage order (having the faster-changing index as the second, not the first, subscript).
You can achieve both with the following loop/index order:
for i = 0 to a_rows
    for k = 0 to a_cols
        for j = 0 to b_cols
            r[i][j] += a[i][k]*b[k][j]
where
j changes faster than i or k, and k changes faster than i;
i is a result matrix subscript, so the i loop can run in parallel.
Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.
I did some short tests and the code I used to compare the timings is:
template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows,
std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows,
std::size_t const b_cols, T * const matrix_r)
{
for (std::size_t k = 0; k < a_cols; k++)
{
for (std::size_t i = 0; i < a_rows; i++)
{
for (std::size_t j = 0; j < b_cols; j++)
{
matrix_r[i*b_cols + j] +=
matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
}
}
}
}
mimicking your multiply_matrices_KIJ() function, versus
template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
for (std::size_t i = 0; i < a_rows; ++i)
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < a_cols; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
}
implementing the above mentioned order.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:
template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
#pragma omp parallel
{
auto ar = static_cast<std::ptrdiff_t>(a_rows);
#pragma omp for schedule(static) nowait
for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
for (std::size_t i = 0; i < a_rows; ++i)
#endif
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < b_rows; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
#if defined(_OPENMP)
}
#endif
}
where each thread writes to its own set of result rows.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
mm_opt_par(): 0.968325s.
Not perfect scaling, but it is a start and already faster than the serial code.
OpenMP implementations usually create a thread pool (although a thread pool is not mandated by the OpenMP standard, every implementation of OpenMP I have seen does this), so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier between parallel regions, so all threads have to synchronize there, and there is probably some additional overhead in the fork-join model between parallel regions. So even though the threads don't have to be recreated, they still have to be initialized between parallel regions. More details can be found here.
In order to avoid the overhead of entering a parallel region once per k iteration, I suggest creating the parallel region around the outermost loop but doing the work sharing on the inner loop over i, like this:
void multiply_matrices_KIJ() {
    // One parallel region spans the whole k loop; schedule(static) assigns the
    // same i ranges to the same threads in every k iteration, so nowait is safe.
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
There is an implicit barrier when using #pragma omp for. The nowait clause removes the barrier.
Also make sure you compile with optimization enabled. There is little point in comparing performance with optimization turned off. I would use -O3.
Always keep in mind that for caching purposes, the optimal ordering of your loops is slowest-changing -> fastest-changing index; in your case, that means I,K,J order. I would be quite surprised if your serial code were not automatically reordered from KIJ to IKJ by your compiler (assuming you use "-O3"). However, the compiler cannot do this with your parallel loop, because that would break the logic you are declaring within your parallel region.
If you truly cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension. However, I am still doubtful you will be able to beat your serial code by much, because of the aforementioned caching issues inherent in the KIJ ordering...
void multiply_matrices_KIJ()
{
    // The parallel region encloses the whole k loop, but the work sharing is on
    // the i loop so that no two threads update the same matrix_r[i][j].
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
    {
        #pragma omp for
        for (int i = 0; i < SIZE; i++)
            #pragma omp simd
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}