wrong reduction using openmp - c++

I am using two different versions of a reduction in OpenMP and I get totally different results. Which of the following is wrong?
omp_set_num_threads(t);
long long unsigned int d = 0;
#pragma omp parallel for default(none) shared(some_stuff) reduction(+:d)
for (int i=start; i< n; i++)
{
d += calc(i,some_stuff);
}
cout << d << endl;
and the second version is this:
omp_set_num_threads(t);
//reduction array
long long unsigned int* d = new long long unsigned int[t];
for(int i = 0; i < t; i++)
d[i] = 0;
#pragma omp parallel for default(none) shared(somestuff, d)
for (int i=start; i< n; i++)
{
long long unsigned dd = calc(i, somestuff);
d[omp_get_thread_num()] += dd;
}
long long unsigned int res = 0;
for(int i = 0; i < omp_get_num_threads(); i++){
res += d[i];
}
delete[] d;
cout << res << endl;

The second code is wrong. omp_get_num_threads() returns 1 when called outside a parallel region and therefore your code does not reduce all values into the final result. Since you explicitly fix the number of threads to be t, you should instead use:
for(int i = 0; i < t; i++){
res += d[i];
}
Alternatively, you could use omp_get_max_threads().
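To make the fix concrete, here is a minimal sketch that puts the manual per-thread array next to the built-in reduction. The calc function is a hypothetical stand-in for the question's calc(i, some_stuff), and the two serial fallbacks are assumptions added only so the sketch also builds when OpenMP is not enabled.

```cpp
#include <cstdint>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption): let the sketch build without OpenMP enabled.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical stand-in for the question's calc(i, some_stuff).
inline std::uint64_t calc(int i) { return static_cast<std::uint64_t>(i) * i; }

// Manual reduction: one slot per thread, summed after the parallel loop.
std::uint64_t manual_reduction(int start, int n) {
    // omp_get_max_threads() is meaningful outside a parallel region,
    // unlike omp_get_num_threads(), which returns 1 there.
    const int slots = omp_get_max_threads();
    std::vector<std::uint64_t> partial(slots, 0);

    #pragma omp parallel for
    for (int i = start; i < n; i++) {
        partial[omp_get_thread_num()] += calc(i);
    }

    std::uint64_t res = 0;
    for (int s = 0; s < slots; s++) {  // visit every slot, not just the first
        res += partial[s];
    }
    return res;
}

// Built-in reduction, as in the first version of the question.
std::uint64_t builtin_reduction(int start, int n) {
    std::uint64_t d = 0;
    #pragma omp parallel for reduction(+:d)
    for (int i = start; i < n; i++) {
        d += calc(i);
    }
    return d;
}
```

With the summation loop covering all slots, both functions should return the same value for the same range.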

Related

Why is my matrix multiplication code not working?

I am new to C++ and have written a C++ OpenMP matrix multiplication program that multiplies two 1000x1000 matrices. So far it's not running and I am having a hard time finding the bugs. I tried to figure it out for a few days but I'm stuck.
Here is my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
int N;
void Multiply()
{
//initialize matrices with random numbers
//#pragma omp for
int aMatrix[N][N], i, j;
for( i = 0; i < N; ++i)
{for( j = 0; j < N; ++j)
{aMatrix[i][j] = rand();}
}
int bMatrix[N][N], i1, j2;
for( i1 = 0; i1 < N; ++i1)
{for( j2 = 0; j2 < N; ++j2)
{bMatrix[i1][j2] = rand();}
}
//Result Matrix
int product[N][N] = {0};
//Transpose Matrix;
int BTransposed[j][i];
BTransposed[j][i] = bMatrix[i1][j2];
for (int row = 0; row < N; row++) {
for (int col = 0; col < N; col++) {
// Multiply the row of A by the column of B to get the row, column of product.
for (int inner = 0; inner < N; inner++) {
product[row][col] += aMatrix[row][inner] * BTransposed[col][inner];
}
}
}
}
int main() {
time_t begin, end;
time(&begin);
Multiply();
time(&end);
time_t elapsed = end - begin;
cout << ("Time measured: ") << endl;
cout << elapsed << endl;
return 0;
}
The transposed matrix (BTransposed) is not correctly constructed. You can solve this in the following ways:
First Option: use a for loop to create the correct BTransposed matrix.
for (int i = 0; i != N; i++)
for (int j = 0; j != N; j++)
BTransposed[i][j] = bMatrix[j][i];
Second Option (the better one): delete the BTransposed matrix completely. Where it is needed, just use the original bMatrix with the indexes i and j exchanged. For example, instead of BTransposed[col][inner] you can use bMatrix[inner][col].
You created a matrix
int BTransposed[j][i];
BTransposed[j][i] = bMatrix[i1][j2];
that has size j x i, and then you set the element at [j][i] to bMatrix[i1][j2]. You should get an error, since you can't access index [j][i]: the valid indices only go from 0 to j-1 and from 0 to i-1.
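As a rough illustration of the first option, here is a sketch that builds the transpose with a loop and then multiplies. It assumes heap-allocated, flat row-major std::vector storage instead of the question's stack arrays (three 1000x1000 int arrays are about 4 MB each, and N is never assigned in the posted code). The names Matrix, transpose and multiply are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Flat, row-major n x n matrix stored on the heap (assumption, see above).
using Matrix = std::vector<int>;

// Returns the transpose of an n x n row-major matrix.
Matrix transpose(const Matrix& m, std::size_t n) {
    Matrix t(n * n);
    for (std::size_t i = 0; i < n; i++) {
        for (std::size_t j = 0; j < n; j++) {
            t[j * n + i] = m[i * n + j];
        }
    }
    return t;
}

// Returns a * b. B is read through its transpose, so the inner loop
// walks both operands contiguously.
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    const Matrix bt = transpose(b, n);
    Matrix product(n * n, 0);
    for (std::size_t row = 0; row < n; row++) {
        for (std::size_t col = 0; col < n; col++) {
            int sum = 0;
            for (std::size_t inner = 0; inner < n; inner++) {
                sum += a[row * n + inner] * bt[col * n + inner];
            }
            product[row * n + col] = sum;
        }
    }
    return product;
}
```

Note that with values taken straight from rand(), the int products can overflow, so a wider element type or smaller inputs may be needed.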

c++ threading in openmp

Hello, I'm having a hard time with this program. I'm supposed to go through the whole data vector sequentially and sum up each of the vectors in it in parallel using OpenMP (storing the sum in solution[i]). But the program gets stuck for some reason. The input vectors I'm given aren't many, but they are very large (about 2.5M ints each). Any idea what I am doing wrong?
Here is the code. P.S.: ignore the unused minVectorSize parameter:
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
unsigned long vectorNum = data.size();
for (int i = 0; i < vectorNum; i++) {
#pragma omp parallel
{
unsigned long sum = 0;
int thread = omp_get_thread_num();
int threadnum = omp_get_num_threads();
int begin = thread * data[i].size() / threadnum;
int end = ((thread + 1) * data[i].size() / threadnum) - 1;
for (int j = begin; j <= end; j++) {
sum += data[i][j];
}
#pragma omp critical
{
solution[i] += sum;
}
}
}
}
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
unsigned long vectorNum = data.size();
for (int i = 0; i < vectorNum; i++) {
unsigned long sum = 0;
int begin = 0;
int end = data[i].size();
#pragma omp parallel for reduction(+:sum)
for (int j = begin; j < end; j++) {
sum += data[i][j];
}
solution[i] += sum;
}
}
Something like this should be more elegant and work better. Could you compile it and comment on whether it works for you?

Very slow mutex in LLVM/OpenMP

I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (10.12.3, Core i7 2.7 GHz).
In Xcode I made a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found the defines KMP_VERSION_MAJOR=5, KMP_VERSION_MINOR=0 and KMP_VERSION_BUILD=20150701; the library is libiomp5) on macOS 10.12.3 (CPU: Core i7 2.7 GHz).
On Windows I use VS2010 SP1. Additionally I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).
If I run the application in a single thread, the time difference roughly corresponds to the ratio of the processor frequencies. But with 4 threads the difference becomes substantial: the Windows program is about 70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>
static double ActionWithNumber(double number)
{
double sum = 0.0f;
for (std::uint32_t i = 0; i < 50; i++)
{
double coeff = sqrt(pow(std::abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
return sum;
}
static double TestOpenMP(void)
{
const std::uint32_t len = 4000000;
double *a;
double *b;
double *c;
double sum = 0.0;
std::mutex _mutex;
a = new double[len];
b = new double[len];
c = new double[len];
for (std::uint32_t i = 0; i < len; i++)
{
c[i] = 0.0;
a[i] = sin((double)i);
b[i] = cos((double)i);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
double k = 2.0;
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
c[i] = k*a[i] + b[i] + k;
if (c[i] > 0.0)
{
c[i] += ActionWithNumber(c[i]);
}
else
{
c[i] -= ActionWithNumber(c[i]);
}
std::lock_guard<std::mutex> scoped(_mutex);
sum += c[i];
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
double sum2 = 0.0;
for (std::uint32_t i = 0; i < len; i++)
{
sum2 += c[i];
c[i] /= sum2;
}
if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");
delete[] a;
delete[] b;
delete[] c;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const std::uint32_t steps = 5;
for (std::uint32_t i = 0; i < steps; i++)
{
sum += TestOpenMP();
}
sum /= (double)steps;
std::cout << "Elapsed time = " << sum;
return 0;
}
I deliberately use a mutex here to compare the performance of OpenMP on Mac and Windows. On Windows the function returns a time of 0.39 seconds. On the Mac it returns 25 seconds, i.e. roughly 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write the text).
In the real app, I update the values of a huge matrix (20000x20000) in random order. Each thread computes a new value and writes it to a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently, when 2 threads write to the same row, there is a long lock. At the moment I can't assign the rows to different threads, since the order of the writes is determined by the FEM elements.
So simply putting a critical section there is not an option, since it would block writes to the entire matrix.
I wrote test code that mirrors the real application.
static double ActionWithNumber(double number)
{
const unsigned int steps = 5000;
double sum = 0.0f;
for (u32 i = 0; i < steps; i++)
{
double coeff = sqrt(pow(abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
sum /= (double)steps;
return sum;
}
static double RealAppTest(void)
{
const unsigned int elementsNum = 10000;
double* matrix;
unsigned int* elements;
boost::mutex* mutexes;
elements = new unsigned int[elementsNum*3];
matrix = new double[elementsNum*elementsNum];
mutexes = new boost::mutex[elementsNum];
for (unsigned int i = 0; i < elementsNum; i++)
for (unsigned int j = 0; j < elementsNum; j++)
matrix[i*elementsNum + j] = (double)(rand() % 100);
for (unsigned int i = 0; i < elementsNum; i++) //build FEM element like Triangle
{
elements[3*i] = rand()%(elementsNum-1);
elements[3*i+1] = rand()%(elementsNum-1);
elements[3*i+2] = rand()%(elementsNum-1);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//in here set mutex for row with index = elems[j];
boost::lock_guard<boost::mutex> lockup(mutexes[i]);
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
}
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
delete[] elements;
delete[] matrix;
delete[] mutexes;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const u32 steps = 5;
for (u32 i = 0; i < steps; i++)
{
sum += RealAppTest();
}
sum /= (double)steps;
std::cout<<"Elapsed time = " << sum;
return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has a runtime system, and a manually created POSIX mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.
My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there's no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
#pragma omp critical
sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
omp_set_lock(&sumlock);
sum += c[i];
omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
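For the simple benchmark, where the only shared write is sum += c[i], a reduction clause is another option that needs no lock at all. This is a swapped-in technique rather than something measured above, and it does not cover the per-row matrix updates of the real application. A minimal sketch, with sum_with_reduction as a hypothetical name:

```cpp
#include <vector>

// Sums the elements of c with an OpenMP reduction instead of a lock.
// Each thread accumulates a private copy of sum; OpenMP combines the
// copies when the loop ends.
double sum_with_reduction(const std::vector<double>& c) {
    double sum = 0.0;
    const int len = static_cast<int>(c.size());
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < len; i++) {
        sum += c[i];
    }
    return sum;
}
```

Because the partial sums are combined in an unspecified order, the floating-point result can differ slightly from a serial sum, so a tolerance-based comparison like the one in the question is still appropriate.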
Updated: There's no problem with using an array of openmp locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx=0; idx<elementsNum; idx++)
omp_init_lock(&(sumlocks[idx]));
//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//in here set mutex for row with index = elems[j];
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
omp_set_lock(&(sumlocks[i]));
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
omp_unset_lock(&(sumlocks[i]));
}
}
for (unsigned idx=0; idx<elementsNum; idx++)
omp_destroy_lock(&(sumlocks[idx]));

"vector<bool> iterator not dereferencable" error in MSVC but works perfectly when compiled using g++

I was writing the Sieve of Eratosthenes algorithm in MSVC using a vector of bools (since I intended to make the array/vector dynamic with user input).
My code:
#include<iostream>
#include<cmath>
#include<vector>
void sieve(std::vector<bool>& prime)
{
long long size = prime.size();
long long sq = (long long)sqrt(size);
if (size >= 2)
prime[0] = prime[1] = false;
for (long long i = 2; i <= sq; ++i)
if (prime[i])
for (long long j = i*i; j <= size; j += i)
prime[j] = false;
}
int main()
{
int m, n;
std::cout << "Enter first number: ";
std::cin >> m;
std::cout << "Enter second number: ";
std::cin >> n;
std::vector<bool> prime(n, true);
sieve(prime);
for (long long i = m; i <= n; ++i)
if (prime[i])
std::cout << i << std::endl;
}
I stumbled upon a runtime error in MSVC.
But this code works perfectly when compiled using g++. I don't know what's wrong. Any help would be appreciated.
Thank you
for (long long i = m; i <= n; ++i) and for (long long j = i*i; j <= size; j += i) will both run past the end of the vector, since vector_name[vector_size] is one past the last element of the vector. This is undefined behavior, and you were unlucky that it appeared to work on g++. Some people never bother to compile with another compiler to see whether they get the same results; if you hadn't, there would have been a silent bug in your "working" code.
Change the loops to for (long long i = m; i < n; ++i) and for (long long j = i*i; j < size; j += i) and you will no longer run past the end of the vector.
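If n itself should still be tested, an alternative to shortening the loops is to give the vector n + 1 elements so that index n is valid. A minimal sketch under that assumption, with sieve_up_to as a hypothetical name:

```cpp
#include <cstddef>
#include <vector>

// Returns a table where is_prime[k] is true exactly when k is prime,
// for every k in 0..n inclusive.
std::vector<bool> sieve_up_to(std::size_t n) {
    std::vector<bool> is_prime(n + 1, true);  // n + 1 elements: index n is valid
    is_prime[0] = false;
    if (n >= 1) {
        is_prime[1] = false;
    }
    for (std::size_t i = 2; i * i <= n; ++i) {
        if (is_prime[i]) {
            // Every index touched here is at most n, so it stays in bounds.
            for (std::size_t j = i * i; j <= n; j += i) {
                is_prime[j] = false;
            }
        }
    }
    return is_prime;
}
```

The caller can then loop from m to n inclusive, as in the original main, without reading past the end.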
for (long long j = i*i
prime[j] = false;
This is your problem: j can reach the size of the vector, which is one past the last valid index.
Another thing I noticed: with n elements, prime[n] is not a valid index. The code below sizes the vector to n*n instead:
std::vector<bool> prime(n*n, true);
void sieve(std::vector<bool>& prime,int m,int n)
{
long long size = prime.size();
// long long sq = (long long)sqrt(size); you can use n for this so you don't have to make another variable.
if (size >= 2)
prime[0] = prime[1] = false;
for (long long i = 2; i < n; ++i) // start at 2, not m, so multiples of small primes are marked too
{
if (prime[i])
{
for (long long j = i*i; j < n*n; j += i)
{
prime[j] = false;
}
}
}
}
int main()
{
int m, n;
std::cout << "Enter first number: ";
std::cin >> m;
std::cout << "Enter second number: ";
std::cin >> n;
std::vector<bool> prime(n*n, true);
sieve(prime,m,n);
for (long long i = m; i <= n; ++i)
{
if (prime[i])
{
std::cout << i << std::endl;
}
}
}
This should work ;), and don't forget to include the headers.

OpenMP for matrix multiplication

I am new to OpenMP and am trying desperately to learn. I have tried to write an example code in C++ in visual studio 2012 to implement matrix multiplication. I was hoping someone with OpenMP experience could take a look at this code and help me to obtain the ultimate speed / parallelization for this:
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <random>
using namespace std;
#define NUM_THREADS 4
// Program Variables
double** A;
double** B;
double** C;
double t_Start;
double t_Stop;
int Am;
int An;
int Bm;
int Bn;
// Program Functions
void Get_Matrix();
void Mat_Mult_Serial();
void Mat_Mult_Parallel();
void Delete_Matrix();
int main()
{
printf("Matrix Multiplication Program\n\n");
cout << "Enter Size of Matrix A: ";
cin >> Am >> An;
cout << "Enter Size of Matrix B: ";
cin >> Bm >> Bn;
Get_Matrix();
Mat_Mult_Serial();
Mat_Mult_Parallel();
system("pause");
return 0;
}
void Get_Matrix()
{
A = new double*[Am];
B = new double*[Bm];
C = new double*[Am];
for ( int i=0; i<Am; i++ ){A[i] = new double[An];}
for ( int i=0; i<Bm; i++ ){B[i] = new double[Bn];}
for ( int i=0; i<Am; i++ ){C[i] = new double[Bn](); } // value-initialized so C starts at zero
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<An; j++ )
{
A[i][j]= rand() % 10 + 1;
}
}
for ( int i=0; i<Bm; i++ )
{
for ( int j=0; j<Bn; j++ )
{
B[i][j]= rand() % 10 + 1;
}
}
printf("Matrix Create Complete.\n");
}
void Mat_Mult_Serial()
{
t_Start = omp_get_wtime();
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<Bn; j++ )
{
double temp = 0;
for ( int k=0; k<An; k++ )
{
temp += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Serial Multiplication Time: " << t_Stop << " seconds" << endl;
}
void Mat_Mult_Parallel()
{
int i,j,k;
t_Start = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(i,j,k) schedule(dynamic)
for ( i=0; i<Am; i++ )
{
for ( j=0; j<Bn; j++ )
{
//double temp = 0;
for ( k=0; k<An; k++ )
{
C[i][j] += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Parallel Multiplication Time: " << t_Stop << " seconds." << endl;
}
void Delete_Matrix()
{
for ( int i=0; i<Am; i++ ){ delete [] A[i]; }
for ( int i=0; i<Bm; i++ ){ delete [] B[i]; }
for ( int i=0; i<Am; i++ ){ delete [] C[i]; }
delete [] A;
delete [] B;
delete [] C;
}
My examples are based on a matrix class I created for parallel teaching. If you are interested feel free to contact me.
There are several ways to speed up your matrix multiplication:
Storage
Use a one-dimensional array in row-major order to access the elements faster.
You can access A(i,j) with A[i * An + j]
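As a rough sketch of that storage scheme, a thin wrapper can keep the A(i, j) notation while storing everything in one contiguous block. FlatMatrix is a hypothetical name here, not the matrix class mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix: element (i, j) lives at
// data[i * cols + j], so each row is contiguous in memory.
struct FlatMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;

    FlatMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) {
        return data[i * cols + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        return data[i * cols + j];
    }
};
```

Compared with double**, this needs a single allocation and avoids one pointer dereference per access.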
Use loop invariant optimization
for (int i = 0; i < m; i ++)
for (int j = 0; j < p; j ++)
{
Scalar sigma = C(i, j);
for (int k = 0; k < n; k ++)
sigma += (*this)(i, k) * B(k, j);
C(i, j) = sigma;
}
This avoids reading and writing C(i,j) again on every iteration of the innermost loop.
Change loop order "for k <-> for i"
for (int i = 0; i < m; i ++)
for (int k = 0; k < n; k ++)
{
Scalar Aik = (*this)(i, k);
for (int j = 0; j < p; j ++)
C(i, j) += Aik * B(k, j);
}
This lets you exploit spatial data locality.
Use loop blocking/tiling
for(int ii = 0; ii < m; ii += block_size)
for(int jj = 0; jj < p; jj += block_size)
for(int kk = 0; kk < n; kk += block_size)
#pragma omp parallel for // I think this is the best place for this case
for(int i = ii; i < ii + block_size; i ++)
for(int k = kk; k < kk + block_size; k ++)
{
Scalar Aik = (*this)(i, k);
for(int j = jj; j < jj + block_size; j ++)
C(i, j) += Aik * B(k, j);
}
This can give better temporal data locality. The optimal block_size depends on your architecture and matrix size.
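The tiled loops above assume that the matrix dimensions are multiples of block_size. Here is a self-contained sketch that clamps each tile to the matrix bounds. It assumes flat row-major std::vector storage, multiply_tiled is a hypothetical name, and the OpenMP pragma is left out to keep the focus on the tiling.

```cpp
#include <algorithm>
#include <vector>

// Computes C += A * B for row-major A (m x n), B (n x p) and C (m x p),
// working on tiles of at most block_size x block_size elements.
void multiply_tiled(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int m, int n, int p, int block_size) {
    for (int ii = 0; ii < m; ii += block_size) {
        const int i_end = std::min(ii + block_size, m);  // clamp the last tile
        for (int jj = 0; jj < p; jj += block_size) {
            const int j_end = std::min(jj + block_size, p);
            for (int kk = 0; kk < n; kk += block_size) {
                const int k_end = std::min(kk + block_size, n);
                for (int i = ii; i < i_end; i++) {
                    for (int k = kk; k < k_end; k++) {
                        const double aik = A[i * n + k];
                        for (int j = jj; j < j_end; j++) {
                            C[i * p + j] += aik * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

C must be zero-initialized by the caller, since the function only accumulates into it.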
Then parallelize!
Generally, the #pragma omp parallel for should go on the outermost loop. Using two parallel loops on the two outermost loops may give better results. It depends on the architecture you use, the matrix size... You have to test!
Since matrix multiplication has a static workload, I would use a static schedule.
Moar optimization!
You can do loop nest optimization.
You can vectorize your code.
You can take a look at how BLAS does it.
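Putting the storage, loop-order and scheduling suggestions together, one possible shape for the parallel version is sketched below. It assumes flat row-major std::vector storage; multiply_parallel is a hypothetical name, and whether this beats the tiled variant depends on the machine and the matrix size.

```cpp
#include <vector>

// Computes C += A * B for row-major A (m x n), B (n x p) and C (m x p).
// The outermost loop is split across threads with a static schedule.
// Each thread writes a disjoint set of rows of C, so no locking is needed.
void multiply_parallel(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, int m, int n, int p) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = A[i * n + k];  // invariant in the inner loop
            for (int j = 0; j < p; j++) {
                C[i * p + j] += aik * B[k * p + j];
            }
        }
    }
}
```

Without OpenMP enabled the pragma is ignored and the function runs serially, which makes it easy to compare the two timings on the same code.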
I am very new to OpenMP and this code is very instructive. However I found an error in the serial version that gives it an unfair speed advantage over the parallel version.
Instead of writing C[i][j] += A[i][k]*B[k][j]; as you do in the parallel version, you have written temp += A[i][k]*B[k][j]; in the serial version. This is much faster (but doesn't help you compute the C matrix). So you're not comparing apples to apples, which makes the parallel code seem slower by comparison. When I fixed this line and ran it on my laptop (which allows 2 threads), the parallel version was almost twice as fast. Not bad!