OpenMP reduction with Eigen::VectorXd - c++

I am attempting to parallelize the loop below with an OpenMP reduction:
#define EIGEN_DONT_PARALLELIZE
#include <iostream>
#include <cmath>
#include <string>
#include <eigen3/Eigen/Dense>
#include <eigen3/Eigen/Eigenvalues>
#include <omp.h>
using namespace Eigen;
using namespace std;
VectorXd integrand(double E)
{
VectorXd answer(500000);
double f = 5.*E + 32.*E*E*E*E;
for (int j = 0; j !=50; j++)
answer[j] =j*f;
return answer;
}
int main()
{
omp_set_num_threads(4);
double start = 0.;
double end = 1.;
int n = 100;
double h = (end - start)/(2.*n);
VectorXd result(500000);
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
#pragma omp for nowait
for (int j = 1; j <= n; j++){
E = start + (2*j - 1.)*h;
result = result + 4.*integrand(E);
if (j != n){
E = start + 2*j*h;
result = result + 2.*integrand(E);
}
}
}
for (int i=0; i <50 ; ++i)
cout<< i+1 << " , "<< result[i] << endl;
return 0;
}
This is definitely faster in parallel than without, but with all 4 threads, the results are hugely variable. When the number of threads is set to 1, the output is correct.
I would be most grateful if someone could assist me with this...
I am using the clang compiler with compile flags:
clang++-3.8 energy_integration.cpp -fopenmp=libiomp5
If this is a bust, then I'll have to learn to implement Boost::thread, or std::thread...

Your code does not define a custom reduction for OpenMP to reduce the Eigen objects. I'm not sure if clang supports user defined reductions (see OpenMP 4 spec, page 180). If so, you can declare a reduction and add reduction(+:result) to the #pragma omp for line. If not, you can do it yourself by changing your code as follows:
VectorXd result(500000); // This is the final result, not used by the threads
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
// This is a private copy per thread. This resolves race conditions between threads
VectorXd resultPrivate(500000);
resultPrivate.fill(0.);
#pragma omp for nowait private(E) // reduction(+:result) // Assuming user-defined reductions aren't allowed
for (int j = 1; j <= n; j++) {
E = start + (2 * j - 1.)*h;
resultPrivate = resultPrivate + 4.*integrand(E);
if (j != n) {
E = start + 2 * j*h;
resultPrivate = resultPrivate + 2.*integrand(E);
}
}
#pragma omp critical
{
// Here we sum the results of each thread one at a time
result += resultPrivate;
}
}
The error you're getting (in your comment) seems to be due to a size mismatch. While there isn't an obvious one in your code itself, don't forget that when OpenMP starts each thread, it has to initialize a private VectorXd per thread. If none is supplied, the default would be VectorXd() (with a size of zero). When this object is then used, the size mismatch occurs. A "correct" usage of omp declare reduction would include the initializer part:
#pragma omp declare reduction (+: VectorXd: omp_out=omp_out+omp_in)\
initializer(omp_priv=VectorXd::Zero(omp_orig.size()))
omp_priv is the name of the private variable. It gets initialized by VectorXd::Zero(...). The size is specified using omp_orig. The standard
(page 182, lines 25-27) defines this as:
The special identifier omp_orig can also appear in the initializer-clause and it will refer to the storage of the original variable to be reduced.
In our case (see full example below), this is result. So result.size() is 500000 and the private variable is initialized to the correct size.
#include <iostream>
#include <string>
#include <Eigen/Core>
#include <omp.h>
using namespace Eigen;
using namespace std;
VectorXd integrand(double E)
{
VectorXd answer(500000);
double f = 5.*E + 32.*E*E*E*E;
for (int j = 0; j != 50; j++) answer[j] = j*f;
return answer;
}
#pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in)\
initializer(omp_priv=VectorXd::Zero(omp_orig.size()))
int main()
{
omp_set_num_threads(4);
double start = 0.;
double end = 1.;
int n = 100;
double h = (end - start) / (2.*n);
VectorXd result(500000);
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel for reduction(+:result) private(E)
for (int j = 1; j <= n; j++) {
E = start + (2 * j - 1.)*h;
result += (4.*integrand(E)).eval();
if (j != n) {
E = start + 2 * j*h;
result += (2.*integrand(E)).eval();
}
}
for (int i = 0; i < 50; ++i)
cout << i + 1 << " , " << result[i] << endl;
return 0;
}
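For reference, this example should build with a line similar to the one in the question, assuming Eigen's headers live under /usr/include/eigen3 and your clang build accepts OpenMP 4.0 user-defined reductions:
clang++-3.8 energy_integration.cpp -I/usr/include/eigen3 -fopenmp=libiomp5 -O2 -o energy_integration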

Related

OpenMP parallel for with vector of vectors

I have a fixed-size 2D matrix of size W x H, where each element of the matrix is a std::vector. The data is stored in a vector of vectors with a linearized index. I'm trying to find a way to concurrently fill the output vector. Here is some code to indicate what I'm trying to do.
#include <cmath>
#include <chrono>
#include <iostream>
#include <mutex>
#include <vector>
#include <omp.h>
struct Vector2d
{
double x;
double y;
};
double generate(double range_min, double range_max)
{
double val = (double)rand() / RAND_MAX;
return range_min + val * (range_max - range_min);
}
int main(int argc, char** argv)
{
(void)argc;
(void)argv;
// generate input data
std::vector<Vector2d> points;
size_t num = 10000000;
size_t w = 100;
size_t h = 100;
for (size_t i = 0; i < num; ++i)
{
Vector2d point;
point.x = generate(0, w);
point.y = generate(0, h);
points.push_back(point);
}
// output
std::vector<std::vector<Vector2d> > output(num, std::vector<Vector2d>());
std::mutex mutex;
auto start = std::chrono::system_clock::now();
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
mutex.lock();
output[id].push_back(point);
mutex.unlock();
}
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
return 0;
}
The problem is that the code is much slower with OpenMP enabled. I found some examples that fill a std::vector using a reduction, but I don't know how to adapt that to a vector of vectors. Any help is appreciated, thanks!
There are some things you could do to improve the performance:
I would preallocate the inner vectors holding the Vector2d elements, because every time you push_back a new Vector2d and the capacity of the std::vector is exceeded, it will reallocate. So if you do not mind having default-initialized Vector2ds in your std::vector, I would simply use:
std::vector<std::vector<Vector2d> > output(num,
std::vector<Vector2d>(num, Vector2d(/*whatever goes in here*/)));
Then in your for loop, you could access the elements of the inner vector via operator[], which allows you to get rid of the lock.
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
output[id][i] = point;
}
Though I'm not sure the before-mentioned way works with what you want to do. Otherwise, you could reserve the storage for each std::vector<Vector2d>, which would leave you with your initial loop:
std::vector<std::vector<Vector2d> > output(num, std::vector<Vector2d>());
for(int i = 0; i < num; ++i) {
output[i].reserve(num);
}
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
mutex.lock();
output[id].push_back(point);
mutex.unlock();
}
Which means you get rid of the vector re-allocation, but you still have the mutex...
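If the remaining mutex becomes the bottleneck, a rough sketch of an alternative (assuming an output sized w*h rather than num, and a smaller num just to keep the sketch short) is to give each thread its own private grid of buckets and merge them at the end:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>
#include <omp.h>

struct Vector2d { double x; double y; };

int main()
{
    // Smaller num than in the question, just to keep the sketch quick to run.
    const size_t num = 1000000, w = 100, h = 100;
    std::vector<Vector2d> points(num);
    for (size_t i = 0; i < num; ++i) {
        points[i].x = (double)rand() / RAND_MAX * w;
        points[i].y = (double)rand() / RAND_MAX * h;
    }

    std::vector<std::vector<Vector2d> > output(w * h);
    #pragma omp parallel
    {
        // One private grid per thread: no locking inside the hot loop.
        std::vector<std::vector<Vector2d> > local(w * h);
        #pragma omp for nowait
        for (size_t i = 0; i < num; ++i) {
            size_t x = std::min((size_t)std::floor(points[i].x), w - 1); // clamp the rare x == w case
            size_t y = std::min((size_t)std::floor(points[i].y), h - 1); // clamp the rare y == h case
            local[y * w + x].push_back(points[i]);
        }
        // Merge each private grid into the shared one, one thread at a time.
        #pragma omp critical
        for (size_t id = 0; id < output.size(); ++id)
            output[id].insert(output[id].end(), local[id].begin(), local[id].end());
    }
    return 0;
}

Each bucket then grows without contention; the trade-off is that each thread temporarily holds its own copy of the points it processed until the merge.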

Why is this vectorized code subject to vector size?

I compile the following code without vectorization (-O2) and compare the time with vectorization (-O3 -march=native) for three different vector lengths (determined by uncommenting the respective #define SIZE), obtaining 29::9, 247::145 and 4866::4884, for vector sizes 10000, 100000 and 1000000, respectively.
#include <iostream>
#include <random>
#include <chrono>
#include <cmath>
using namespace std;
using namespace std::chrono;
//#define SIZE (10000) // 29::9
//#define SIZE (100000) // 247::145
#define SIZE (1000000) // 4866::4884
void vector_op_2(int * __restrict__ v1, int * __restrict__ v2) {
for (unsigned i = 0; i < SIZE; i++)
v1[i] = 2 * v2[i];
}
int main() {
using namespace std;
int* v = new int[SIZE];
int* w = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
v[i] = i;
}
auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
for (int k = 0; k < 5000; k++) {
vector_op_2(w, v);
}
auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
std::cout << "Time " << end.count() - start.count() << std::endl;
for (int i = 0; i < SIZE; i++) {
if (abs(w[i]-2*v[i])>0.01) {
throw 1;
}
}
delete[] v;
return 0;
}
Why does no speedup occur in the case of vector size 1000000?
What is the optimal length?
Why does this vector length issue not occur with the following example?
[shortened]
long vector_op_1(int v[SIZE]) throw()
{
long s = 0;
for (unsigned i=0; i<SIZE; i++) s += v[i];
return s;
}
[... I am using g++ 7 on Ubuntu 16.04 ...]
[... For short vector size 1000 I am achieving a 6:1 ratio! ...]

Declaring array as a shared variable in pragma parallel directive and stabilizing the code

I have been trying to parallelize computing the sum of a series with a certain number of terms, distributing the terms to the processors using block allocation.
In this program, I am generating an arithmetic series, and I want to pass the array as a shared variable in the pragma and restructure the parallel directive.
I am new to OpenMP with C. Kindly help me to pass the array as a shared variable and to stabilize the code. I am attaching the code below.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main (int argc, char *argv[])
{
int rank, comm_sz;
int number, i, first, difference, global_sum1, global_sum, nprocs, step, local_sum1, local_n;
int* a;
int BLOCK_LOW, BLOCK_HIGH;
double t0, t1;
comm_sz = atoi(argv[1]);
first = atoi(argv[2]);
difference = atoi(argv[3]);
number = atoi(argv[4]);
omp_set_num_threads (comm_sz);
rank = omp_get_thread_num();
a = (int*) malloc (n*sizeof(int));
printf("comm_sz=%d, first=%d, difference=%d, number of terms=%d\n",comm_sz, first, difference, number);
for(i=1; i <= number; i++){
a[i-1] = first + (i-1)*difference;
printf("a[%d]=%d\n",i-1,a[i]);
}
for(i=0; i < number; i++){
printf("a[%d]=%d\n",i,a[i]);}
t0 = omp_get_wtime();
#pragma omp parallel omp_set_num_threads(comm_sz, number, comm_sz, first, difference, global_sum1)
{
BLOCK_LOW = (rank * number)/comm_sz;
BLOCK_HIGH = ((rank+1) * number)/comm_sz;
#pragma omp parallel while private(i, local_sum1)
//int local_sum1 = 0;
i=BLOCK_LOW;
while( i < BLOCK_HIGH )
{
printf("%d, %d\n",BLOCK_LOW,BLOCK_HIGH);
local_sum1 = local_sum1 + a[i];
i++;
}
//global_sum1 = global_sum1 + local_sum1;
#pragma omp while reduction(+:sum1)
i=0;
for (i < comm_sz) {
global_sum1 = global_sum1 + local_sum1;
i++;
}
}
step = 2*first + (n-1)*difference;
sum = 0.5*n*step;
printf("sum is %d\n", global_sum );
t1 = omp_get_wtime();
printf("Estimate of pi: %7.5f\n", global_sum1);
printf("Time: %7.2f\n", t1-t0);
}
There are several mistakes in your code. I've tried to infer what you would like to do. So, I have rewritten your code according to my understanding.
Here is my suggestion:
int main (int argc, char *argv[])
{
int comm_sz, number, i, first, difference, global_sum, step;
int* a;
double t0, t1, sum;
comm_sz = atoi(argv[1]);
first = atoi(argv[2]);
difference = atoi(argv[3]);
number = atoi(argv[4]);
omp_set_num_threads (comm_sz);
a = (int*) malloc (number*sizeof(int));
printf("comm_sz=%d, first=%d, difference=%d, number of terms=%d\n",comm_sz, first, difference, number);
for(i=0; i < number; i++){
a[i] = first + (i)*difference;
printf("a[%d]=%d\n",i,a[i]);
}
t0 = omp_get_wtime();
global_sum = 0;
#pragma omp parallel for private(i) reduction(+:global_sum)
for (i=0; i < number; i++){
global_sum += a[i];
}
step = 2*first + (number-1)*difference;
sum = 0.5*number*step;
t1 = omp_get_wtime();
printf("sum is %d\n", global_sum);
printf("Estimate of pi: %7.5f\n", sum);
printf("Time: %7.2f\n", t1-t0);
}
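Assuming the includes from your original file (stdio.h, stdlib.h, omp.h) are kept, this can be compiled and run, for example, as follows (the file name is just a placeholder):
gcc -fopenmp series_sum.c -o series_sum
./series_sum 4 1 1 100
With first = 1, difference = 1 and 100 terms, both the parallel reduction and the closed-form 0.5*number*step check should give 5050.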

Vectorizing a program increases runtime

I am asked to vectorize a larger program. Before I started with the big program I wanted to see the effect of vectorization in an isolated case. For this I created two programs that should show the idea of the intended transformation: one with an array of structs (no vectorization) and one with a struct of arrays (with vectorization). I expected that the SoA version would outperform the AoS version by far, but it doesn't.
measured program loop A
for (int i = 0; i < NUM; i++) {
ptr[i].c = ptr[i].a + ptr[i].b;
}
full program:
#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct myStruct {
double a, b, c;
};
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
struct myStruct *ptr = (struct myStruct *) malloc(NUM * sizeof(struct myStruct));
for (int i = 0; i < NUM; i++) {
ptr[i].a = i;
ptr[i].b = 2 * i;
}
t1 = high_resolution_clock::now();
for (int i = 0; i < NUM; i++) {
ptr[i].c = ptr[i].a + ptr[i].b;
}
t2 = high_resolution_clock::now();
long dur = duration_cast<microseconds>( t2 - t1 ).count();
cout << "took "<<dur << endl;
double sum = 0;
for (int i = 0; i < NUM; i++) {
sum += ptr[i].c;
}
cout << "sum is "<< sum << endl;
}
measured program loop B
#pragma simd
for (int i = 0; i < NUM; i++) {
C[i] = A[i] + B[i];
}
full program:
#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
double *A = (double *) malloc(NUM * sizeof(double));
double *B = (double *) malloc(NUM * sizeof(double));
double *C = (double *) malloc(NUM * sizeof(double));
for (int i = 0; i < NUM; i++) {
A[i] = i;
B[i] = 2 * i;
}
t1 = high_resolution_clock::now();
#pragma simd
for (int i = 0; i < NUM; i++) {
C[i] = A[i] + B[i];
}
t2 = high_resolution_clock::now();
long dur = duration_cast<microseconds>( t2 - t1 ).count();
cout << "Aos "<<dur << endl;
double sum = 0;
for (int i = 0; i < NUM; i++) {
sum += C[i];
}
cout << "sum "<<sum;
}
I compile with
icpc vectorization_aos.cpp -qopenmp --std=c++11 -cxxlib=/lrz/mnt/sys.x86_64/compilers/gcc/4.9.3/
icpc (v16)
compiled and executed on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
In my test cases program A takes around 300 ms and program B around 350 ms. If I add unnecessary additional data to the struct in A, it becomes increasingly slower (as more memory has to be loaded).
The -O3 flag does not have any impact on run time.
Removing the #pragma simd directive also has no impact, so either the loop is auto-vectorized or my vectorization does not work at all.
Questions:
Am I missing something? Is this how one would vectorize a program?
Why is program B slower? Maybe both programs are just memory-bandwidth bound and I need to increase the computational density?
Are there programs/code snippets that show the impact of vectorization better, and how can I verify that my program actually executes vectorized code?
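On the last point, one way to check (assuming an icpc version that supports optimization reports, as v16 should) is to ask the compiler for a per-loop vectorization report, for example:
icpc vectorization_aos.cpp -qopenmp --std=c++11 -qopt-report=5 -qopt-report-phase=vec
This should write a .optrpt file next to the source stating for each loop whether it was vectorized and, if not, why.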

Run method and wait to finish all for loops openmp

I created a method in C++ to find a number's divisors. The second step was to use OpenMP in C++.
Unfortunately, I can't figure out why my function doStuff throws a memory error. The problem is probably with the threads and that I check the arrays before all threads have stopped. Could someone help me?
There is no need to read my whole program; the problem is in doStuff():
#include <iostream>
#include <vector>
#include <string>
#include <cmath>
#include <algorithm>
#include "omp.h"
using namespace std;
vector<int> dividors;
int NUMBER = 1000;
bool ifContains(vector<int> element, int dividedNumber)
{
for(int i=0; i<dividors.size(); i++)
{
if(dividors[i] == dividedNumber)
return true;
}
return false;
}
void doStuff()
{
int sqr = (int) sqrt(NUMBER);
int sqrp1 = sqr + 1;
#pragma omp parallel
{
#pragma omp for nowait
for (int i = 1; i < sqrp1; i++)
{
if (NUMBER % i == 0)
{
if (!ifContains(dividors, i))
dividors.push_back(i);
int dividednumber = NUMBER / i;
if (!ifContains(dividors,dividednumber))
dividors.push_back(dividednumber);
}
}
sort(dividors.begin(), dividors.end());
#pragma omp for nowait
for (int i = 0; i < dividors.size(); i++)
{
cout << dividors[i] << "\r\n";
}
}
}
int main()
{
doStuff();
return 0;
}
I also tried this, but it doesn't work:
void doStuff()
{
int sqr = (int) sqrt(NUMBER);
int sqrp1 = sqr + 1;
#pragma omp parallel
{
#pragma omp for
for (int i = 1; i < sqrp1; i++)
{
if (NUMBER % i == 0)
{
if (!ifContains(dividors, i))
dividors.push_back(i);
int dividednumber = NUMBER / i;
if (!ifContains(dividors,dividednumber))
dividors.push_back(dividednumber);
}
}
#pragma omp single
sort(dividors.begin(), dividors.end());
#pragma omp single
for (int i = 0; i < dividors.size(); i++)
{
cout << dividors[i] << "\r\n";
}
}
}
There are several ways to fix this. The simplest is just to use the ordered clause; see the first code version below. However, this removes some of the parallelism. A better way is to declare private dividors vectors (which I call dividors_private) inside the parallel block, so that each thread gets its own private version, and then write to the dividors vector in a critical block. The sorting is done on the private vectors in parallel. A final sort is done on dividors in a single thread, but since most of it is already sorted it goes fast. See the second code version below.
The version of the code with ordered:
The version of the code with ordered:
#include <iostream>
#include <vector>
#include <string>
#include <cmath>
#include <algorithm>
#include "omp.h"
using namespace std;
vector<int> dividors;
int NUMBER = 1000;
bool ifContains(vector<int> dividors, int dividedNumber)
{
for(int i=0; i<dividors.size(); i++)
{
if(dividors[i] == dividedNumber)
return true;
}
return false;
}
void doStuff()
{
int sqr = (int) sqrt(NUMBER);
int sqrp1 = sqr + 1;
#pragma omp parallel
{
#pragma omp for ordered
for (int i = 1; i < sqrp1; i++)
{
if (NUMBER % i == 0)
{
#pragma omp ordered
{
if (!ifContains(dividors, i))
dividors.push_back(i);
int dividednumber = NUMBER / i;
if (!ifContains(dividors, dividednumber))
dividors.push_back(dividednumber);
}
}
}
}
sort(dividors.begin(), dividors.end());
for (int i = 0; i < dividors.size(); i++)
{
cout << dividors[i] << "\r\n";
}
}
int main()
{
doStuff();
return 0;
}
The version of the code which uses private dividors vectors
#include <iostream>
#include <vector>
#include <string>
#include <cmath>
#include <algorithm>
#include "omp.h"
using namespace std;
vector<int> dividors;
int NUMBER = 1000;
bool ifContains(vector<int> dividors, int dividedNumber)
{
for(int i=0; i<dividors.size(); i++)
{
if(dividors[i] == dividedNumber)
return true;
}
return false;
}
void doStuff()
{
int sqr = (int) sqrt(NUMBER);
int sqrp1 = sqr + 1;
#pragma omp parallel
{
vector<int> dividors_private;
#pragma omp for nowait
for (int i = 1; i < sqrp1; i++)
{
if (NUMBER % i == 0)
{
//printf("i %d\n", i);
if (!ifContains(dividors_private, i))
dividors_private.push_back(i);
int dividednumber = NUMBER / i;
if (!ifContains(dividors_private, dividednumber))
dividors_private.push_back(dividednumber);
}
}
sort(dividors_private.begin(), dividors_private.end());
#pragma omp critical
{
dividors.insert(dividors.end(), dividors_private.begin(), dividors_private.end());
}
}
sort(dividors.begin(), dividors.end());
for (int i = 0; i < dividors.size(); i++)
{
cout << dividors[i] << "\r\n";
}
}
int main()
{
doStuff();
return 0;
}
I ran your code in gdb and got random crashes on the calls to dividors.push_back(...). This seems to be a race condition, and the reason is that you are changing the dividors vector from several threads at once; in this sense the std::vector class is not thread-safe. See std::vector, thread-safety, multi-threading.
What you have to do is make sure that no thread changes the vector while another thread changes or reads it. This goes for sorting it on every thread too. Do the final sort in a #pragma omp single (or outside the parallel region, as above); it only needs to be sorted once, and especially not from several threads at once.
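For illustration, here is a minimal sketch of the #pragma omp single variant just mentioned (the function name is mine; note the explicit barrier so that every thread has merged its private vector before one thread sorts):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
#include <omp.h>
using namespace std;

vector<int> dividors;
int NUMBER = 1000;

void doStuffSingleSort()
{
    int sqrp1 = (int) sqrt((double) NUMBER) + 1;
    #pragma omp parallel
    {
        vector<int> priv;                 // per-thread buffer, no data race
        #pragma omp for nowait
        for (int i = 1; i < sqrp1; i++)
            if (NUMBER % i == 0) {
                priv.push_back(i);
                if (NUMBER / i != i) priv.push_back(NUMBER / i);
            }
        #pragma omp critical
        dividors.insert(dividors.end(), priv.begin(), priv.end());
        #pragma omp barrier               // wait until every thread has merged
        #pragma omp single                // exactly one thread sorts the shared vector
        sort(dividors.begin(), dividors.end());
    }
}

int main()
{
    doStuffSingleSort();
    for (size_t i = 0; i < dividors.size(); i++)
        cout << dividors[i] << "\n";
    return 0;
}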