There is a piece of code:
#include <iostream>
#include <array>
#include <limits>
#include <random>
#include <omp.h>

class DBase
{
public:
    DBase()
    {
        delta = (xmax - xmin) / n;
        for(int i = 0; i < n + 1; ++i) x.at(i) = xmin + i * delta;
        y = {1.0, 3.0, 9.0, 15.0, 20.0, 17.0, 13.0, 9.0, 5.0, 4.0, 1.0};
    }
    double GetXmax() { return xmax; }
    // Linear interpolation on the fixed grid x with tabulated values y.
    double interpolate(double xx)
    {
        int bin = xx / delta;
        if(bin < 0 || bin > n - 1) return 0.0;
        double res = y.at(bin) + (y.at(bin+1) - y.at(bin)) * (xx - x.at(bin))
                                 / (x.at(bin+1) - x.at(bin));
        return res;
    }
private:
    static constexpr int n = 10;
    double xmin = 0.0;
    double xmax = 10.0;
    double delta;
    std::array<double, n+1> x;
    std::array<double, n+1> y;
};

int main(int argc, char *argv[])
{
    DBase dbase;
    const int N = 10000;
    std::array<double, N> rnd{0.0};
    std::array<double, N> src{0.0};
    std::array<double, N> res{0.0};
    unsigned seed = 1;
    std::default_random_engine generator(seed);
    // Pre-generate all random numbers serially.
    for(int i = 0; i < N; ++i)
        rnd.at(i) = std::generate_canonical<double, std::numeric_limits<double>::digits>(generator);
    #pragma omp parallel for
    for(int i = 0; i < N; ++i)
    {
        src.at(i) = rnd.at(i) * dbase.GetXmax();
        res.at(i) = dbase.interpolate(rnd.at(i) * dbase.GetXmax());
    }
    for(int i = 0; i < N; ++i)
        std::cout << "(" << src.at(i) << " , " << res.at(i) << ") " << std::endl;
    return 0;
}
It seems to work properly both with and without #pragma omp parallel for (I checked the output). But I can't understand the following things:
1) Different parallel threads access the same arrays x and y of the object dbase of class DBase (I understand that OpenMP implicitly makes the dbase object shared, i.e. #pragma omp parallel for shared(dbase)). The threads do not write to these arrays, they only read from them. But when they read, can there be a race condition on x and y or not? If not, how is it arranged that at every moment only one thread reads from x and y in interpolate(), so that the threads do not disturb each other? Or is there perhaps a local copy of the dbase object and its x and y arrays in each OpenMP thread (which would be equivalent to #pragma omp parallel for private(dbase))?
2) Shall I write #pragma omp parallel for shared(dbase) in such code, or is #pragma omp parallel for enough?
3) I think that if I placed a single random number generator inside the for-loop, then to make it work properly (i.e. not let its internal state get into a race condition) I would have to write
#pragma omp parallel for
for(int i = 0; i < N; ++i)
{
    src.at(i) = rnd.at(i) * dbase.GetXmax();
    #pragma omp atomic
    std::generate_canonical<double, std::numeric_limits<double>::digits>(generator);
    res.at(i) = dbase.interpolate(rnd.at(i) * dbase.GetXmax());
}
The #pragma omp atomic would destroy the performance gain from #pragma omp parallel for, because it would make the threads wait for each other. So the only correct ways to use random numbers inside a parallel region are to have a separate generator (or seed) for each thread, or to prepare all the needed random numbers before the for-loop. Is that correct?
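To make the per-thread-generator variant concrete, here is a minimal sketch of what I have in mind (seeding each engine with seed + thread number is just one simple choice and not statistically ideal; the rest reuses the names from the code above):

#pragma omp parallel
{
    // Each thread owns its engine, so there is no shared generator state to race on.
    std::default_random_engine gen(seed + omp_get_thread_num());
    #pragma omp for
    for(int i = 0; i < N; ++i)
    {
        double r = std::generate_canonical<double, std::numeric_limits<double>::digits>(gen);
        src.at(i) = r * dbase.GetXmax();
        res.at(i) = dbase.interpolate(src.at(i));
    }
}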
I want to parallelize a program I have written which calculates a series involving matrix and vector products, with the result being a vector. Since the arguments become both very small and very large, I use ARB (based on GMP, MPFR and FLINT) to prevent loss of significance.
Also, since the series elements are independent, the matrix dimensions are not big, and the series needs to be evaluated up to 50k elements, it makes sense to have several threads each compute a chunk of the series, e.g. 5 threads could each compute 10k elements in parallel and then add up the resulting vectors.
The issue is that the ARB function for adding up vectors and matrices is not a standard operation that can easily be used within an OpenMP reduction.
When I naively try to write a custom reduction, g++ complains about the void type, since operations in ARB do not have a return value:
void arb_add(arb_t z, const arb_t x, const arb_t y, slong prec)
calculates and sets z to z = x + y with a precision of prec bits, but arb_add itself is a void function.
As an example, a for-loop for a similar problem looks like this (my actual program is different, of course):
[...]
arb_mat_t RMa,RMb,RMI,RMP,RMV,RRes;
arb_mat_init(RMa,Nmax,Nmax); //3 matrices
arb_mat_init(RMb,Nmax,Nmax);
arb_mat_init(RMI,Nmax,Nmax);
arb_mat_init(RMV,Nmax,1); // 3 vectors
arb_mat_init(RMP,Nmax,1);
arb_mat_init(RRes,Nmax,1);
[...]
//Result= V + ABV +(AB)^2 V + (AB)^3 V + ...
//in my actual program A and B would be j- and k-dependent and would
//include more matrices and vectors
#pragma omp parallel for collapse(1) private(j)
for(j = 0; j < jmax; j++){
    arb_mat_one(RMI); //sets the matrix RMI to 1
    for(k = 0; k < j; k++){
        Qmmd(RMI, RMI, RMa, Nmax, prec); //RMI=RMI*RMa
        Qmmd(RMI, RMI, RMb, Nmax, prec); //RMI=RMI*RMb
        cout << "j=" << j << ", k=" << k << "\r" << flush;
    }
    Qmvd(RMP, RMI, RMV, Nmax, prec);    //RMP=RMI*RMV
    arb_mat_add(RRes, RRes, RMP, prec); //Result=Result + RMP
}
[...]
which of course breaks down when using more than one thread, because I did not specify a reduction on RRes. Here Qmmd() and Qmvd() are self-written matrix-matrix and matrix-vector product functions, and RMa, RMb, and RMV are random matrices and vectors, respectively.
The idea is now to reduce RRes, so that each thread computes a private version of RRes containing a fraction of the final result, and these are added up at the end using arb_mat_add. I could write a function matrixadd(A,B) to compute A=A+B:
void matrixadd(arb_mat_t A, arb_mat_t B) {
    arb_mat_add(A, A, B, 2000);
    //A=A+B, the last value is the precision in bits used for that operation
}
and then eventually
#pragma omp declare reduction \
    (myadd : void : matrixadd(omp_out, omp_in))
#pragma omp parallel for private(j) reduction(myadd:RRes)
for(j = 0; j < jmax; j++){
    arb_mat_one(RMI);
    for(k = 0; k < j; k++){
        Qmmd(RMI, RMI, RMa, Nmax, prec);
        Qmmd(RMI, RMI, RMb, Nmax, prec);
        cout << "j=" << j << ", k=" << k << "\r" << flush;
    }
    Qmvd(RMP, RMI, RMV, Nmax, prec);
    matrixadd(RRes, RMP);
}
GCC is not happy with this:
main.cpp: In function ‘int main()’:
main.cpp:503:46: error: invalid use of void expression
(myadd : void : matrixadd(omp_out, omp_in))
^
main.cpp:504:114: error: ‘RRes’ has invalid type for ‘reduction’
Can OpenMP understand my void reduction, and can it be made to work with ARB and GMP? If so, how? Thanks!
(Also, my program currently includes a convergence check with a break condition in the j-for-loop. If you know how to easily implement such a thing as well, I would be very grateful, because for my current OpenMP tests I just removed the break and set a constant jmax.)
My question is very similar to this one.
Edit: Sorry, here is my attempt at a minimal, complete and verifiable example. The additionally required packages are arb, flint, gmp, mpfr (available through package managers) and gmpfrxx.
#include <iostream>
#include <omp.h>
#include <cmath>
#include <ctime>
#include <gmp.h>
#include "gmpfrxx/gmpfrxx.h"
#include "arb.h"
#include "acb.h"
#include "arb_mat.h"

using namespace std;
void generate_matrixARBdeterministic(arb_mat_t Mat, int N, double w2) //generates some matrix
{
    int i, j;
    double what;
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            what = (i*j + 30/w2) / ((1 + 0.1*w2)*j + 20 - w2);
            arb_set_d(arb_mat_entry(Mat, i, j), what);
        }
    }
}

void generate_vecARBdeterministic(arb_mat_t Mat, int N) //generates some vector
{
    int i;
    double what;
    for(i = 0; i < N; i++)
    {
        what = (4*i*i + 40)/200;
        arb_set_d(arb_mat_entry(Mat, i, 0), what);
    }
}
void Qmmd(arb_mat_t res, arb_mat_t MA, arb_mat_t MB, int NM, slong prec)
{   ///res=M*M=Matrix * Matrix
    arb_t Qh1;
    arb_mat_t QMh;
    arb_init(Qh1);
    arb_mat_init(QMh, NM, NM);
    for(int i = 0; i < NM; i++){
        for(int j = 0; j < NM; j++){
            for(int k = 0; k < NM; k++){
                arb_mul(Qh1, arb_mat_entry(MA, i, k), arb_mat_entry(MB, k, j), prec);
                arb_add(arb_mat_entry(QMh, i, j), arb_mat_entry(QMh, i, j), Qh1, prec);
            }
        }
    }
    arb_mat_set(res, QMh);
    arb_mat_clear(QMh);
    arb_clear(Qh1);
}
void Qmvd(arb_mat_t res, arb_mat_t M, arb_mat_t V, int NM, slong prec) //res=M*V=Matrix * Vector
{   ///res=M*V
    arb_t Qh, Qs;
    arb_mat_t QMh;
    arb_init(Qh);
    arb_init(Qs);
    arb_mat_init(QMh, NM, 1);
    arb_set_ui(Qh, 0);
    arb_set_ui(Qs, 0);
    arb_mat_zero(QMh);
    for(int i = 0; i < NM; i++){
        arb_set_ui(Qs, 0);
        for(int j = 0; j < NM; j++){
            arb_mul(Qh, arb_mat_entry(M, i, j), arb_mat_entry(V, j, 0), prec);
            arb_add(Qs, Qs, Qh, prec);
        }
        arb_set(arb_mat_entry(QMh, i, 0), Qs);
    }
    arb_mat_set(res, QMh);
    arb_mat_clear(QMh);
    arb_clear(Qh);
    arb_clear(Qs);
}
void QPrintV(arb_mat_t A, int N){ //Prints Vector
    for(int i = 0; i < N; i++){
        cout << arb_get_str(arb_mat_entry(A, i, 0), 5, 0) << endl; //ARB_STR_NO_RADIUS
    }
}

void matrixadd(arb_mat_t A, arb_mat_t B) {
    arb_mat_add(A, A, B, 2000); //A=A+B with 2000 bits of precision
}
int main() {
    int Nmax = 10, jmax = 300; //matrix dimension and max of j-loop
    ulong prec = 2000;         //precision for arb
    //initializations
    arb_mat_t RMa, RMb, RMI, RMP, RMV, RRes;
    arb_mat_init(RMa, Nmax, Nmax);
    arb_mat_init(RMb, Nmax, Nmax);
    arb_mat_init(RMI, Nmax, Nmax);
    arb_mat_init(RMV, Nmax, 1);
    arb_mat_init(RMP, Nmax, 1);
    arb_mat_init(RRes, Nmax, 1);
    omp_set_num_threads(1);
    cout << "Maximal threads is " << omp_get_max_threads() << endl;
    generate_matrixARBdeterministic(RMa, Nmax, 1.0); //generates some Matrix for RMa
    arb_mat_set(RMb, RMa);                           //sets RMb=RMa
    generate_vecARBdeterministic(RMV, Nmax);         //generates some vector
    double st = omp_get_wtime();
    Qmmd(RMI, RMa, RMb, Nmax, prec);
    int j, k = 0;
    #pragma omp declare reduction \
        (myadd : void : matrixadd(omp_out, omp_in))
    #pragma omp parallel for private(j) reduction(myadd:RRes)
    for(j = 0; j < jmax; j++){
        arb_mat_one(RMI);
        for(k = 0; k < j; k++){
            Qmmd(RMI, RMI, RMa, Nmax, prec);
            Qmmd(RMI, RMI, RMb, Nmax, prec);
            cout << "j=" << j << ", k=" << k << "\r" << flush;
        }
        Qmvd(RMP, RMI, RMV, Nmax, prec);
        matrixadd(RRes, RMP);
    }
    QPrintV(RRes, Nmax);
    double en = omp_get_wtime();
    printf("\n Time it took was %lfs\n", en - st);
    arb_mat_clear(RMa);
    arb_mat_clear(RMb);
    arb_mat_clear(RMV);
    arb_mat_clear(RMP);
    arb_mat_clear(RMI);
    arb_mat_clear(RRes);
    return 0;
}
and
g++ test.cpp -g -fexceptions -O3 -ltbb -fopenmp -lmpfr -lflint -lgmp -lgmpxx -larb -I../../PersonalLib -std=c++14 -lm -o b.out
You can do the reduction by hand like this:
#pragma omp parallel
{
    arb_mat_t RMI, RMP;
    arb_mat_init(RMI, Nmax, Nmax); //allocate memory
    arb_mat_init(RMP, Nmax, 1);    //allocate memory
    #pragma omp for
    for(int j = 0; j < jmax; j++){
        arb_mat_one(RMI);
        for(int k = 0; k < j; k++){
            Qmmd(RMI, RMI, RMa, Nmax, prec);
            Qmmd(RMI, RMI, RMb, Nmax, prec);
        }
        Qmvd(RMP, RMI, RMV, Nmax, prec);
        #pragma omp critical
        arb_mat_add(RRes, RRes, RMP, prec);
    }
    arb_mat_clear(RMI); //deallocate memory
    arb_mat_clear(RMP); //deallocate memory
}
If you want to use declare reduction you need to make a C++ wrapper for arb_mat_t. Using declare reduction lets OpenMP decide how to do the reduction. But I highly doubt you will find a case where this gives better performance than the manual case.
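For completeness, a rough, untested sketch of what such a wrapper could look like (the fixed dimension NMAX, the precision PREC, and the names ArbVec and matsum are all invented here, and RRes would itself have to become an ArbVec):

// Hypothetical sketch: a thin RAII wrapper around arb_mat_t so that a class
// type with a default constructor can appear in a declare reduction.
constexpr int   NMAX = 10;
constexpr slong PREC = 2000;

struct ArbVec {
    arb_mat_t m;
    ArbVec()  { arb_mat_init(m, NMAX, 1); arb_mat_zero(m); } // identity element: zero vector
    ~ArbVec() { arb_mat_clear(m); }
    ArbVec(const ArbVec&) = delete;            // the reduction never copies;
    ArbVec& operator=(const ArbVec&) = delete; // forbid copies to avoid a double clear
};

// With no initializer clause, each private copy is default-constructed,
// i.e. it starts as a zero vector, which is exactly the identity needed here.
#pragma omp declare reduction(matsum : ArbVec : \
        arb_mat_add(omp_out.m, omp_out.m, omp_in.m, PREC))

// Usage sketch:
//   ArbVec RRes;
//   #pragma omp parallel for reduction(matsum : RRes)
//   for (int j = 0; j < jmax; j++) { ...; arb_mat_add(RRes.m, RRes.m, RMP, PREC); }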
Alternatively, you can create an array of matrices, one per thread, to sum into. You simply replace matrixadd(RRes, RMP) with matrixadd(part[omp_get_thread_num()], RMP), where part is an array of per-thread partial results, and then sum all the parts into RRes at the end.
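A rough sketch of that idea (untested; note that arb_mat_t is an array typedef, so it is easiest to store arb_mat_struct elements and pass their addresses, which is what the ARB functions receive anyway; needs #include <vector>, and the name part is made up here):

int nthreads = omp_get_max_threads();
std::vector<arb_mat_struct> part(nthreads);  // one partial sum per thread
for(int t = 0; t < nthreads; t++)
    arb_mat_init(&part[t], Nmax, 1);         // arb_mat_init zero-initializes the entries

#pragma omp parallel
{
    arb_mat_t RMI, RMP;
    arb_mat_init(RMI, Nmax, Nmax);
    arb_mat_init(RMP, Nmax, 1);
    arb_mat_struct *mine = &part[omp_get_thread_num()];
    #pragma omp for
    for(int j = 0; j < jmax; j++){
        arb_mat_one(RMI);
        for(int k = 0; k < j; k++){
            Qmmd(RMI, RMI, RMa, Nmax, prec);
            Qmmd(RMI, RMI, RMb, Nmax, prec);
        }
        Qmvd(RMP, RMI, RMV, Nmax, prec);
        arb_mat_add(mine, mine, RMP, prec);  // no critical section needed here
    }
    arb_mat_clear(RMI);
    arb_mat_clear(RMP);
}

for(int t = 0; t < nthreads; t++){           // serial final reduction
    arb_mat_add(RRes, RRes, &part[t], prec);
    arb_mat_clear(&part[t]);
}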
Or you could try to define an addition operator for a wrapper class; of course, you should be careful to avoid copying the entire matrix. This feels like more of a hassle, as you have to be a bit careful with the memory management (since you're using a library, you don't know exactly what it does unless you take the time to go through it all).
I am trying to implement a procedure in parallel with OpenMP. It contains four nested (dependent) for-loops and a variable sum_p that is updated in the innermost loop. In short, my question is about the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
    bin_i = test_map.at(i);            //test_map is an STL map of vectors
    len_rank_bin_i = bin_i.size();     //bin_i is a vector
    for (int j = (i - 1); j >= 0; --j) {
        bin_j = test_map.at(j);
        len_rank_bin_j = bin_j.size();
        for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
            node_u = bin_i[u_i];       //node_u is a scalar
            for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                node_v = bin_j[v_i];
                if (node_u > node_v)
                    sum_p += 1;
            }
        }
    }
}
The full program is given below:
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <time.h>
int main(int argc, char* argv[]){
    double time_temp;
    int test_map_size = 5000;
    std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);

    // Fill the test map with random integers ---------------------------------
    std::random_device rd;
    std::mt19937 gen1(rd());
    std::uniform_int_distribution<int> dist(1, 5);
    auto gen = std::bind(dist, gen1);
    for(int i = 0; i < test_map_size; i++)
    {
        int vector_len = dist(gen1);
        std::vector<unsigned int> tt(vector_len);
        std::generate(begin(tt), end(tt), gen);
        test_map.insert({i, tt});
    }
    // Sequential implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    std::vector<unsigned int> bin_i, bin_j;
    unsigned int node_v, node_u;
    unsigned int len_rank_bin_i;
    unsigned int len_rank_bin_j;
    int sum_s = 0;
    for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
        bin_i = test_map.at(i);
        len_rank_bin_i = bin_i.size();
        for (unsigned int j = i; j-- > 0; ) {
            bin_j = test_map.at(j);
            len_rank_bin_j = bin_j.size();
            for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
                node_u = bin_i[u_i];
                for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                    node_v = bin_j[v_i];
                    if (node_u > node_v)
                        sum_s += 1;
                }
            }
        }
    }
    std::cout << "Estimated sum (seq): " << sum_s << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for sequential implementation: %.2fs\n", time_temp);
    // Parallel implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    int sum_p = 0;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        std::vector<unsigned int> bin_i, bin_j;
        unsigned int node_v, node_u;
        unsigned int len_rank_bin_i;
        unsigned int len_rank_bin_j;
        unsigned int i, u_i, v_i;
        int j;
        #pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
        for (i = (test_map_size - 1); i >= 1; --i) {
            bin_i = test_map.at(i);
            len_rank_bin_i = bin_i.size();
            #pragma omp parallel for private(u_i,v_i)
            for (j = (i - 1); j >= 0; --j) {
                bin_j = test_map.at(j);
                len_rank_bin_j = bin_j.size();
                #pragma omp parallel for private(v_i)
                for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
                    node_u = bin_i[u_i];
                    #pragma omp parallel for
                    for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
                        node_v = bin_j[v_i];
                        if (node_u > node_v)
                            sum_p += 1;
                    }
                }
            }
        }
    }
    std::cout << "Estimated sum (parallel): " << sum_p << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for parallel implementation: %.2fs\n", time_temp);
    return 0;
}
Running the code with the command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp on macOS 10.13.3 (i5 processor with four logical cores) gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The time taken by the parallel implementation is many times higher than that of the sequential implementation. Do you think the code or the logic can be adapted to a parallel implementation? I have spent a few days trying to improve the terrible performance of my code, but to no avail. Any help is greatly appreciated.
Update
With the changes suggested by JimCownie, i.e., "using omp for, not omp parallel for" and removing the parallelism of the inner loops, the performance is greatly improved (a sketch of the corrected loop is shown after the numbers below).
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
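For reference, the parallel part with those changes applied looks roughly like this (my reconstruction, not the exact code; schedule(dynamic) is just one reasonable choice, since the outer iterations do very different amounts of work):

// Single parallel region over the outer loop only; sum_p is handled by a reduction.
#pragma omp parallel for schedule(dynamic) reduction(+:sum_p)
for (int i = test_map_size - 1; i >= 1; --i) {
    const std::vector<unsigned int> &bin_i = test_map.at(i); // reference avoids copying the vector
    for (int j = i - 1; j >= 0; --j) {
        const std::vector<unsigned int> &bin_j = test_map.at(j);
        for (unsigned int u_i = 0; u_i < bin_i.size(); u_i++) {
            for (unsigned int v_i = 0; v_i < bin_j.size(); v_i++) {
                if (bin_i[u_i] > bin_j[v_i])
                    sum_p += 1;
            }
        }
    }
}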
My CPU has four logical cores (and I am using four threads); now I am wondering whether there is any way to get four times better performance than the sequential implementation.
I see a different problem when my map of vectors test_map is short but fat at each level, i.e., the map size is small but the vector at each key is very large. In such a case the performance of the sequential and parallel implementations is comparable, without much difference. It seems like we would need to parallelize the inner loops too. Do you know how to achieve that in this context?
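One option I am considering (an untested sketch, reusing the variables from the program above): keep the outer loops serial and parallelize only the innermost pair of loops, which is rectangular for a fixed (i, j), so collapse(2) applies. This should only pay off if the inner vectors are long enough to amortize the cost of entering a parallel region for every (i, j) pair.

for (int i = test_map_size - 1; i >= 1; --i) {
    const std::vector<unsigned int> &bin_i = test_map.at(i);
    for (int j = i - 1; j >= 0; --j) {
        const std::vector<unsigned int> &bin_j = test_map.at(j);
        // Parallelize the two innermost loops; their bounds do not depend on u_i or v_i.
        #pragma omp parallel for collapse(2) reduction(+:sum_p)
        for (size_t u_i = 0; u_i < bin_i.size(); u_i++) {
            for (size_t v_i = 0; v_i < bin_j.size(); v_i++) {
                if (bin_i[u_i] > bin_j[v_i])
                    sum_p += 1;
            }
        }
    }
}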