How to parallelize accumulative probability function with OpenMP? - c++

I'm trying to make my code more efficient by parallelizing it; it calculates the cumulative probability function. I have a vector<double> of radii called r and I need to count how many elements have a radius bigger than a given value R. In addition, I need to calculate the cumulative probability function for the volume.
The code I have is the following one:
int i, j;
double aux, contar, contar1;
vector<double> r, contador, contador1, vol;
for (i = 0; i != r.size() - 1; i++)
{
aux = r[i];
contador[i] = 0;
contador1[i] = 0;
contar = 0;
contar1 = 0;
vol[i] = 0.0;
for (j = 0; j != r.size() - 1; j++)
{
if(aux <= r[j])
{
contar++;
#pragma omp atomic write
vol[i] = vol[i] + 4.0 * 3.141592653589793 * r[j] * r[j] * r[j] / 3.0;
}
if(aux==r[j])
{
contar1++;
}
}
#pragma omp atomic write
contador[i]=contar;
#pragma omp atomic write
contador1[i]=contar1;
}
but it's not efficient at all. Any help with making it more efficient with OpenMP?
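For reference, here is a minimal sketch of one way this could be parallelized (the function name accumulate_cdf is made up, and it assumes r is filled elsewhere and that contador, contador1, and vol are already sized to r.size()). Each outer iteration writes only its own index, so the per-i accumulators can stay local and no atomics are needed:
#include <cstddef>
#include <vector>

// Sketch only: counts, for every radius r[i], how many radii are >= r[i],
// how many are exactly equal, and the corresponding accumulated volume.
// Assumes contador, contador1 and vol already have size r.size().
void accumulate_cdf(const std::vector<double>& r,
                    std::vector<double>& contador,
                    std::vector<double>& contador1,
                    std::vector<double>& vol)
{
    const double pi = 3.141592653589793;
    #pragma omp parallel for
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        double contar = 0.0, contar1 = 0.0, v = 0.0;   // local to the iteration, hence private
        for (std::size_t j = 0; j < r.size(); ++j)
        {
            if (r[i] <= r[j])
            {
                contar += 1.0;
                v += 4.0 * pi * r[j] * r[j] * r[j] / 3.0;
            }
            if (r[i] == r[j])
                contar1 += 1.0;
        }
        contador[i] = contar;    // each thread writes a distinct i, so no atomics are needed
        contador1[i] = contar1;
        vol[i] = v;
    }
}
If only the overall totals were needed rather than one value per radius, a reduction(+:...) clause would be the more idiomatic choice.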

Related

C++ Two Algorithms for Same Procedure Produce Different Results

I have a homework assignment where, given an array of points, we must find the number of point pairs that are less than or equal to a given distance epsilon apart. The first algorithm I wrote gives me the correct answer, which is here:
#pragma omp parallel for schedule(dynamic, CHUNK) num_threads(NB_THREADS) reduction(+:counter) private(dx, dy, distance)
for (int i = 0; i < N-1; ++i)
{
for (int j = i+1; j < N; ++j)
{
dx = (data[j].x - data[i].x);
dy = (data[j].y - data[i].y);
distance = (dx*dx) + (dy*dy);
if (distance <= epsilon_squared)
{
++counter;
}
}
}
The second algorithm makes use of distance from origin and trigonometry to perform the same operations. The problem is that the final result is off by a very small margin, typically between 2-4. The point array is sorted beforehand by distance from origin.
#pragma omp parallel for schedule(dynamic, CHUNK) num_threads(NB_THREADS) reduction(+:counter) private(a, b, theta, distance)
for (int i = 0; i < N-1; ++i)
{
//dfo = distance from origin
a = data[i].dfo;
for (int j = i+1; j < N; ++j)
{
b = data[j].dfo;
//find angle between point a, origin, point b
theta = acos(((data[i].x*data[j].x)+(data[i].y*data[j].y))/((sqrt(((data[i].x*data[i].x)+(data[i].y*data[i].y)))*(sqrt((data[j].x*data[j].x)+(data[j].y*data[j].y))))));
distance = (a*a) + (b*b) - (2*a*b*(cos(theta)));
if (distance <= epsilon_squared)
{
++counter;
} else {
if (abs(a-b)>epsilon)
{
break;
}
}
}
}
My Question: Can the operations in the second algorithm lead to a different distance result compared to the first algorithm? I have checked the results between the first and the second, and they seem to be completely identical. If there is a difference being created, what can I do to fix this? Thank you in advance.
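As a hedged illustration of why the two formulas can disagree even though they are algebraically equivalent, here is a small standalone sketch with made-up coordinates that prints both squared distances to full precision; the last few bits typically differ, which is enough to flip the comparison for pairs lying almost exactly at epsilon_squared:
#include <cmath>
#include <cstdio>

int main()
{
    // Made-up points purely for illustration.
    double x1 = 3.1, y1 = 4.7, x2 = 5.9, y2 = 1.3;

    // First algorithm: direct squared distance.
    double dx = x2 - x1, dy = y2 - y1;
    double direct = dx * dx + dy * dy;

    // Second algorithm: law of cosines via distances from the origin.
    double a = std::sqrt(x1 * x1 + y1 * y1);
    double b = std::sqrt(x2 * x2 + y2 * y2);
    double theta = std::acos((x1 * x2 + y1 * y2) / (a * b));
    double cosine = a * a + b * b - 2.0 * a * b * std::cos(theta);

    // Algebraically identical, numerically not: sqrt, acos and cos each round.
    std::printf("direct = %.17g\ncosine = %.17g\ndiff   = %.3g\n",
                direct, cosine, direct - cosine);
    return 0;
}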

Constructing distance matrix in parallel in C++11 using OpenMP

I would like to construct a distance matrix in parallel in C++11 using OpenMP. I have read various documentation, introductions, examples, etc. Yet, I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.
Let us begin with a simple serially executed function computing a dense Armadillo matrix:
// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
double dist {};
for(unsigned int i {0}; i < n_points; i++) {
for(unsigned int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
return dist_mat;
}
As a side note: this function is supposed to be called from R through the Rcpp interface, as indicated by the // [[Rcpp::export]]. Accordingly, the top of the file includes
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using namespace arma;
However, the function should also work fine without the R interface.
In an attempt to parallelize the code, I replace the loops with
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
and add n_threads as an argument to the compute_dist_mat function.
1. This distributes the iterations of the outer loop across threads, with the iterations of the inner loop executed by the respective thread handling the outer loop.
2. The two loop levels cannot be combined because the inner loop depends on the outer one.
3. dist, i, and j are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops (a sketch of the alternative follows below).
4. The # pragma line does not have any effect when n_threads = 1, inducing a serial execution.
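A minimal sketch of the alternative mentioned in statement 3, assuming the same compute_dist, coordinates, dist_mat, n_points, and n_threads as above: the loop variable of the parallel for is private by definition, and anything declared inside the construct is private automatically, so no private clause is needed.
# pragma omp parallel for num_threads(n_threads) if(n_threads > 1)
for(unsigned int i = 0; i < n_points; i++) {
    // i is the loop variable of the parallel for and is private by rule;
    // j and dist are declared inside the construct, so each thread gets its own copy.
    for(unsigned int j = i + 1; j < n_points; j++) {
        double dist = compute_dist(coordinates(i, 1), coordinates(j, 1),
                                   coordinates(i, 0), coordinates(j, 0));
        dist_mat.at(i, j) = dist;
        dist_mat.at(j, i) = dist;
    }
}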
Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
std::vector<double> dists;
std::vector<unsigned int> dist_i;
std::vector<unsigned int> dist_j;
double dist {};
for(unsigned long int i {0}; i < n_points; i++) {
for(unsigned long int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists.push_back(dist);
dist_i.push_back(i);
dist_j.push_back(j);
}
}
}
unsigned int mat_size = dist_i.size();
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int j {};
for(unsigned int i {0}; i < mat_size; i++) {
j = i * 2;
index_mat.at(0, j) = dist_i[i];
index_mat.at(1, j) = dist_j[i];
index_mat.at(0, j + 1) = dist_j[i];
index_mat.at(1, j + 1) = dist_i[i];
dists_vec.at(j) = dists[i];
dists_vec.at(j + 1) = dists[i];
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
Because the function does not know ex ante how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.
I parallelize the function as follows:
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
std::vector<std::vector<double>> dists(n_points);
std::vector<std::vector<unsigned int>> dist_j(n_points);
double dist {};
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists[i].push_back(dist);
dist_j[i].push_back(j);
}
}
}
std::vector<unsigned int> vec_intervals(n_points + 1);
vec_intervals[0] = 0;
for (i = 0; i < n_points; i++) {
vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
}
unsigned int mat_size {vec_intervals[n_points]};
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int vec_begins_i {};
unsigned int vec_length_i {};
unsigned int k {};
# pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
vec_begins_i = vec_intervals[i];
vec_length_i = vec_intervals[i + 1] - vec_begins_i;
for(j = 0; j < vec_length_i; j++) {
k = (vec_begins_i + j) * 2;
index_mat.at(0, k) = i;
index_mat.at(1, k) = dist_j[i][j];
index_mat.at(0, k + 1) = dist_j[i][j];
index_mat.at(1, k + 1) = i;
dists_vec.at(k) = dists[i][j];
dists_vec.at(k + 1) = dists[i][j];
}
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
5. Using dynamic vectors in the loop is thread-safe.
6. dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
7. Nothing has to be marked as a section.
Are any of the seven statements incorrect?
The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:
OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good of a job of this as any alternative code anyone may ever write.
Eigen provides better vectorization than Armadillo (across the board, I might add) which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.
This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:
// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
Eigen::MatrixXd dists(A.cols(), A.cols());
Eigen::VectorXd sq_colsums = Eigen::VectorXd::Zero(A.cols());
for (int col = 0; col < A.cols(); ++col)
for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
sq_colsums(col) += it.value() * it.value();
#pragma omp parallel for
for (unsigned int i = 0; i < (A.cols() - 1); ++i) {
for (unsigned int j = (i + 1); j < A.cols(); ++j) {
double dist = sq_colsums(i) + sq_colsums(j);
Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
while (it1 && it2) {
if (it1.row() < it2.row()) ++it1;
else if (it1.row() > it2.row()) ++it2;
else {
dist -= it1.value() * it1.value();
dist -= it2.value() * it2.value();
dist += std::pow(it1.value() - it2.value(), 2);
++it1; ++it2;
}
}
dists(i, j) = std::sqrt(dist);
dists(j, i) = dists(i, j);
}
}
dists.diagonal().array() = 1;
return dists;
}
As Dirk and others have said, there are packages out there (e.g. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to calculate efficiently in R without Rcpp using crossprod operations (see the qlcMatrix::cosSparse source code for algorithmic inspiration).

How can I parallelize my code about deleting overlapping spheres?

I'm trying to parallelize a piece of code. What my code does is check whether some spheres (defined by their coordinates xcentro, ycentro, zcentro and their radii r) overlap each other or not. If they overlap, I must delete them, but as I don't know how to delete a component of a vector (it's a mess with the indices and so on) I just set the radii to zero and do not take them into account later.
My problem comes when I try to parallelize the code. If I don't do it, it works properly (although the code is not efficient at all and I need to run it with millions of spheres). And if I try to parallelize it, I obtain several errors. For example, if I try to run the code exactly as written below, I obtain a segmentation fault. If I eliminate the private(...) part, I don't obtain any error, but I don't obtain the same results as without parallelization.
What can I be doing wrong?
Here's the code:
vector<double> xcentro, ycentro, zcentro, r;
r.reserve(34000000);
xcentro.reserve(34000000);
ycentro.reserve(34000000);
zcentro.reserve(34000000);
... read files and fill up xcentro ycentro zcentro r with data ...
//#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
for (size_t i = 0; i < r.size() - 1; i++)
{
//#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
for (size_t j = i + 1; j < r.size() - 1; j++)
{
auto dist_square = (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
+ (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
+ (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j]);
if ( dist_square < (r[i]+r[j])*(r[i]+r[j]) )
{
//set the radius of the j-th sphere to 0
r[j] = 0;
//set the radius of the i-th sphere to 0
r[i] = 0;
}
}
}
Okay, let's first consider an algorithm which actually works, i.e. obtains the subset of spheres with no overlap. To this end, we don't remove a sphere (before checking whether it overlaps with another one) but merely record that it has overlaps.
struct sphere { double R,X,Y,Z; };
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X)+square(a.Y-b.Y)+square(a.Z-b.Z) < square(a.R+b.R); }
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
std::vector<char> hasOverlap(S.size(), char(0));
vector<sphere> result;
for(size_t i=0; i<S.size(); ++i) {
for(size_t j=i+1; j<S.size(); ++j)
if((!hasOverlap[i] || !hasOverlap[j]) && overlap(S[i],S[j])) {
hasOverlap[i] = 1;
hasOverlap[j] = 1;
}
if(!hasOverlap[i])
result.push_back(S[i]);
}
return result;
}
This algorithm loops every pair of spheres once. Since the test between spheres k and l is done when i equals the smaller of k and l and j the larger, the executions of the loop over i are still not mutually independent: there is still a race condition. This can be removed by looping over each pair of spheres twice:
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
std::vector<char> hasOverlap(S.size(), char(0));
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i) {
bool overlapping = false;
for(size_t j=0; !overlapping && j<S.size(); ++j)
if(j!=i && overlap(S[i],S[j]))
overlapping = true;
hasOverlap[i] = overlapping;
}
vector<sphere> result;
for(size_t i=0; i<S.size(); ++i)
if(!hasOverlap[i])
result.push_back(S[i]);
return result;
}
Note also that, depending on the distribution of spheres, it can make the execution significantly faster if you first order the spheres in descending radius (largest spheres first), as in
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
Note further that this naive O(N^2) algorithm is not optimal. There is likely an O(N ln N) algorithm which first arranges the spheres in some data structure (perhaps a spatial tree) in O(N ln N) time and then finds whether a sphere is overlapping in no more than O(ln N) time per sphere.
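A minimal sketch of one such broad phase, using a uniform grid rather than a tree (the names cell_key and mark_overlaps_grid are made up; the sphere struct and the overlap() helper are the ones defined above). With a cell edge of twice the largest radius, two overlapping spheres always lie in the same or in adjacent cells, so each sphere is tested only against the candidates found in 27 cells:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Sketch only. Reuses: struct sphere { double R,X,Y,Z; }; and overlap(a, b),
// which returns true when the centre distance is smaller than the sum of radii.
std::vector<char> mark_overlaps_grid(std::vector<sphere> const& S)
{
    double maxR = 0;
    for (sphere const& s : S) maxR = std::max(maxR, s.R);
    const double cell = 2 * maxR;                     // cell edge: twice the largest radius

    auto cell_key = [](long ix, long iy, long iz) {   // simple spatial hash of integer cell indices
        return ((unsigned long long)ix * 73856093ull)
             ^ ((unsigned long long)iy * 19349663ull)
             ^ ((unsigned long long)iz * 83492791ull);
    };

    // Bucket the sphere indices by the cell containing their centre.
    std::unordered_map<unsigned long long, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < S.size(); ++i)
        grid[cell_key((long)std::floor(S[i].X / cell),
                      (long)std::floor(S[i].Y / cell),
                      (long)std::floor(S[i].Z / cell))].push_back(i);

    std::vector<char> hasOverlap(S.size(), 0);
    // The grid is only read below; each thread writes only its own element i.
    #pragma omp parallel for
    for (std::size_t i = 0; i < S.size(); ++i) {
        const long ix = (long)std::floor(S[i].X / cell);
        const long iy = (long)std::floor(S[i].Y / cell);
        const long iz = (long)std::floor(S[i].Z / cell);
        for (long dx = -1; dx <= 1 && !hasOverlap[i]; ++dx)
            for (long dy = -1; dy <= 1 && !hasOverlap[i]; ++dy)
                for (long dz = -1; dz <= 1 && !hasOverlap[i]; ++dz) {
                    auto it = grid.find(cell_key(ix + dx, iy + dy, iz + dz));
                    if (it == grid.end()) continue;
                    for (std::size_t j : it->second)
                        if (j != i && overlap(S[i], S[j])) { hasOverlap[i] = 1; break; }
                }
    }
    return hasOverlap;
}
For strongly varying radii a tree or a hierarchy of grids distributes the work better, since a single huge sphere inflates the cell size for everybody.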
Hereby, I answer the question you asked in the comment:
How could I increase the speed of my program?
The best is to completely change the algorithm (as already suggested), but if you do not wish to change it for any reason, you can gain ca. 20% speed by parallelizing the outer loop:
#pragma omp parallel for schedule(dynamic, r.size()/500)
for (size_t i = 0; i < r.size(); ++i)
{
for (size_t j = i + 1; j < r.size(); ++j)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
#pragma omp atomic write
overlaps[i] = 1;
#pragma omp atomic write
overlaps[j] = 1;
}
}
}
UPDATE:
Based on #Walter’s response and code, I created a simple algorithm that is significantly faster than your code. The basic idea is as follows: sort the data according to x values and determine the largest radius. For a given x value, it is not necessary to go through the entire range; it is enough to examine those x values that are closer than twice the largest radius. Thus, the number of loop cycles can be significantly reduced and the speed of the algorithm is increased by orders of magnitude. I tested the speed difference between your code and the new algorithm with the code below, using arrays filled with data of randomly created spheres. I created the algorithm so that you don't have to change the rest of your program; the new_algorithm function takes the data from the xcentro, ycentro, zcentro, r arrays and returns the indices of the overlapping spheres in the overlaps2 array. On Compiler Explorer a significant speed increase was observed:
size=20000
Runtime(your method)=1216 ms
Runtime(new algorithm)=13 ms
Note that this is a simple algorithm and it is easy to understand how it works, but better algorithms may be created based on your real data. Here is the code:
#include <iostream>
#include <vector>
#include <chrono>
#include <omp.h>
#include <algorithm>
using namespace std;
constexpr size_t N=10000;
std::vector<double> xcentro, ycentro, zcentro, r;
struct sphere { double X,Y,Z,R; size_t index; };
std::vector<sphere> Spheres;
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X)+square(a.Y-b.Y)+square(a.Z-b.Z) < square(a.R+b.R); }
void new_algorithm(const std::vector<double>& x, const std::vector<double>& y, const std::vector<double>& z, const std::vector<double>& r, std::vector<char>& overlaps)
{
const auto start = std::chrono::high_resolution_clock::now();
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort ascending X
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.X < b.X; });
// Clear overlaps and determine maximum r value
double maxr=-1;
for (size_t i = 0; i < S.size(); i++)
{
overlaps[i]=0;
if(S[i].R>maxr) maxr=S[i].R;
}
//Create a vector for maximum indices
std::vector<size_t> max_index(S.size(),0);
//Determine maximum_index
size_t j=1;
for (size_t i = 0; i < S.size(); i++)
{
while(j < S.size() && S[j].X - S[i].X < 2 * maxr) j++;
max_index[i]=j;
}
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
for(size_t j=i+1; j<max_index[i]; ++j)
if(overlap(S[i],S[j]))
{
#pragma omp atomic write
overlaps[S[i].index] = 1;
#pragma omp atomic write
overlaps[S[j].index] = 1;
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(new algorithm)=" << diff.count() << " ms\n";
}
void your_algorithm(std::vector<char>& overlaps)
{
size_t i,j;
const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
overlaps[i]=0;
}
for (i = 0; i < r.size(); i++)
{
#pragma omp parallel for
for (j = i + 1; j < r.size(); j++)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
overlaps[i] = 1;
overlaps[j] = 1;
}
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(your method)=" << diff.count() << " ms" << std::endl;
}
int main() {
std::vector<char> overlaps1, overlaps2;
r.reserve(N);
xcentro.reserve(N);
ycentro.reserve(N);
zcentro.reserve(N);
overlaps1.resize(N);
overlaps2.resize(N);
//fill the arrays with random numbers
for(size_t i=0; i<N; i++)
{
double x=(rand() % 1000)/10.0;
double y=(rand() % 1000)/10.0;
double z=(rand() % 1000)/10.0;
double R=(rand() % 10000)/((double)N ) + 0.1;
xcentro.push_back( x );
ycentro.push_back( y );
zcentro.push_back( z );
r.push_back(R);
}
std::cout << "size=" << r.size() << std::endl;
your_algorithm(overlaps1);
new_algorithm(xcentro,ycentro,zcentro,r,overlaps2);
// Check if array of overlap is the same for the 2 methods
for(size_t i=0; i<N; i++)
{
if(overlaps1[i]!=overlaps2[i])
{
cout << "error\n"; exit (-1);
}
}
cout << "OK\n";
}
UPDATE2: Here is the code mentioned in the comments (sort by R and remove only the bigger sphere):
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
overlaps[i]=0;
S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort descending R
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
for(size_t j=i+1; j<S.size(); ++j)
if(overlap(S[i],S[j]))
{
overlaps[S[i].index] = 1;
break;
}
}
Let us first improve your serial code a bit by avoiding looping over already deleted spheres:
for(size_t i = 0; i < r.size(); ++i)
if(r[i]>0) {
for(size_t j=i+1; j<r.size(); ++j)
if(r[j]>0 && (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
+ (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
+ (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])
< (r[i]+r[j])*(r[i]+r[j]) ) {
r[j] = 0;
r[i] = 0;
}
}
This immediately shows you that the execution of the outer loop depends on all previous executions at smaller index, since these may have removed some of the spheres. This interdependence of the loop executions implies that your algorithm cannot be straightforwardly parallelized (in the way you attempted it).
Also, you have race conditions in the variables r[], which are read and written to. Your naive parallelization didn't take care of that problem either.
Just in case someone is still interested, I've improved my code and now it's more efficient (although not enough for me yet), and now it does eliminate overlapping spheres:
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
overlaps[i]=0;
}
cout << "overlaps igualados a cero..." << endl;
//It remains to see which spheres overlap and eliminate them. First I am going to check which spheres overlap, and afterwards I set the radius of those
//that overlap to zero.
double cero = 0.0;
for (i = 0; i < r.size(); i++)
{
contador=0;
#pragma omp parallel for reduction(+:contador)
for (j = i + 1; j < r.size(); j++)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
contador++;
overlaps[i] = contador;
overlaps[j]=contador;
}
}
}
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
if(overlaps[i]!=0)
{
r[i]=0;
}
}

Why omp version is slower than serial?

It's a follow-up question to this one
Now I have the code:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a, b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
const int p = 4;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
double maxdiffA[p + 1];
int icol, jrow;
int main() {
omp_set_num_threads(p);
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
#pragma omp parallel for private(icol) shared(x, y, v, _new)
for (icol = 0; icol <= n + 1; ++icol) {
x[icol] = y[icol] = icol * h;
_new[icol][0] = v[icol][0] = 6 - 2 * x[icol];
_new[n + 1][icol] = v[n + 1][icol] = 4 - 2 * y[icol];
_new[icol][n + 1] = v[icol][n + 1] = 3 - x[icol];
_new[0][icol] = v[0][icol] = 6 - 3 * y[icol];
}
const double eps = 0.01;
#pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA)
{
while (true) { //for [iters=1 to maxiters by 2]
#pragma omp single
for (int i = 0; i < p; i++) maxdiffA[i] = 0;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
_new[icol][jrow] =
(v[icol - 1][jrow] + v[icol + 1][jrow] + v[icol][jrow - 1] + v[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
v[icol][jrow] = (_new[icol - 1][jrow] + _new[icol + 1][jrow] + _new[icol][jrow - 1] +
_new[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
maxdiffA[omp_get_thread_num()] = max(maxdiffA[omp_get_thread_num()],
fabs(_new[icol][jrow] - v[icol][jrow]));
#pragma omp barrier
double maxdiff = 0.0;
for (int k = 0; k < p; ++k) {
maxdiff = max(maxdiff, maxdiffA[k]);
}
if (maxdiff < eps)
break;
#pragma omp barrier
//#pragma omp single
//std::cout << maxdiff << std::endl;
}
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
But why does it run 2-3 times slower (32 sec vs 18 sec) than the serial analog:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a,b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
int main() {
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
for (int i = 0; i <= n + 1; ++i) {
x[i] = y[i] = i * h;
_new[i][0]=v[i][0] = 6 - 2 * x[i];
_new[n + 1][i]=v[n + 1][i] = 4 - 2 * y[i];
_new[i][n + 1]=v[i][n + 1] = 3 - x[i];
_new[0][i]=v[0][i] = 6 - 3 * y[i];
}
const double eps=0.01;
while(true){ //for [iters=1 to maxiters by 2]
double maxdiff=0.0;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
_new[i][j]=(v[i-1][j]+v[i+1][j]+v[i][j-1]+v[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
v[i][j]=(_new[i-1][j]+_new[i+1][j]+_new[i][j-1]+_new[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
maxdiff=max(maxdiff, fabs(_new[i][j]-v[i][j]));
if(maxdiff<eps) break;
std::cout << maxdiff<<std::endl;
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
Also interesting: it runs in the SAME TIME as a version (I can post it here if you ask) which looks like this
while(true){ //106 iterations here!!!
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
}
But I thought that what makes the OMP code slow is spawning threads inside the while loop 106 times... But no! Then probably threads simultaneously write to the same array cells... But where does that happen? I don't see it; could you show me, please?
Maybe it's because of too many barriers? But the lecturer told me to implement the code like this and "analyse it". Maybe the answer is "the Jacobi algorithm isn't meant to run well in parallel"? Or is it just my lame coding?
So the root of evil was
max(maxdiffA[w],fabs(_new[icol][jrow] - v[icol][jrow]))
because it's
#define max(a, b) (a)>(b)?(a):(b)
It's probably creating TOO much branching ('if's). Without this, the parallel version works 8 times faster, loading the CPU at 68% instead of 99%.
The strange thing: the same "max" doesn't affect the serial version.
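A hedged sketch of what the fix can look like: drop the #define entirely (a function-like macro named max would also break any later call to std::max) and use std::max, which is a real function and evaluates each argument exactly once, so fabs is not recomputed.
#include <algorithm>   // std::max
#include <cmath>       // std::fabs
#include <cstdio>

// Sketch: remove the "#define max(a, b) (a)>(b)?(a):(b)" line and keep a running
// maximum with std::max instead; each argument is evaluated exactly once.
int main()
{
    double maxdiff = 0.0;
    const double samples[] = { 0.3, -1.7, 0.9, -0.2 };   // stand-ins for _new[i][j] - v[i][j]
    for (double s : samples)
        maxdiff = std::max(maxdiff, std::fabs(s));
    std::printf("maxdiff = %g\n", maxdiff);
    return 0;
}
In the parallel loop the update then reads maxdiffA[omp_get_thread_num()] = std::max(maxdiffA[omp_get_thread_num()], std::fabs(_new[icol][jrow] - v[icol][jrow]));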
I am writing to make you aware of a few situations. It is too long to write in a comment, so I decided to write it as an answer.
Every time a thread is created, it takes some time. If your program's running time on a single core is short, then the creation of threads will make this time longer for multi-core.
Plus, using a barrier makes all your threads wait for the others, which can slow things down: even if most threads finish the job very fast, the last one will make the total run time longer.
Try to run your program with bigger arrays, where the single-threaded time is around 2 minutes, then make your way to multi-core.
Then try to wrap your main code in a normal loop to run it a few times and print the timings for each run. The first run of the loop might be slow because of loading libraries, but the next runs should be faster, showing the speed-up.
If the above suggestions do not give a result, then it means your code needs more editing.
EDIT:
To downvoters: if you don't like a post, please at least be polite and leave a comment. Or better, give your own answer and be helpful to the community.

Optimize outer loop with OpenMP and a reduction

I'm struggling a bit with a function. The calculation is wrong if I try to parallelize the outer loop with a
#pragma omp parallel reduction(+:det).
Can someone show me how to solve it and why it is failing?
// template<class T> using vector2D = std::vector<std::vector<T>>;
float Det(vector2DF &a, int n)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
for (int i = 0; i < n; i++)
{
int l = 0;
#pragma omp parallel for private(l)
for (int j = 1; j < n; j++)
{
l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
If you parallelize the outer loop, there is a race condition on this line:
m[j - 1][l] = a[j][k];
Also you likely want a parallel for reduction instead of just a parallel reduction.
The issue is that m is shared, even though that wouldn't be necessary, given that it is completely overwritten in the inner loop. Always declare variables as locally as possible; this avoids issues with wrongly shared variables, e.g.:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel for reduction(+:det)
for (int i = 0; i < n; i++)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
Now that is correct, but since m can be expensive to allocate, performance could benefit from not doing it in each and every iteration. This can be done by splitting parallel and for directives as such:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel reduction(+:det)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
#pragma omp for
for (int i = 0; i < n; i++)
{
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
}
return det;
}
Now you could also just declare m as firstprivate, but that would assume that the copy constructor makes a completely independent deep-copy and thus make the code more difficult to reason about.
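For completeness, a sketch of that firstprivate variant (not taken from the code above; it relies on std::vector's copy constructor making a deep copy, so each thread starts from its own independent m):
float Det(vector2DF &a, int n)
{
    if (n == 1) return a[0][0];
    if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
    float det = 0;
    vector2DF m(n - 1, vector1DF(n - 1, 0));     // allocated once, before the parallel region
    #pragma omp parallel for reduction(+:det) firstprivate(m)
    for (int i = 0; i < n; i++)
    {
        for (int j = 1; j < n; j++)
        {
            int l = 0;
            for (int k = 0; k < n; k++)
            {
                if (k == i) continue;
                m[j - 1][l] = a[j][k];            // m is this thread's own deep copy
                l++;
            }
        }
        det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
    }
    return det;
}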
Please be aware that you should always include expected output, actual output and a minimal complete and verifiable example.