How to parallelize histogram addition? - c++

I have an algorithm which scans an image and, for each pixel p, calculates a 256-bin histogram that stores the values of the pixels inside a patch around p. The algorithm needs to be O(1) per pixel, so I need to do many histogram additions. I'd like to make the algorithm faster by parallelizing the histogram addition with OpenMP, so I added #pragma omp parallel for before each for loop (just the ones with histogram additions), but it actually makes it 10 times slower. I think I need to create a parallel region outside the loops, but I don't understand how.
Also, I'm afraid the overhead generated by OpenMP outweighs the speed gained by parallelizing a 256-iteration for loop, but I don't know for sure.
for (int i = 0; i < src.rows; i++) {
    for (int j = 0; j < src.cols; j++) {
        if (j == 0)
        { ... }
        else {
            if (j > side/2) { // subtract col
                for (int h = 0; h < 256; h++) // THIS ONE
                    histogram[h] -= colHisto[j - (side/2) - 1][h];
            }
            if (j < src.cols - side/2) { // add column
                if (i > side/2) { // subtract pixel
                    colHisto[j + side/2][src.at<uchar>(i - side/2 - 1, j + side/2)]--;
                }
                if (i < src.rows - side/2) { // add pixel
                    colHisto[j + side/2][src.at<uchar>(i + side/2, j + side/2)]++;
                }
                for (int h = 0; h < 256; h++) // AND THIS ONE
                    histogram[h] += colHisto[j + side/2][h];
            }
        }
    }
}

I actually solved it myself by studying OpenMP more. Here is the code:
#pragma omp parallel
for (int i = 0; i < src.rows; i++) {
    for (int j = 0; j < src.cols; j++) {
        // printf("%d%d:", i, j);
        if (j == 0) { ... }
        else {
            #pragma omp single
            { ... }
            one = getTickCount();
            #pragma omp for
            for (int h = 0; h < 256; h++)
                histogram[h] += colHisto[j + side / 2][h];
            printf("histotime = %d\n", getTickCount() - one);
        }
    }
}
It's significantly faster than putting #pragma omp parallel for before each loop, but still slower than the sequential version.
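On the overhead concern: adding 256 integers is a very small amount of work, and it is typically much cheaper than handing loop chunks to a team of threads and synchronizing at the end of every omp for. One alternative that is not from the original post is to keep the bin loop on a single thread and ask the compiler to vectorize it with #pragma omp simd. The sketch below assumes each histogram is stored as a std::vector<int>; the names addHistogram and subtractHistogram are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst[h] += src[h] for every bin. The simd pragma only requests
// vectorization; no threads are created. Compilers without OpenMP
// support ignore the pragma and run the plain loop.
inline void addHistogram(std::vector<int>& dst, const std::vector<int>& src) {
    assert(dst.size() == src.size());
    #pragma omp simd
    for (std::size_t h = 0; h < dst.size(); ++h)
        dst[h] += src[h];
}

// dst[h] -= src[h] for every bin, used when a column leaves the patch.
inline void subtractHistogram(std::vector<int>& dst, const std::vector<int>& src) {
    assert(dst.size() == src.size());
    #pragma omp simd
    for (std::size_t h = 0; h < dst.size(); ++h)
        dst[h] -= src[h];
}
```

If threads are still wanted, they are more likely to pay off at a coarser level, for example one thread per band of image rows, each band with its own histogram and column histograms; that restructuring is not shown here.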


Is there a way to parallelize a lower triangle matrix solver?

The goal is to add OpenMP parallelization to the for (i = 0; i < n; i++) loop of a lower-triangular solver for systems of the form Ax = b. The expected result is exactly the same as the result when NO parallelization is added to for (i = 0; i < n; i++).
vector<vector<double>> represents a 2-D matrix. makeMatrix(int m, int n) initializes a vector<vector<double>> of all zeroes of size m x n.
The two most promising attempts have been left in comments.
vector<vector<double>> lowerTriangleSolver(vector<vector<double>> A, vector<vector<double>> b)
{
    vector<vector<double>> x = makeMatrix(A.size(), 1);
    int i, j;
    int n = A.size();
    double s;
    //#pragma omp parallel for reduction(+: s)
    //#pragma omp parallel for shared(s)
    for (i = 0; i < n; i++)
    {
        s = 0.0;
        #pragma omp parallel for
        for (j = 0; j < i; j++)
        {
            s = s + A[i][j] * x[j][0];
        }
        x[i][0] = (b[i][0] - s) / A[i][i];
    }
    return x;
}
You could try to assign the outer loop iterations among threads, instead of the inner loop. In this way, you increase the granularity of the parallel tasks and avoid the reduction of the 's' variable.
#pragma omp parallel for
for (int i = 0; i < n; i++){
    double s = 0.0;
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
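Unfortunately, that is not possible because there is a dependency between s = s + A[i][j] * x[j][0]; and x[i][0] = (b[i][0] - s) / A[i][i];. More precisely, x[i][0] depends on every x[j][0] with j < i, and those values are produced by earlier iterations of the outer loop, so the outer iterations cannot run concurrently.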
So you can try two approaches:
for (int i = 0; i < n; i++){
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
or using SIMD:
for (int i = 0; i < n; i++){
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
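For reference, here is a self-contained sketch of the SIMD variant as a complete function. It is not the poster's code: the name forwardSubstitute is made up, and x and b are plain std::vector<double> instead of n x 1 matrices to keep the example short. One caveat about the "exactly the same result" requirement: a reduction is allowed to add the terms in a different order than the serial loop, so the result can differ in the last bits for general floating-point input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves L * x = b for a lower-triangular L by forward substitution.
// Rows are processed strictly in order because x[i] needs every x[j]
// with j < i; only the dot product inside one row is a reduction.
std::vector<double> forwardSubstitute(const std::vector<std::vector<double>>& L,
                                      const std::vector<double>& b) {
    const std::size_t n = L.size();
    assert(b.size() == n);
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double s = 0.0;
        #pragma omp simd reduction(+:s)
        for (std::size_t j = 0; j < i; ++j)
            s += L[i][j] * x[j];
        x[i] = (b[i] - s) / L[i][i];
    }
    return x;
}
```

For L = {{2,0,0},{1,3,0},{4,5,6}} and b = {2, 7, 32} the solution is x = {1, 2, 3}, which is easy to check by hand.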

How to add OpenMP to triple nested for-loop

The goal is to add as much OpenMP as possible to the following Cholesky factor function to increase parallelization. So far, I only have one #pragma omp parallel for implemented correctly. vector<vector<double>> represents a 2-D matrix. I've already tried adding #pragma omp parallel for to for (int i = 0; i < n; ++i), for (int k = 0; k < i; ++k), and for (int j = 0; j < k; ++j), but the parallelization goes wrong. makeMatrix(n, n) initializes a vector<vector<double>> of all zeroes of size n x n.
vector<vector<double>> cholesky_factor(vector<vector<double>> input)
{
    int n = input.size();
    vector<vector<double>> result = makeMatrix(n, n);
    for (int i = 0; i < n; ++i)
    {
        for (int k = 0; k < i; ++k)
        {
            double value = input[i][k];
            for (int j = 0; j < k; ++j)
                value -= result[i][j] * result[k][j];
            result[i][k] = value / result[k][k];
        }
        double value = input[i][i];
        #pragma omp parallel for
        for (int j = 0; j < i; ++j)
            value -= result[i][j] * result[i][j];
        result[i][i] = std::sqrt(value);
    }
    return result;
}
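I don't think you can parallelize much more than this with this algorithm, as the ith iteration of the outer loop depends on the results of the (i - 1)th iteration, and the kth iteration of the inner loop depends on the results of the (k - 1)th iteration. Only the innermost j loops have independent iterations, and because every iteration updates value, they need a reduction clause: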
vector<vector<double>> cholesky_factor(vector<vector<double>> input)
{
    int n = input.size();
    vector<vector<double>> result = makeMatrix(n, n);
    for (int i = 0; i < n; ++i)
    {
        for (int k = 0; k < i; ++k)
        {
            double value = input[i][k];
            // reduction(-: value) does the same
            // (private instances of value are initialized to zero and
            // added to the initial instance of value when the threads are joining)
            #pragma omp parallel for reduction(+: value)
            for (int j = 0; j < k; ++j)
                value -= result[i][j] * result[k][j];
            result[i][k] = value / result[k][k];
        }
        double value = input[i][i];
        #pragma omp parallel for reduction(+: value)
        for (int j = 0; j < i; ++j)
            value -= result[i][j] * result[i][j];
        result[i][i] = std::sqrt(value);
    }
    return result;
}
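The comment about reduction(+: value) combined with value -= ... can look surprising, so here is a small isolated sketch of just that step. The helper name subtractDot and its parameters are hypothetical and only serve the illustration. Each thread's private copy of value starts at 0 and collects negative products; when the loop ends, the private copies are added to the original value, which gives init minus the dot product.

```cpp
#include <cassert>
#include <vector>

// Returns init - sum_{j < count} a[j] * b[j], i.e. the update that the
// Cholesky code applies to `value` before dividing or taking the root.
double subtractDot(double init, const std::vector<double>& a,
                   const std::vector<double>& b, int count) {
    assert(count >= 0);
    assert(static_cast<std::size_t>(count) <= a.size());
    assert(static_cast<std::size_t>(count) <= b.size());
    double value = init;
    // Private copies start at 0, accumulate -a[j]*b[j], and are summed
    // into the original `value` when the threads join.
    #pragma omp parallel for reduction(+:value)
    for (int j = 0; j < count; ++j)
        value -= a[j] * b[j];
    return value;
}
```

With a = {1, 2, 3} and b = {4, 5, 6}, subtractDot(10.0, a, b, 3) yields 10 - 32 = -22. For the short inner loops that occur at small i and k, starting a thread team per reduction may well cost more than it saves, so measuring against the serial version is advisable.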

OpenMP Parallel Problem Error for Double Loop

I was getting the error: "free(): corrupted unsorted chunks" when trying to run:
#pragma omp parallel for reduction(+:save) shared(save2)
for (size_t i = 0; i <= N; ++i) {
    vector<float> dist = cdist(i, arestas);
    vector<float> distinv(dist.size());
    for (size_t j = 0; j < N(); ++j) {
        if (arr[j] > 0)
            arrv[j] = (1/N) + (1 / arr[j]);
        else
            arrv[j] = 0;
    }
    save = accumulate(arrv.begin(), arrv.end(), 0.0);
    vector<double>::iterator iter = save2.begin() + i;
    save2.insert(iter, sum);
}
I might miss the point here, but what about just doing it this way (not tested)?
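vector<double> sum2(N);
#pragma omp parallel for num_threads(8)
for ( size_t i = 0; i < N; i++ ) {
    vector<float> dist = cdist(i, arestas); // as in the question; must be safe to call from several threads
    double sum = 0;
    for ( size_t j = 0; j < dist.size(); ++j ) {
        if ( dist[j] > 0 ) {
            sum += 1. / dist[j];
        }
    }
    sum2[i] = sum; // each iteration writes only its own element
}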
There is still some room for improving this version (by removing the if statement, for example, in order to help vectorization), but unless you had some unexplained constraints in your code, I think this version is a good starting point.
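The likely source of the "free(): corrupted unsorted chunks" error is that std::vector::insert is not safe to call on the same vector from several threads: it shifts elements and may reallocate while other threads are doing the same. The sketch below restates the answer's idea in a self-contained form. The function name inverseSums is hypothetical, and rows[i] stands in for the result of cdist(i, arestas), whose definition is not shown in the question.

```cpp
#include <cassert>
#include <vector>

// For each row, sums 1/d over the strictly positive distances d.
// The output vector is sized once, before the parallel loop, so no
// thread ever inserts into it or triggers a reallocation.
std::vector<double> inverseSums(const std::vector<std::vector<float>>& rows) {
    const int n = static_cast<int>(rows.size());
    std::vector<double> out(n, 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;                 // private to this iteration
        for (float d : rows[i])
            if (d > 0.0f)
                sum += 1.0 / d;
        out[i] = sum;                     // distinct element per iteration
    }
    return out;
}
```

Because every iteration owns its own sum and its own output slot, neither a reduction clause nor a shared clause is needed.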

Deadlock on parallel loop

I'm trying to parallelize the code below. It's easy to see that there is a dependency between the values of aux, since they are computed after the inner loop, but they are needed inside that inner loop (note that on the first iteration j = 0, the code inside the inner loop is not executed). On the other hand, there is no dependency between the values of mu because we only update mu[k], but the only values needed for other computations are in mu[j], for 0 <= j < k.
My approach consists in having the elements of aux locked until they are computed. As soon as a given value of aux is computed, the lock of that element is released and every thread can use it. However, with this code a deadlock occurs and I can't figure out why. Does someone have any tips?
for (j = 0; j < k; ++j)
    locks[j] = 0;
#pragma omp parallel for num_threads(N_THREADS) private(j, i)
for (j = 0; j < k; ++j)
{
    vals[j] = (long)0;
    for (i = 0; i < j; i++)
    {
        while (!locks[i]);   // wait until aux[i] has been computed
        vals[j] += mu[j][i] * aux[i];
    }
    aux[j] = (s[j] - vals[j]);
    locks[j] = 1;            // release: aux[j] is now available
    mu[k][j] = aux[j] / c[j];
}
Does it also hang when not optimized?
In optimized code, gcc would not bother reading locks[i] more than once, so this:
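for (i = 0; i < j; i++) {
    while (!locks[i]);
    vals[j] += mu[j][i] * aux[i];
}
would be like writing:
for (i = 0; i < j; i++) {
    if( !locks[i] ) for(;;) {}
    vals[j] += mu[j][i] * aux[i];
}
Try adding a compiler memory barrier to force gcc to re-read locks[i] on every pass of the wait loop:
#define pause() do { asm volatile("pause;":::"memory"); } while(0)
for (i = 0; i < j; i++) {
    while(!locks[i]) pause();
    vals[j] += mu[j][i] * aux[i];
}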
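The inline-assembly barrier is specific to gcc-style compilers on x86. A more portable way to get the same effect, sketched below, is to make the flags std::atomic<int> and pair a release store with an acquire load. This is an alternative technique and not what the answer proposes; the function name computeAux is made up, the mu[k][j] update is left out, and using C++ atomics inside an OpenMP region is assumed to behave as it does with the common compilers.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// aux[j] = s[j] - sum_{i < j} mu[j][i] * aux[i], one "ready" flag per element.
std::vector<double> computeAux(const std::vector<std::vector<double>>& mu,
                               const std::vector<double>& s) {
    const int k = static_cast<int>(s.size());
    assert(static_cast<int>(mu.size()) == k);
    std::vector<double> aux(k, 0.0);
    std::vector<std::atomic<int>> ready(k);
    for (auto& r : ready)
        r.store(0, std::memory_order_relaxed);

    // schedule(static, 1) deals iterations out round-robin, so each thread
    // visits its own indices in increasing order and low indices start first.
    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < k; ++j) {
        double acc = 0.0;
        for (int i = 0; i < j; ++i) {
            // Acquire load: re-read on every spin; once it sees 1,
            // the write to aux[i] is guaranteed to be visible.
            while (ready[i].load(std::memory_order_acquire) == 0) { }
            acc += mu[j][i] * aux[i];
        }
        aux[j] = s[j] - acc;
        ready[j].store(1, std::memory_order_release);  // publish aux[j]
    }
    return aux;
}
```

Busy-waiting like this still serializes most of the work, since iteration j cannot finish before iteration j - 1 has published its value, so the speedup over the serial loop may be small.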

OpenMP: Nested for-loop, barely any difference in execution time

I am doing some image processing and have a nested for loop. I want to implement multiprocessing using OpenMP. The for loop looks like this, where I have added the pragma tags and declared some of the variables private as well.
int a,b,j, idx;
#pragma omp parallel for private(b,j,sumG,sumGI)
for(a = 0; a < ny; ++a)
{
    for(b = 0; b < nx; ++b)
    {
        idx = a*ny+b;
        if (imMask[idx] == 0)
        {
            Wshw[idx] = 0;
            continue;   // masked pixel: nothing more to compute
        }
        sumG = 0;
        sumGI = 0;
        for(j = a; j < ny; ++j)
        {
            sumG += shadowM[j-a];
            sumGI += shadowM[j-a] * imBlurred[nx*j + b];
        }
        Wshw[idx] = sumGI / sumG;
    }
}
Both nx and ny are large, and I thought that, using OpenMP, I would get a decent decrease in execution time; instead there is almost no difference. Am I doing something wrong in how I implement the multi-threading, maybe?
You have a race condition on idx. You need to make it private as well.
However, you could instead try something like this:
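int a,b,j, idx;
#pragma omp parallel for private(a,b,j,sumG,sumGI)
for(idx=0; idx<ny*nx; ++idx) {
    a = idx / nx;   // row, assuming the row-major layout idx = a*nx + b
    b = idx % nx;   // column
    if (imMask[idx] == 0) {
        Wshw[idx] = 0;
        continue;   // masked pixel: nothing more to compute
    }
    sumG = 0;
    sumGI = 0;
    for(j = a; j < ny; ++j) {
        sumG += shadowM[j-a];
        sumGI += shadowM[j-a] * imBlurred[nx*j + b];
    }
    Wshw[idx] = sumGI / sumG;
}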
You might be able to simplify the inner loop as well, as a function of idx instead of a and b.
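A self-contained sketch of the flattened loop follows. It keeps the variable names from the question, but the function name shadowWeights, the container types, and the row-major layout (index = nx * row + column, matching imBlurred[nx*j + b]) are assumptions. Declaring a, b, sumG and sumGI inside the loop body makes them private automatically, so no private clause is needed and the race on idx cannot occur.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<double> shadowWeights(int nx, int ny,
                                  const std::vector<int>& imMask,
                                  const std::vector<double>& shadowM,
                                  const std::vector<double>& imBlurred) {
    const int total = nx * ny;
    assert(imMask.size() == static_cast<std::size_t>(total));
    assert(imBlurred.size() == static_cast<std::size_t>(total));
    assert(shadowM.size() >= static_cast<std::size_t>(ny));
    std::vector<double> Wshw(total, 0.0);

    #pragma omp parallel for
    for (int idx = 0; idx < total; ++idx) {
        if (imMask[idx] == 0)
            continue;                 // masked pixels keep the weight 0
        const int a = idx / nx;       // row
        const int b = idx % nx;       // column
        double sumG = 0.0, sumGI = 0.0;
        for (int j = a; j < ny; ++j) {
            sumG  += shadowM[j - a];
            sumGI += shadowM[j - a] * imBlurred[nx * j + b];
        }
        Wshw[idx] = sumGI / sumG;
    }
    return Wshw;
}
```

Pixels in the top rows run a much longer inner loop than pixels near the bottom, so if the threads turn out to be unevenly loaded, a clause such as schedule(dynamic) with a suitable chunk size may be worth trying.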