I'm currently trying to parallelize a particle swarm algorithm, and can't find an efficient way to handle an assignment (=) inside a loop. I didn't use the simplest form of just putting #pragma omp parallel for at the start of the loop, since I expect that would just lead to false sharing problems. The whole thing is complicated by the variables being matrices and vectors.
I think this minimal example (using Armadillo, a linear algebra library) shows it better than I can describe it:
#include <omp.h>
#include <math.h>
#include <cstdlib>
#include <armadillo>
using namespace std;
int main(int argc, char** argv) {
    arma::uword dimensions = 10, particleCount = 40;
    //Matrix with dimensions-rows, and particleCount-columns. initialized with 1
    arma::Mat<double> positions = arma::ones(dimensions, particleCount);
    //same as above, but initialized with 2. +,/,*,- are elementwise;like in matlab
    arma::Mat<double> velocities = arma::ones(dimensions, particleCount) * 2;
    for(arma::uword n = 0; n < particleCount; n++) {
        //.col(n) gets the nth column of the matrix
        arma::Col<double> newVelocity = std::rand() * velocities.col(n);
        //there is a lot more math done here, but all of it is read-only access to other variables
        positions.col(n) += newVelocity; //again elementwise
        velocities.col(n) = newVelocity;
    }
    return 0;
}
My first idea was to do something like this, but it's horribly inefficient:
#include <omp.h>
#include <math.h>
#include <cstdlib>
#include <armadillo>
using namespace std;
int main(int argc, char** argv) {
    arma::uword dimensions = 10, particleCount = 40;
    //these two variables cannot be moved inside the parallel region :-/
    arma::Mat<double> positions = arma::ones(dimensions, particleCount);
    arma::Mat<double> velocities = arma::ones(dimensions, particleCount) * 2;
    #pragma omp parallel
    {
        arma::Mat<double> velocity_private = arma::zeros(dimensions,particleCount);
        #pragma omp for
        for(arma::uword n = 0; n < particleCount; n++) {
            arma::Col<double> newVelocity = std::rand() * velocities.col(n);
            velocity_private.col(n) = newVelocity;
        }
        #pragma omp single
        {
            //first part of workaround for '='
            velocities = arma::zeros(dimensions, particleCount);
        }
        #pragma omp critical
        {
            for(arma::uword n = 0; n < particleCount; n++) {
                positions.col(n) += velocity_private.col(n);
                //second part of workaround for '='
                velocities.col(n) += velocity_private.col(n);
            }
        }
    }//end omp parallel
    return 0;
}
I thought about using user-defined reductions, but I didn't find any example for assignments. All of them were for additions or multiplications, and unfortunately not very accessible.
Any suggestion(s) or advice appreciated! :)


Random generation with TRNG

For the following code, which generates random numbers for a Monte Carlo simulation, I need to receive exactly the same sum for each run, but this does not happen, although I have fixed the seed. I would appreciate it if anyone could point out the problem with this code:
#include <cmath>
#include <random>
#include <iostream>
#include <chrono>
#include <cfloat>
#include <iomanip>
#include <cstdlib>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/mt19937_64.hpp>
#include <trng/uniform01_dist.hpp>
using namespace std;
using namespace chrono;
const double landa = 1;
const double exact_solution = landa / (pow(landa, 2) + 1);
double function(double x) {
    return cos(x) / landa;
}
int main() {
int rank;
const int N = 1000000;
double sum = 0.0;
trng::yarn2 r[6];
for (int i = 0; i <6; i++)
for (int i = 0; i < 6; i++)
trng::uniform01_dist<double> u;
auto start = high_resolution_clock::now();
#pragma omp parallel num_threads(6)
{
    rank=omp_get_thread_num();
    #pragma omp for reduction (+: sum)
    for (int i = 0; i<N; ++i) {
        //double x = distribution(g);
        double x= u(r[rank]);
        x = (-1.0 / landa) * log(1.0 - x);
        sum = sum+function(x);
    }
}
double app = sum / static_cast<double> (N);
auto end = high_resolution_clock::now();
auto diff=duration_cast<milliseconds>(end-start);
cout << "Approximation is: " <<setprecision(17) << app << "\t"<<"Time: "<< setprecision(17) << diff.count()<<" Error: "<<(app-exact_solution)<< endl;
return 0;
}
TL;DR The problem is two-fold:
Floating point addition is not associative;
You are generating different random numbers for each thread.
I need to receive the exact sum for each run, but this will not
happen, although I have fixed the seed. I would appreciate it if
anyone could point out the problem with this code
First, you have a race condition on rank=omp_get_thread_num();. The variable rank is shared among all threads; to fix that, declare rank inside the parallel region, which makes it private to each thread:
#pragma omp parallel num_threads(6)
{
    int rank=omp_get_thread_num();
    ...
}
In your code, you should not expect the value of sum to be the same for different numbers of threads. Why? Because you are adding doubles in parallel:
double sum = 0.0;
#pragma omp for reduction (+: sum)
for (int i = 0; i<N; ++i) {
    //double x = distribution(g);
    double x= u(r[rank]);
    x = (-1.0 / landa) * log(1.0 - x);
    sum = sum+function(x);
}
and from What Every Computer Scientist Should Know About Floating-Point Arithmetic one can read:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
Hence, floating-point addition is not associative, which is why different numbers of threads can produce different values of sum.
You are generating different random values per thread:
for (int i = 0; i < 6; i++)
Consequently, for different number of threads, the variable sum
gets different results as well.
As kindly pointed out by jérôme-richard in the comments:
Note that a more precise algorithm like the Kahan summation can
significantly reduce the rounding issue while being still relatively

Why does this code (with OpenMP in a MATLAB MEX-file) give different results?

I am using OpenMP in building a MEX-file for MATLAB. I found that my code gives different results when using OpenMP for acceleration. I made a simple example of it below. It is supposed to calculate the mean of every vector. Every element in every vector is 1, so the result is supposed to be an array of 1s. But the result sometimes contains other numbers, like 0.333, 0.666, or 0. I thought it must be related to the OpenMP for loop, but I can't figure it out. Any suggestion or idea will be appreciated.
#include "mex.h"
#include <vector>
#include <iostream>
#include <algorithm>
#include <numeric>
#include <omp.h>
using namespace std;
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
int xx=8;
int yy[]={2,3,4,5,6,7,8,9};
vector<vector<double>> data(xx);
int i,ii;
#pragma omp parallel
#pragma omp for
for (i = 0; i < xx; i++) {
for (ii = 0; ii < yy[i]; ii++) {
mean0[i] =accumulate( data[i].begin(), data[i].end(), 0.0)/data[i].size();
// output
plhs[0] = mxCreateDoubleMatrix(mean0.size(), 1, mxREAL);
copy(mean0.begin(), mean0.end(), mxGetPr(plhs[0]));
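You have declared int i,ii; before the parallel section. This causes these variables to be shared. OpenMP automatically makes the loop variable of the omp for loop (i) private, but the inner counter ii stays shared, so all threads race on it.
Since you are using C++, declare variables where you first initialize them. In the case of loop variables, this looks like this: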
for (int i = 0; i < xx; i++) {
for (int ii = 0; ii < yy[i]; ii++) {
mean0[i] = ...
This improves readability of the code, and also fixes your problem with OpenMP.
By the way, the loop above can also be written with a single call to std::fill.

Getting speed improvement with OpenMP in nested for loops with dependencies

I am trying to implement a procedure in parallel with OpenMP. It contains four-level nested (dependent) for loops and has a variable sum_p that is updated in the innermost loop. In short, my question is about the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
    bin_i = test_map[i]; //test_map is a "STL map of vectors"
    len_rank_bin_i = bin_i.size(); // bin_i is a vector
    for (int j = (i - 1); j >= 0; --j) {
        bin_j = test_map[j];
        len_rank_bin_j = bin_j.size();
        for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
            node_u = bin_i[u_i]; //node_u is a scalar
            for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                node_v = bin_j[v_i];
                if (node_u> node_v)
                    sum_p += 1;
            }
        }
    }
}
The full program is given below:
#include <iostream>
#include <cstdio>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <time.h>
int main(int argc, char* argv[]){
    double time_temp;
    int test_map_size = 5000;
    std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);
    // Fill the test map with random integers ---------------------------------
    std::random_device rd;
    std::mt19937 gen1(rd());
    std::uniform_int_distribution<int> dist(1, 5);
    auto gen = std::bind(dist, gen1);
    for(int i = 0; i < test_map_size; i++)
    {
        int vector_len = dist(gen1);
        std::vector<unsigned int> tt(vector_len);
        std::generate(begin(tt), end(tt), gen);
        test_map[i] = tt; // store the vector under key i
    }
    // Sequential implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    std::vector<unsigned int> bin_i, bin_j;
    unsigned int node_v, node_u;
    unsigned int len_rank_bin_i;
    unsigned int len_rank_bin_j;
    int sum_s = 0;
    for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
        bin_i = test_map[i];
        len_rank_bin_i = bin_i.size();
        for (unsigned int j = i; j-- > 0; ) {
            bin_j = test_map[j];
            len_rank_bin_j = bin_j.size();
            for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
                node_u = bin_i[u_i];
                for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                    node_v = bin_j[v_i];
                    if (node_u> node_v)
                        sum_s += 1;
                }
            }
        }
    }
    std::cout<<"Estimated sum (seq): "<<sum_s<<std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for sequential implementation: %.2fs\n", time_temp);
    // Parallel implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    int sum_p = 0;
    #pragma omp parallel
    {
        std::vector<unsigned int> bin_i, bin_j;
        unsigned int node_v, node_u;
        unsigned int len_rank_bin_i;
        unsigned int len_rank_bin_j;
        unsigned int i, u_i, v_i;
        int j;
        #pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
        for (i = (test_map_size - 1); i >= 1; --i) {
            bin_i = test_map[i];
            len_rank_bin_i = bin_i.size();
            #pragma omp parallel for private(u_i,v_i)
            for (j = (i - 1); j >= 0; --j) {
                bin_j = test_map[j];
                len_rank_bin_j = bin_j.size();
                #pragma omp parallel for private(v_i)
                for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
                    node_u = bin_i[u_i];
                    #pragma omp parallel for
                    for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
                        node_v = bin_j[v_i];
                        if (node_u> node_v)
                            sum_p += 1;
                    }
                }
            }
        }
    }
    std::cout<<"Estimated sum (parallel): "<<sum_p<<std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for parallel implementation: %.2fs\n", time_temp);
    return 0;
}
Running the code with command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp in macOS 10.13.3 (i5 processor with four logical cores) gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The parallel implementation takes many times longer than the sequential one. Do you think the code or logic can be adapted to a parallel implementation? I have spent a few days trying to improve the terrible performance of my code, but to no avail. Any help is greatly appreciated.
With the changes suggested by JimCownie, i.e., "using omp for, not omp parallel for" and removing the parallelism of the inner loops, the performance is greatly improved.
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
My CPU has four logical cores (and I am using four threads). Now I am wondering whether there is any way to get four times better performance than the sequential implementation.
I see a different problem when my map of vectors test_map is short but fat at each level, i.e., the map size is small but the vector at each key is very large. In such a case the sequential and parallel implementations perform about the same. It seems like we need to parallelize the inner loops too. Do you know how to achieve that in this context?

Cilk Plus code result depends on number of workers

I have a small piece of code that I would like to parallelize as I upscale. I've been using cilk_for from Cilk Plus to run the multithreading. The trouble is that I get a different result depending on the number of workers.
I've read that this might be due to a race condition, but I'm not sure what specifically about the code causes that or how to ameliorate it. Also, I realize that long and __float128 are overkill for this problem, but might be necessary in the upscaling.
#include <assert.h>
#include "cilk/cilk.h"
#include <cstring>
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <string>
#include <vector>
using namespace std;
__float128 direct(const vector<double>& Rpct, const vector<unsigned>& values, double Rbase, double toWin) {
    unsigned count = Rpct.size();
    __float128 sumProb = 0.0;
    __float128 rProb = 0.0;
    long nCombo = static_cast<long>(pow(2, count));
    // for (long j = 0; j < nCombo; ++j) { //over every combination
    cilk_for (long j = 0; j < nCombo; ++j) { //over every combination
        vector<unsigned> binary;
        __float128 prob = 1.0;
        unsigned point = Rbase;
        for (unsigned i = 0; i < count; ++i) { //over all the individual events
            long exp = static_cast<long>(pow(2, count-i-1));
            bool odd = (j/exp) % 2;
            if (odd) {
                point += values[i];
                prob *= static_cast<__float128>(Rpct[i]);
            } else {
                prob *= static_cast<__float128>(1.0 - Rpct[i]);
            }
        }
        sumProb += prob;
        if (point >= toWin) rProb += prob;
        assert(sumProb >= rProb);
    }
    //print sumProb
    cout << " sumProb = " << (double)sumProb << endl;
    assert( fabs(1.0 - sumProb) < 0.01);
    return rProb;
}
int main(int argc, char *argv[]) {
    vector<double> Rpct;
    vector<unsigned> value;
    unsigned Rbase = 22;
    unsigned win = 30;
    __float128 rProb = direct(Rpct, value, Rbase, win);
    cout << (double)rProb << endl;
    return 0;
}
Sample output for export CILK_NWORKERS=1 && ./code.exe:
sumProb = 1
Sample output for export CILK_NWORKERS=4 && ./code.exe:
sumProb = 0.948159
Assertion failed: (fabs(1.0 - sumProb) < 0.01), function direct, file code.c, line 61.
Abort trap: 6
It is because of a race condition. cilk_for is an implementation of the parallel-for algorithm. If you want to use a parallel for, the iterations must be independent (operate on independent data). This is very important. You have to use Cilk reducers for your case.
To clarify, there is at least one race on sumProb. Each of the parallel workers will do a read/modify/write on that location. As sribin mentioned above, solving problems like this is what reducers are for.
It's entirely possible that there's more than one race in your program. The only way to be sure is to run it under a race detector, since finding races is one of the things that computers are much better at than humans. A free possibility is the Cilkscreen race detector, available from the website. Unfortunately it doesn't support gcc/g++.

Couldn't get acceleration with OpenMP

I am writing a simple parallel program in C++ using OpenMP.
I am working on Windows 7 with Microsoft Visual Studio 2010 Ultimate.
I changed the Language property of the project to "Yes/OpenMP" to enable OpenMP support.
Here is the code:
#include <iostream>
#include <omp.h>
using namespace std;
double sum;
int i;
int n = 800000000;
int main(int argc, char *argv[])
{
    sum = 0;
    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum+= i/(n/10);
}
But, I couldn't get any acceleration by changing the x in omp_set_num_threads(x);
It doesn't matter if I use OpenMp or not, the calculating time is the same, about 7 seconds.
Does someone know what the problem is?
Your pragma statement is missing the parallel specifier:
#include <iostream>
#include <omp.h>
using namespace std;
double sum;
int i;
int n = 800000000;
int main(int argc, char *argv[])
{
    sum = 0;
    #pragma omp parallel for reduction(+:sum) // add "parallel"
    for (i = 0; i < n; i++)
        sum+= i/(n/10);
}
Here's a version that shows some speedup with Hyperthreading. I had to increase the number of iterations by 10x and bump the datatypes to long long:
double sum;
long long i;
long long n = 8000000000;
int main(int argc, char *argv[])
{
    double start = omp_get_wtime();
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum+= i/(n/10);
    double end = omp_get_wtime();
    cout << end - start << endl;
}
Threads: 1
Threads: 2
Threads: 4
Threads: 8
Apart from the error pointed out by Mystical, you seem to assume that OpenMP can just do magic. It can at best use all cores on your machine. If you have 2 cores, it may cut the execution time in half if you call omp_set_num_threads(np) with np>=2, but for np much larger than the number of cores, the code will be inefficient due to parallelization overheads.
The example from Mystical was apparently run on at least 4 cores with np=4.