I have been trying to parallelize a particle simulation code I wrote. However, I see no increase in performance when moving from 1 processor to 12, and, even worse, the code no longer returns accurate results. I have been banging my head against the wall and can't figure this out. Below is the loop being parallelized:
#pragma omp parallel
#pragma omp for
// Loop over azimuth ejection angle, from 0-360.
for(int i=0; i<360; i++)
{
    // Declare temporary variables
    double *y = new double[12];
    vector<double> ejecVel(3);
    vector<double> colLoc(7);
    double azimuth, inc;
    bool collision;
    // Loop over inclination ejection angle from 1-max_angle, increasing by 1 degree.
    for(int j=1; j<=15; j++)
    {
        // Update azimuth and inclination angle and get velocity direction vector.
        azimuth = (double) i;
        inc = (double) j;
        ejecVel = Jet::GetEjecVelocity(azimuth,inc);
        collision = false;
        // Update initial conditions.
        y[0] = m_parPos[0];
        y[1] = m_parPos[1];
        ... (define pointer values)
        // Simulate particle
        if(collision == true)
        {
            cout << "Collision! " << endl;
        }
    }
    delete [] y;
}
The goal is to loop through, simulating particles for different initial conditions, and store where they have gone and their state vector upon collision in the master variables densCount and collisionStates. The simulation takes place in a function from another class (systemSolver.ParticleSim()), and it seems like the solves running on different threads are not independent. Everything I've read suggests that they should be, but I can't figure out why else the results would be wrong only when OpenMP is enabled. Any thoughts are greatly appreciated.
SOLUTION: The simulation was modifying a member variable of a separate (systemSolver) class. Since I provided a single class object to all threads, they were all simultaneously modifying an important member variable. Thought I would post this in case any other n00bs encounter a similar problem.
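The fix described above can be sketched as follows. This is only an illustration under assumptions: `Solver` and `simulate` are hypothetical stand-ins (the real `systemSolver.ParticleSim()` is not shown in the post), and the arithmetic inside is a placeholder. The point is that the solver object is constructed inside the parallel loop, so no member state is shared between threads. If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the solver class: simulate() uses a member
// variable as scratch state, which is what makes a shared instance unsafe.
class Solver {
public:
    double simulate(double azimuth, double inc) {
        assert(inc >= 0.0);
        m_state = azimuth;          // scratch write to member state
        for (int step = 0; step < 1000; ++step)
            m_state += inc * 1e-3;  // another thread must not touch m_state here
        return m_state;
    }
private:
    double m_state = 0.0;
};

// Runs one simulation per (azimuth, inclination) pair and returns the
// results indexed as [i * 15 + (j - 1)].
std::vector<double> runAll() {
    std::vector<double> results(360 * 15);
    #pragma omp parallel for
    for (int i = 0; i < 360; ++i) {
        Solver solver;  // one instance per iteration: nothing shared between threads
        for (int j = 1; j <= 15; ++j)
            results[i * 15 + (j - 1)] =
                solver.simulate(static_cast<double>(i), static_cast<double>(j));
    }
    return results;
}
```

Each iteration also writes to its own slot of `results`, so no locking is needed for the output either.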

I believe one mistake is the call to omp_set_* functions inside the parallel region. In the best case, they take effect on subsequent regions only. Try to reorder as follows:
#pragma omp parallel


TBB Free Image lambda array comparison error

I have been playing around with Threading Building Blocks (TBB) and FreeImagePlus on Linux. I have been trying to compare the speed of a sequential and a parallel approach when subtracting one image from another. However, I noticed that the parallel approach generates some anomalies in the final outcome that I am unsure how to solve, and I am in need of some advice.
My question is: why does the image seem to generate more array comparison errors when using the parallel approach but work fine when using the sequential one? (The image is supposed to be black with a few white spots, so the white pixels in the second image are comparison errors between the two image pixel arrays, which are of type RGBQUAD.)
RGBQUADs are declared before the call to these methods and act as global variables.
RGBQUAD rgb;
RGBQUAD rgb2;
Sequential:
for (auto y = 0; y < height; y++) {
    for (auto x = 0; x < width; x++) {
        inputImage.getPixelColor(x, y, &rgb);
        inputImage2.getPixelColor(x, y, &rgb2);
        rgbDiffVal[y][x].rgbRed = abs(rgb.rgbRed - rgb2.rgbRed);
        rgbDiffVal[y][x].rgbBlue = abs(rgb.rgbBlue - rgb2.rgbBlue);
        rgbDiffVal[y][x].rgbGreen = abs(rgb.rgbGreen - rgb2.rgbGreen);
    }
}
With TBB parallel_for:
parallel_for(blocked_range2d<int,int>(0, height, 0, width), [&] (const blocked_range2d<int,int>& r) {
    auto y1 = r.rows().begin();
    auto y2 = r.rows().end();
    auto x1 = r.cols().begin();
    auto x2 = r.cols().end();
    for (auto y = y1; y < y2; y++) {
        for (auto x = x1; x < x2; x++) {
            inputImage.getPixelColor(x, y, &rgb);
            inputImage2.getPixelColor(x, y, &rgb2);
            rgbDiffVal[y][x].rgbRed = abs(rgb.rgbRed - rgb2.rgbRed);
            rgbDiffVal[y][x].rgbBlue = abs(rgb.rgbBlue - rgb2.rgbBlue);
            rgbDiffVal[y][x].rgbGreen = abs(rgb.rgbGreen - rgb2.rgbGreen);
        }
    }
});
I believe it may have something to do with the lambda capturing rgb and rgb2 by reference, as this is the only thing I can think of that may affect the process. I have observed that if I change the parallel_for blocked range to height and width, the issue goes away; however, this defeats the point of using a parallel method in the first place.
The first thing I would do is put a spin mutex lock around the update (The last three lines of the inner loop where you are storing the result.) This will slow the program down a lot but will tell you if there is a synchronization problem with your update. (It is not obvious, but you are storing adjacent values in the same cache lines for some of the results. Testing is the only way to answer this.)
If it is, you can do atomic swaps to update the result, but that is super-expensive. If not, sorry I didn't spot your problem.
Even if this is not a problem, you might get better performance by using a 1-D blocked_range which is cache-aligned. I'll leave it to you to do the gory implementation.
If these variables are supposed to temporarily store pixel colors, maybe you just need to move the declarations into the lambda, making variables local to each thread. – Alexey Kukanov.
The variables were indeed declared outside the lambda, so every thread was modifying the same referenced variables. This caused a race condition in which one thread would read a variable while another thread was modifying it.
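A minimal sketch of that fix, under stated assumptions: it uses plain std::thread instead of TBB so that it is self-contained, and `Pixel`, `Image` and `diffImages` are hypothetical stand-ins for RGBQUAD, the FreeImage objects and the subtraction routine. What matters is that the temporaries `p` and `q` are declared inside the lambda, so every thread works on its own copies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <thread>
#include <vector>

struct Pixel { unsigned char r, g, b; };   // hypothetical stand-in for RGBQUAD
using Image = std::vector<std::vector<Pixel>>;

// Computes the per-channel absolute difference of two equally sized images,
// splitting the rows across up to numThreads worker threads.
Image diffImages(const Image& a, const Image& b, unsigned numThreads) {
    assert(a.size() == b.size() && numThreads > 0);
    const std::size_t height = a.size();
    Image out(height);

    auto diffRows = [&](std::size_t y1, std::size_t y2) {
        for (std::size_t y = y1; y < y2; ++y) {
            out[y].resize(a[y].size());    // each row is touched by one thread only
            for (std::size_t x = 0; x < a[y].size(); ++x) {
                // Declared inside the lambda: private to the calling thread.
                const Pixel p = a[y][x];
                const Pixel q = b[y][x];
                out[y][x].r = static_cast<unsigned char>(std::abs(p.r - q.r));
                out[y][x].g = static_cast<unsigned char>(std::abs(p.g - q.g));
                out[y][x].b = static_cast<unsigned char>(std::abs(p.b - q.b));
            }
        }
    };

    std::vector<std::thread> workers;
    const std::size_t chunk = (height + numThreads - 1) / numThreads;
    for (std::size_t y1 = 0; y1 < height; y1 += chunk)
        workers.emplace_back(diffRows, y1, std::min(y1 + chunk, height));
    for (auto& w : workers) w.join();
    return out;
}
```

In the TBB version from the question, the equivalent change would be to declare the two RGBQUAD temporaries inside the lambda body rather than before the parallel_for call.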

Optimal way to handle array indexing in parallel?

I have the following situation: I have a list of particles in a box of size L, where L is the length of one of the sides.
Next, I split the box into cells, where L/cell_dim = 7. So there are 7*7*7 cells.
Finally, I read through all the particles, note their position, and calculate which cell they are in.
I accomplish the above in an openMP parallel for loop. However, I need to capture the information in a thread safe fashion such that I don't have to loop through all the particles for each cell. So I need some way to record an arbitrary subset of the particles into each cell, in parallel.
The method I have right now makes use of the OpenMP critical code block. I have an array size [7][7][7][max_particles], where max_particles is the highest number of particles per cell, but which is much less than the total number of particles. I record the index of the last particle added in a counter array size [7][7][7], and update the cell array according to the latest count in my parallel loop:
int cube[7][7][7][10];
int cube_counts[7][7][7]={0};
#pragma omp parallel for num_threads(a lot)
for (int i = 0; i < num_particles; i++){
    cell_x = //cell calculation;
    cell_y = //ditto;
    cell_z = //...;
    #pragma omp critical
    {
        cube_counts[cell_x][cell_y][cell_z] += 1;
        // for readability
        int index = cube_counts[cell_x][cell_y][cell_z];
        cube[cell_x][cell_y][cell_z][index] = i;
    }
}
// rest in pseudo code:
foreach cell:
    adjacent_cell = cell2
    particle_countA = cube_counts[cellx][celly][cellz]
    particle_countB = cube_counts[cell2x][cell2y][cell2z]
    // these two for loops will cover ~2-4 particles,
    // so they are super fast as a result of the cell analysis above.
    for particle in cell:
        for particle in cell2: stuff
Although this works, the code runs more than twice as fast when I am able to eliminate the critical block (I am on an Intel coprocessor with 60 physical and 240 logical cores).
How would I accomplish this without need for the critical block? I thought of doing a big array...but then I lose everything I gained when I iterate through the 7*7*7*257 (where 257 is the particle count) array. Linked lists still have the race conditions.
Maybe some kind of unordered, thread safe list...?
The idea of using a lock instead of the critical section can be taken further:
You may use atomic increment and atomic assignment pseudo calls ("intrinsics") that the compiler will translate to the correct x86 specific assembler instructions. This is however platform or even compiler dependent.
If you use a modern C++ compiler (C++11), then std::atomic_* might be the best way to do it.
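One way the atomic-increment idea could look, as a sketch rather than a drop-in replacement: `CellGrid`, `binParticles` and the flattened `cellOf` input are hypothetical names, and the 7x7x7 grid with 10 slots per cell mirrors the arrays in the question. `fetch_add` returns the count before the increment, so each thread reserves a unique slot (starting at 0) without a critical section. Mixing std::atomic with an OpenMP loop works with common compilers, but treat that combination as an assumption; without OpenMP enabled the loop just runs serially.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

constexpr int kCells = 7 * 7 * 7;
constexpr int kMaxPerCell = 10;   // same cap as cube[7][7][7][10] in the question

struct CellGrid {
    std::vector<std::atomic<int>> counts;  // number of insert attempts per cell
    std::vector<int> slots;                // kMaxPerCell particle indices per cell
    CellGrid() : counts(kCells), slots(kCells * kMaxPerCell, -1) {
        for (auto& c : counts) c.store(0);
    }
};

// cellOf[i] is the flattened cell index of particle i (hypothetical input).
void binParticles(CellGrid& grid, const std::vector<int>& cellOf) {
    const int n = static_cast<int>(cellOf.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const int cell = cellOf[i];
        assert(cell >= 0 && cell < kCells);
        // Atomically reserve the next free slot of this cell.
        const int slot = grid.counts[cell].fetch_add(1, std::memory_order_relaxed);
        if (slot < kMaxPerCell)            // particles beyond the cap are dropped
            grid.slots[cell * kMaxPerCell + slot] = i;
    }
}
```

Note that a cell's counter can exceed `kMaxPerCell` if it overflows, so a reader of `counts` should clamp it to the cap.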

How to use multi-threading within a loop that iterates through a point cloud in C++?

I have made a function that estimates the normal vectors of a 3D Point Cloud and it takes a lot of time to run on a cloud of size 2 million. I want to multi-thread by calling the same function on two different points at the same time but it didn't work (it was creating hundreds of threads). Here is what I tried:
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;
// cloud iterators
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it1;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it2;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it3;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it4;
// initializing tree
// loop exit condition
bool it_completed = false;
while (!it_completed)
{
    // initializing cloud iterators
    cloud_it1 = cloud_it;
    cloud_it2 = cloud_it++;
    cloud_it3 = cloud_it++;
    if (cloud_it3 != pt_cl->points.end())
    {
        // attaching threads
        boost::thread thread_1 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it1, kdt, radius, max_neighbs);
        boost::thread thread_2 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it2, kdt, radius, max_neighbs);
        boost::thread thread_3 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it3, kdt, radius, max_neighbs);
        // joining threads
    }
    else
    {
        it_completed = true;
    }
}
As you can see I am trying to call the same function on 3 different points at the same time. Any suggestions for how to make this work? Sorry for the poor code, I'm tired and thank you in advance.
EDIT: here is the find_normal function
Here are the parameters:
#param pt_cl is a pointer to the point cloud to be treated (pcl::PointCloud<PointXYZRGB>::Ptr)
#param cloud_it is an iterator of this cloud (pcl::PointCloud<PointXYZRGB>::iterator)
#param kdt is the kd_tree used to find the closest neighbours of a point
#param radius defines the range in which to search for the neighbours of a point
#param max_neighbs is the maximum number of neighbours to be returned by the radius search
// auxilliary vectors for the k-tree nearest search
std::vector<int> pointIdxRadiusSearch; // neighbours ids
std::vector<float> pointRadiusSquaredDistance; // distances from the source to the neighbours
// the vectors of which the cross product calculates the normal
geom::vectors::vector3 *vect1;
geom::vectors::vector3 *vect2;
geom::vectors::vector3 *cross_prod;
geom::vectors::vector3 *abs_cross_prod;
geom::vectors::vector3 *normal;
geom::vectors::vector3 *normalized_normal;
// vectors to average
std::vector<geom::vectors::vector3> vct_toavg;
// if there are neighbours left
if (kdt.radiusSearch(*cloud_it, radius, pointIdxRadiusSearch, pointRadiusSquaredDistance, max_neighbs) > 0)
for (int pt_index = 0; pt_index < (pointIdxRadiusSearch.size() - 1); pt_index++)
// defining the first vector
vect1 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 1]]);
// defining the second vector; making sure there is no 'out of bounds' error
if (pt_index == pointIdxRadiusSearch.size() - 2)
    vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[1]]);
else
    vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 2]]);
// adding the cross product of the two previous vectors to our list
cross_prod = geom::vectors::cross_product(*vect1, *vect2);
abs_cross_prod = geom::aux::abs_vector(*cross_prod);
// freeing memory
delete vect1;
delete vect2;
delete cross_prod;
delete abs_cross_prod;
// calculating the normal
normal = geom::vectors::vect_avg(vct_toavg);
// calculating the normalized normal
normalized_normal = geom::vectors::normalize_normal(*normal);
// coloring the point
geom::aux::norm_toPtRGB(&(*cloud_it), *normalized_normal);
// freeing memory
delete normal;
delete normalized_normal;
// clearing vectors
// shrinking vectors
Since I don't quite get how the result data is being stored, I'm going to suggest a solution based on OpenMP that matches the code you've posted.
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;
#pragma omp parallel for schedule(static)
for (pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
     cloud_it < pt_cl->points.end();
     ++cloud_it) {
    geom::vectors::find_normal(pt_cl, cloud_it, kdt, radius, max_neighbs);
}
Note that you should use the < comparison and not the != one; that's how OpenMP works (it wants random access iterators). I'm using the static schedule since every element should take more or less the same time to process. If that's not the case, try schedule(dynamic) instead.
This solution uses OpenMP. You may investigate e.g. TBB as well, though it has a higher barrier to entry than OpenMP and uses an OOP-style API.
Also, repeating what I've already said in the comments: OpenMP as well as TBB will handle thread management and load distribution for you. You only pass them hints (such as schedule(static)) on how to do it so that it better suits your needs.
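To illustrate the loop shape without depending on PCL, here is a small sketch that iterates by integer index, which every OpenMP version accepts. `Vec3` and `normalizeAll` are hypothetical; the per-element normalization merely stands in for find_normal. Each iteration reads and writes only element i, so the iterations are independent. Without OpenMP enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Normalizes every vector in place; iteration i touches only v[i].
void normalizeAll(std::vector<Vec3>& v) {
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        const double len =
            std::sqrt(v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z);
        assert(len > 0.0);   // zero-length vectors are not handled in this sketch
        v[i].x /= len;
        v[i].y /= len;
        v[i].z /= len;
    }
}
```

The runtime decides how many threads to start and which index range each one gets, so no thread is created or joined by hand.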
Other than that, please do get in the habit of repeating as little code as you can; ideally, no code should be duplicated, for example when you declare many variables of the same type or call a certain function several times in a row with a similar pattern. I also see excessive commenting in the code, with no clear reason behind it.

OpenMP - Poor performance when solving system of linear equations

I am trying to use OpenMP to parallelize a simple c++ code that solves a system of linear equations by Gauss elimination.
The relevant part of my code is:
#include <iostream>
#include <time.h>
using namespace std;
#define nl "\n"
void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads){
for (int i=0;i<N;i++){
#pragma omp parallel for num_threads(threads) if(parallel)
for (int j=i+1;j<N;j++){
double aux, * Mi=M[i], * Mj=M[j];
for (int k=i+1;k<N;k++) {
class Time {
clock_t startC, endC;
time_t startT, endT;
void start() {startC=clock(); time (&startT);};
void end() {endC=clock(); time (&endT);};
double timedifCPU() {return(double(endC-startC)/CLOCKS_PER_SEC);};
int timedif() {return(int(difftime (endT,startT)));};
int main (){
Time t;
double ** M, * V;
int N=5000;
cout<<"number of equations "<<N<<nl<<nl;
M= new double * [N];
V=new double [N];
for (int i=0;i<N;i++){
M[i]=new double [N];
for (int m=1;m<=16;m=2*m){
cout<<m<<" threads"<<nl;
for (int i=0;i<N;i++){
for (int j=0;j<N;j++){
M[i][j]=(j+2.3)/(i-0.2)+(i+2)/(j+3); //some function to get regular matrix
cout<<"time "<<t.timedif()<<", CPU time "<<t.timedifCPU()<<nl<<nl;
Since the code is extremely simple, I would expect the time to be inversely proportional to the number of threads. However, the typical result I get is (the code is compiled with gcc on Linux):
number of equations 5000
1 threads
time 217, CPU time 215.89
2 threads
time 125, CPU time 245.18
4 threads
time 80, CPU time 302.72
8 threads
time 67, CPU time 458.55
16 threads
time 55, CPU time 634.41
There is a decrease in time, but much less than I would like, and the CPU time mysteriously grows.
I suspect the problem is in memory sharing, but I have been unable to identify it. Access to row M[j] should not be a problem, since each thread writes to a different row of the matrix. There could be a problem in reading from row M[i], so I also tried to make a separate copy of this row for each thread by replacing the parallel loop with
#pragma omp parallel num_threads(threads) if(parallel)
double Mi[N];
for (int j=i;j<N;j++) Mi[j]=M[i][j];
#pragma omp for
for (int j=i+1;j<N;j++){
double aux, * Mj=M[j];
for (int k=i+1;k<N;k++) {
Unfortunately it does not help at all.
I would very much appreciate any help.
Your problem is excessive OpenMP synchronization.
Having the #pragma omp parallel inside the first loop means that each iteration of the outer loop comes with the whole synchronization overhead.
Take a look at the top-most chart on the image here (more detail can be found on the Allinea MAP OpenMP profiler introduction). The top line is application activity - and dark gray means "OpenMP synchronization" and green means "doing compute".
You can see a lot of dark gray in the right hand side of that top graph/chart - that is when the 16 threads are running. You're spending a lot of time synchronizing.
I also see a lot of time being spent in memory access (more than in compute) - so it's probably this that is making what should be balanced workload actually be highly unbalanced and giving the synchronization delay.
As the other respondent suggested - it's worth reading other literature for ideas here.
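One way to reduce the per-iteration overhead is to open the parallel region once, outside the outer loop, and only share out the row loop inside it. The sketch below is an assumption-laden illustration: the elimination body is not shown in the question, so `solve` uses textbook forward elimination without pivoting (nonzero pivots are assumed), followed by serial back substitution. The implicit barrier at the end of each `omp for` is still needed, because step i+1 reads the row finished in step i.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Solves M x = V by Gaussian elimination without pivoting (sketch only).
std::vector<double> solve(std::vector<std::vector<double>> M, std::vector<double> V) {
    const int N = static_cast<int>(V.size());
    assert(static_cast<int>(M.size()) == N);

    #pragma omp parallel          // threads are started once, not N times
    {
        for (int i = 0; i < N; ++i) {
            #pragma omp for schedule(static)
            for (int j = i + 1; j < N; ++j) {
                const double factor = M[j][i] / M[i][i];
                for (int k = i; k < N; ++k) M[j][k] -= factor * M[i][k];
                V[j] -= factor * V[i];
            }
            // implicit barrier here: all rows are updated before step i + 1
        }
    }

    // Back substitution, done serially.
    std::vector<double> x(N);
    for (int i = N - 1; i >= 0; --i) {
        assert(std::fabs(M[i][i]) > 0.0);
        double s = V[i];
        for (int k = i + 1; k < N; ++k) s -= M[i][k] * x[k];
        x[i] = s / M[i][i];
    }
    return x;
}
```

This removes the repeated region start-up, but not the barrier per outer step, so how much it helps would have to be measured.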
I think the underlying problem may be that traditional Gaussian elimination may not be suitable for parallelization.
Gaussian elimination is a process whereby each subsequent step relies on the result of the previous step, i.e. each iteration of your linear solve loop depends on the results of the previous iteration and must be done serially. Try searching the literature for "parallel row reduction algorithms".
Also, glancing at your code, it looks like you will have a race condition.

Help with code optimization

I've written a little particle system for my 2D application. Here is the rain code:
// HPP -----------------------------------
struct Data
float x, y, x_speed, y_speed;
int timeout;
std::vector<Data> mData;
bool mFirstTime;
void processDrops(float windPower, int i);
// CPP -----------------------------------
: x(rand()%ScreenResolutionX), y(0)
, x_speed(0), y_speed(0), timeout(rand()%130)
{ }
void Rain::processDrops(float windPower, int i)
int posX = rand() % mWindowWidth;
mData[i].x = posX;
mData[i].x_speed = WindPower*0.1; // WindPower is float
mData[i].y_speed = Gravity*0.1; // Gravity is 9.8 * 19.2
// If this is the first time, place drops randomly over the window height
if (mFirstTime)
{
    mData[i].timeout = 0;
    mData[i].y = rand() % mWindowHeight;
}
else
{
    mData[i].timeout = rand() % 130;
    mData[i].y = 0;
}
void update(float windPower, float elapsed)
// If this is first time - create array with new Data structure objects
if (mFirstTime)
for (int i=0; i < mMaxObjects; ++i)
processDrops(windPower, i);
mFirstTime = false;
for (int i=0; i < mMaxObjects; i++)
// Sleep until uptime > 0 (To make drops fall with randomly timeout)
if (mData[i].timeout > 0)
// Find new x/y positions
mData[i].x += mData[i].x_speed * elapsed;
mData[i].y += mData[i].y_speed * elapsed;
// Find new speeds
mData[i].x_speed += windPower * elapsed;
mData[i].y_speed += Gravity * elapsed;
// Drawing here ...
// If drop has been falled out of the screen
if (mData[i].y > mWindowHeight) processDrops(windPower, i);
So the main idea is: I have a structure which consists of a drop's position and speed, and a function that processes the drop at a given index in the vector. On the first run, I create the array at its maximum size and process it in a loop.
But this code runs slower than everything else I have. Please help me optimize it.
I tried replacing all int with uint16_t, but I don't think it matters.
Replacing int with uint16_t shouldn't make any difference (it will take less memory, but shouldn't affect the running time on most machines).
The code shown already seems pretty fast (it does only what it needs to do, and there are no particular mistakes). I don't see how you could optimize it further (at most you could remove the check on mFirstTime, but that should make no difference).
If it's slow, it's because of something else. Maybe you have too many drops, or the rest of your code is so slow that update gets called only a few times per second.
I'd suggest profiling your program to see where most of the time is spent.
One thing that could speed up such an algorithm, especially if your system has no FPU (which is not the case on a personal computer), would be to replace your floating-point values with integers.
Just multiply the elapsed variable (and your constants, like those 0.1) by 1000 so that they represent milliseconds, and use only integers everywhere.
A few points:
The physics is incorrect: the wind force should decrease as the drop's horizontal speed gets closer to the wind speed. For simplicity, I would also assume that the initial value of x_speed is the wind speed.
You don't take friction with the air into account at all, so the drops get faster and faster. But that depends on what you want to model.
I would simply assume that a drop falls at constant speed in a constant direction, because this is the state it reaches very quickly.
You can also optimize all this very simply: you don't need to integrate the equations of motion, because they can be solved directly.
For a steady fall, where the gravity force is equal to friction:
x(t) := x_0 + wind_speed * t
y(t) := y_0 - fall_speed * t
If you want to model drops that fall faster and faster:
x(t) := x_0 + wind_speed * t
y(t) := y_0 - 0.5 * g * t^2
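The two closed-form variants can be written down directly, as in this small sketch. `Position`, `steadyFall` and `acceleratingFall` are hypothetical names, and the sign convention follows the formulas above (y decreases as the drop falls), which is the opposite of the screen coordinates used in the question's code.

```cpp
#include <cassert>

struct Position { float x, y; };

// Steady fall: gravity is balanced by friction, so both speeds are constant.
Position steadyFall(Position start, float windSpeed, float fallSpeed, float t) {
    assert(t >= 0.0f);
    return { start.x + windSpeed * t, start.y - fallSpeed * t };
}

// Accelerating fall: constant horizontal speed, uniform acceleration g.
Position acceleratingFall(Position start, float windSpeed, float g, float t) {
    assert(t >= 0.0f);
    return { start.x + windSpeed * t, start.y - 0.5f * g * t * t };
}
```

With either function, a drop only needs its start position and start time stored; its position at any later time is computed without stepping through the frames in between.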
Few things to consider:
In your processDrops function, you pass in windPower but use some sort of class member or global called WindPower, is that a typo? If the value of Gravity does not change, then save the calculation (i.e. mult by 0.1) and use that directly.
In your update function, rather than calculating windPower * elapsed and Gravity * elapsed for every iteration, calculate and save that before the loop, then add. Also, re-organise the loop, there is no need to do the speed calculation and render if the drop is out of the screen, do the check first, and if the drop is still in the screen, then update the speed and render!
Interestingly, you never check whether the drop is off the screen in terms of its x coordinate: you check the height but not the width. You could save yourself some calculations and rendering time if you did this check as well!
In the loop, introduce a reference Data& current = mData[i] and use it instead of mData[i]. Use this reference instead of the index in processDrops as well.
BTW I think that consulting mFirstTime in processDrops serves no purpose because it will never be true. Hmm, I missed processDrops in initialization loop. Never mind this.
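Put together, the suggestions about hoisting the products, checking the screen bound first and using a reference might look like the sketch below. `Drop` and `updateDrops` are hypothetical; the timeout handling and the drawing from the original are left out, and resetting a drop that has left the screen is reduced to counting it.

```cpp
#include <cassert>
#include <vector>

struct Drop { float x, y, x_speed, y_speed; };

// Advances all drops by one time step and returns how many left the screen.
int updateDrops(std::vector<Drop>& drops, float windPower, float gravity,
                float elapsed, float windowHeight) {
    assert(elapsed >= 0.0f);
    // Loop-invariant products, computed once instead of once per drop.
    const float dvx = windPower * elapsed;
    const float dvy = gravity * elapsed;
    int left = 0;
    for (Drop& d : drops) {            // reference instead of repeated indexing
        d.x += d.x_speed * elapsed;
        d.y += d.y_speed * elapsed;
        if (d.y > windowHeight) {      // off screen: skip speed update and drawing
            ++left;
            continue;
        }
        d.x_speed += dvx;
        d.y_speed += dvy;
    }
    return left;
}
```

As other answers point out, these are small savings; if the drawing dominates, the loop itself will not be where the time goes.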
This looks pretty fast to me already.
You could get a tiny speedup by removing the "firsttime" code and putting it in its own function to call once, rather than testing on every call.
You are doing the same calculation on lots of similar data, so maybe you could look into using SSE intrinsics to process several items at once. You'll likely have to rearrange your data into a structure of vectors rather than a vector of structures as it is now. I doubt it would help much, though. How many items are in your vector anyway?
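The "structure of vectors" layout mentioned above could look roughly like this. The sketch uses no SSE intrinsics; it only arranges the data so that each field is contiguous, which is the precondition for SIMD and may let the compiler vectorize the loop by itself. `Drops` and `integrate` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure of arrays: one contiguous array per field.
struct Drops {
    std::vector<float> x, y, x_speed, y_speed;
};

// Advances every drop by one time step.
void integrate(Drops& d, float windPower, float gravity, float elapsed) {
    const std::size_t n = d.x.size();
    assert(d.y.size() == n && d.x_speed.size() == n && d.y_speed.size() == n);
    const float dvx = windPower * elapsed;
    const float dvy = gravity * elapsed;
    float* x = d.x.data();
    float* y = d.y.data();
    float* vx = d.x_speed.data();
    float* vy = d.y_speed.data();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += vx[i] * elapsed;   // same operation on neighbouring floats
        y[i] += vy[i] * elapsed;
        vx[i] += dvx;
        vy[i] += dvy;
    }
}
```

Whether this beats the original vector of structures depends on the number of drops and on the compiler, so it is worth measuring before committing to the change.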
It looks like maybe all your time goes into ... Drawing Here.
It's easy enough to find out for sure where the time is going.