Parallelizing for loop with openMP inside of while loop?

I have a program structure similar to this:
ssize_t remain = nsamp;
while (!nsamp || remain > 0) {
    #pragma omp parallel for num_threads(nthread)
    for (ssize_t ii=0; ii < nthread; ii++) {
        <generate noise>
    }
    // write noise
    out.write(data, nthread*PERITER);
    remain -= nthread*PERITER;
}
The problem is that when I benchmark this, the results are inconsistent: with, e.g., two threads, it sometimes takes about the same time as a single thread and sometimes gives a 2x speedup. It feels like I'm running into some sort of synchronization race condition: sometimes I avoid it and things go smoothly, and sometimes (often) not.
Does anyone know what might be causing this and what the right way to parallelize a section inside of an outer while loop is?
Edit: Using strace, I see a lot of calls to sched_yield(). This probably makes it look like I'm doing a lot of work on the CPU, when in fact I'm fighting the scheduler for a good scheduling pattern.

You are starting a new team of threads each time the body of the while loop is entered. After the parallel loop, the team is torn down again. Because of the nature of a while loop, this might happen irregularly (depending on the condition).
So if your loop gets executed only a few times, the cost of creating the threads might outweigh the actual workload, and you get at most sequential performance, if not worse. The OpenMP runtime may be able to keep the threads alive between parallel regions when the loop is entered many times, but nothing is guaranteed.

I'd suggest something like this.
For nsamp == 0 you'll need some more sensible handling. For proper signal handling with OpenMP, please refer to this answer.
ssize_t remain = nsamp;
#pragma omp parallel num_threads(nthread) shared(out, remain, data)
while (remain > 0) {
    #pragma omp for
    for (ssize_t ii=0; ii < nthread; ii++) {
        /* generate noise */
    }
    #pragma omp single
    {
        // write noise
        out.write(data, nthread*PERITER);
        remain -= nthread*PERITER;
    }
}
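For reference, here is a minimal compilable sketch of the same pattern: one parallel region around the whole while loop, an omp for for the work, and an omp single for the bookkeeping. The names fill_block and produce are hypothetical stand-ins for the noise generator and the output stage, and the "write" is reduced to a counter so the example stays self-contained. It assumes nblocks and per_iter are positive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the noise generator: fills block `block` of
// `data` (per_iter samples long) with the current round number.
static void fill_block(std::vector<int>& data, int block, int per_iter, int round) {
    for (int k = 0; k < per_iter; ++k)
        data[static_cast<std::size_t>(block) * per_iter + k] = round;
}

// Produces at least `nsamp` samples in rounds of nblocks * per_iter samples
// and returns how many samples were "written" in total.
// Assumes nblocks > 0 and per_iter > 0.
long produce(long nsamp, int nblocks, int per_iter) {
    std::vector<int> data(static_cast<std::size_t>(nblocks) * per_iter);
    const long chunk = static_cast<long>(nblocks) * per_iter;
    long remain = nsamp;
    long written = 0;
    int round = 0;

    // One parallel region for the whole run: the team is started once.
    #pragma omp parallel shared(data, remain, written, round)
    {
        // Every thread evaluates the same condition on the shared `remain`.
        while (remain > 0) {
            #pragma omp for
            for (int ii = 0; ii < nblocks; ++ii)
                fill_block(data, ii, per_iter, round);
            // Implicit barrier here: all blocks are complete.

            #pragma omp single
            {
                // Stand-in for out.write(...): only one thread does the bookkeeping.
                written += chunk;
                remain -= chunk;
                ++round;
            }
            // Implicit barrier here: every thread sees the updated `remain`.
        }
    }
    return written;
}
```

The two implicit barriers are what make reading the shared remain in the loop condition safe: no thread can reach the omp single before every thread has passed the condition check and finished the omp for. Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result.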

Related

Parallelizing for-loop inside another loop efficiently with OpenMP

I have a problem writing the parallel directives for code that works like this:
// every iteration depends on the previous one
for (int iter = 0; iter < numIters; ++iter)
{
    #pragma omp parallel for num_threads(numThreads)
    for (int p = 0; p < numParticles; ++p)
    {
        p_velocity_calculation(...);
    }
    // implicit sync barrier
    #pragma omp parallel for num_threads(numThreads)
    for (int p = 0; p < numParticles; ++p)
    {
        p_position_calculation(...);
    }
}
The program is an n-body simulation in which I first need to calculate the velocities and then the positions of a set of particles, hence the separation into two for-loops.
The code runs as expected, but from what I have read, the thread pools created by the #pragma omp directives are created and destroyed on every iteration of the outer for-loop, and I don't want to waste resources creating them.
So my question is: how can I reuse those thread pools instead of creating and destroying the threads on every iteration?

First of all: the thread pools are not destroyed, only suspended.
Next: Have you timed this and found that creating the threads is a limiting factor in your application? If not, don't worry.
Or to put it constructively: I have timed it, and unless you have an extremely short omp parallel for and you call it tens of thousands of times, the overhead is negligible.
But if you are really worried, put the omp parallel outside the time loop, and put an omp for around each particle loop. Any code between the for loops will then be executed redundantly by every thread, which you can either accept or wrap in an omp master if it affects global variables.
But really: I wouldn't worry.
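If you do want to try it, the restructuring could look roughly like the sketch below. The function simulate and the constant-acceleration update are hypothetical placeholders for p_velocity_calculation and p_position_calculation; the point is only the placement of the directives.

```cpp
#include <vector>

// Sketch: one parallel region around the time loop, with a worksharing
// `omp for` on each particle loop. The physics is a placeholder
// (constant acceleration of 1.0), not the original calculation.
void simulate(std::vector<double>& pos, std::vector<double>& vel,
              int numIters, double dt) {
    const int numParticles = static_cast<int>(pos.size());

    #pragma omp parallel
    {
        // `iter` is declared inside the region, so each thread has its own copy
        // and all threads step through the time loop together.
        for (int iter = 0; iter < numIters; ++iter) {
            #pragma omp for
            for (int p = 0; p < numParticles; ++p)
                vel[p] += 1.0 * dt;          // stand-in for the velocity update
            // Implicit barrier: all velocities are final before positions move.

            #pragma omp for
            for (int p = 0; p < numParticles; ++p)
                pos[p] += vel[p] * dt;       // stand-in for the position update
            // Implicit barrier: the next time step sees all new positions.
        }
    }
}
```

The implicit barrier at the end of each omp for preserves the ordering the original code relied on, so the results should match the version with two separate omp parallel for regions.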

Parallelize Algorithm with OpenMP in C++

My problem is this:
I want to solve the TSP with the Ant Colony Optimization algorithm in C++.
Right now I've implemented an algorithm that solves this problem iteratively.
For example: I generate 500 ants, and they find their routes one after the other.
Each ant does not start until the previous ant has finished.
Now I want to parallelize the whole thing, and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work simultaneously (for a number of ants > 500)?
I have already tried something. This is the code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
    #pragma omp ordered
    if (ant->getIterations() < ITERATIONSMAX) {
        ant->setNumber(currentAntNumber);
        currentAntNumber++;
        ant->antRoute();
    }
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
    this->route.setCity(0, this->getStartIndex());
    int nextCity = this->getNextCity(this->getStartIndex());
    this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
    int tempCity;
    int i = 2;
    this->setProbability(nextCity);
    this->setVisited(nextCity);
    this->route.setCity(1, nextCity);
    updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
    while (this->getVisitedCount() < datacitycount) {
        tempCity = nextCity;
        nextCity = this->getNextCity(nextCity);
        this->setProbability(nextCity);
        this->setVisited(nextCity);
        this->route.setCity(i, nextCity);
        this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
        updatePheromone(tempCity, nextCity, routedistance, 0);
        i++;
    }
    this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
    // updatePheromone(-1, -1, -1, 1);
    ShortestDistance(this->routedistance);
    this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
    #pragma omp critical(pheromone)
    if (reduce == 1) {
        for (int x = 0; x < datacitycount; x++) {
            for (int y = 0; y < datacitycount; y++) {
                if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
                    this->data->pheromoneMatrix[x][y] = 0.0;
                else
                    this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
            }
        }
    }
    else {
        double currentpheromone = this->data->pheromoneMatrix[i][j];
        double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
        if (updatedpheromone < 0.0) {
            this->data->pheromoneMatrix[i][j] = 0;
            this->data->pheromoneMatrix[j][i] = 0;
        }
        else {
            this->data->pheromoneMatrix[i][j] = updatedpheromone;
            this->data->pheromoneMatrix[j][i] = updatedpheromone;
        }
    }
}
For some reason the omp parallel for won't work on these range-based loops. So this is my second question: if you have any suggestions on how to get the range-based loops working, I'd be happy to hear them.
Thanks for your help

So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active; instead, you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but the extra threads would just idle). This differs from other parallelization approaches such as pthreads, where you have to manage all the threads and what they do yourself.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead, make sure that the ants know their number beforehand, write the code such that they don't need a number, or at the very least make sure that the order of the numbers doesn't matter to the ants. In the latter case you can use omp atomic capture.
As for access to shared data: try to avoid it as much as possible. Adding omp critical is a first step towards a correct parallel program, but it often leads to performance problems. Measure your parallel efficiency, and use parallel performance analysis tools to find out whether this is the case for you. Then you can use atomic data access or a reduction (each thread has its own data to work on, and only after the main work is finished is the data from all threads merged).
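To illustrate the last two points, here is a small sketch that hands out ant numbers with omp atomic capture and finds the shortest route with a min reduction instead of a critical section. The Ant struct, run_colony, and the route-length formula are made up for the example and are not the asker's classes; both atomic capture and min reductions need OpenMP 3.1 or newer. The loop uses a plain integer index, which every OpenMP version accepts.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical, simplified ant: just a number and a route length.
struct Ant {
    int number = -1;
    double routeLength = 0.0;
};

// Gives every ant a unique number (in no particular order) and returns
// the shortest route length found by any ant.
double run_colony(std::vector<Ant>& ants) {
    int nextNumber = 0;
    double best = std::numeric_limits<double>::max();
    const int n = static_cast<int>(ants.size());

    // Each thread keeps a private `best`; the copies are merged with min at the end.
    #pragma omp parallel for reduction(min : best)
    for (int a = 0; a < n; ++a) {
        int mine;
        // Atomically read the counter and increment it: numbers are unique,
        // but which ant gets which number depends on thread timing.
        #pragma omp atomic capture
        mine = nextNumber++;

        ants[a].number = mine;
        // Placeholder for the real route construction (antRoute()).
        ants[a].routeLength = 100.0 + (a * 7) % 13;
        best = std::min(best, ants[a].routeLength);
    }
    return best;
}
```

The pheromone matrix is deliberately left out: whether its updates can be turned into per-thread copies that are merged afterwards depends on the variant of the algorithm, so that part would still need measuring.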

Boost Thread_Group in a loop is very slow

I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
    for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
        for (int im = 0;im < m_images.size();im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through the lines of pixel data, then each pixel, and then the multiple images. It's a weird project, but anyway: I bind each thread to a method of the same instance of the class this code is in, which is why "this" is used. The innermost loop runs through a population of about 20 images, creating one thread per image, and once that loop is done, join_all waits until all the threads have finished. Then it goes on to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
    for (int i = 0;i < 100;i++) {
        std::cout << "Index : " <<index<<" "<<i << std::endl;
    }
}
int main() {
    boost::thread_group tGroup;
    for (int i = 0;i < 50;i++){
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();
    int done;
    std::cin >> done;
    return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is. It takes about 4 seconds for a single iteration of the outer sourceImageData (line) loop to complete. I'm new to Boost threading, so I don't know whether something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.

The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
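As a rough illustration of "few threads, bigger tasks", the sketch below starts at most one worker per logical core and gives each worker whole scan-lines to process. It uses std::thread from the standard library rather than boost::thread_group, and process_pixel and process_image are hypothetical stand-ins for ClassXFunction and the surrounding loop.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-pixel work.
static int process_pixel(int line, int pixel) { return line + pixel; }

// Processes an image of `lines` x `pixelsPerLine` pixels and returns one
// result per scan-line. Threads are created once, not once per pixel.
std::vector<long> process_image(int lines, int pixelsPerLine) {
    std::vector<long> lineSums(static_cast<std::size_t>(std::max(lines, 0)), 0);

    // hardware_concurrency() may return 0 if the value is unknown.
    int hw = static_cast<int>(std::thread::hardware_concurrency());
    if (hw < 1) hw = 1;
    const int nthreads = std::max(1, std::min(hw, lines));

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&lineSums, lines, pixelsPerLine, nthreads, t] {
            // Worker t handles lines t, t + nthreads, t + 2*nthreads, ...
            for (int line = t; line < lines; line += nthreads) {
                long sum = 0;
                for (int pixel = 0; pixel < pixelsPerLine; ++pixel)
                    sum += process_pixel(line, pixel);
                lineSums[line] = sum;   // each line is written by exactly one worker
            }
        });
    }
    for (std::thread& w : workers) w.join();   // a single join at the very end
    return lineSums;
}
```

A proper task queue would balance uneven lines better; the fixed line-to-thread assignment here is only the simplest way to show the difference in thread count.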

I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you are basically pausing the creation of any new threads until ALL the threads that need to be synchronized (in this case, all the threads that are active) are done running.
If the iterations of the innermost loop (the one over im) do not depend on each other, I would suggest you join the threads after the entire outermost loop is done.

OpenMP optimize scheduling of for loop

I need some help with OpenMP. When a thread has finished its part of a for loop, is it possible for it to help another thread by taking over part of that thread's remaining work? I have a loop inside a loop that contains breaks, so the threads don't finish at the same time: some threads still have a lot of work while others are already done (so there are unused cores). I run my program on a Core i7, and it seems that OpenMP divides the loop among 8 threads. But the utilization starts to drop after some time, once the first thread has finished its job.
#pragma omp parallel for
for(i = 0; i < Vector.size(); i++) {
    for(j = 0; j < othervector.size(); j++) {
        {some code}
        if(sth is true) break;
    }
}
Thank you.

The default division (schedule) of the loop iterations in a for loop is implementation-dependent. In your case, with omp parallel for, the default schedule may be static, which means that each thread is assigned a fixed chunk of the iterations, depending on the size of your vector. Since the workload apparently can't be balanced by dividing it statically, you should try the dynamic, guided and runtime schedule kinds and see whether they help you re-establish high utilization of your (virtual) cores. Depending on the chunk size this will of course cause additional overhead, but that may be negligible compared with the time your cores spend idle under static scheduling.
To answer the original question: I don't think that you can tell a thread to continue the work of another one. When the work gets assigned each thread has to deal with it on its own. Here is what I would try out.
#define CHUNKSIZE 100
#pragma omp parallel for schedule(dynamic, CHUNKSIZE)
for(i = 0; i < Vector.size(); i++) {
    for(j = 0; j < othervector.size(); j++) {
        {some code}
        if(sth is true) break;
    }
}
Actually Hristo Iliev wrote a very nice answer to a similar question some time ago.
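A compilable sketch of this idea follows. The function first_multiple_index and its search condition are invented for the example; what matters is that the inner loop breaks at an unpredictable point, so the cost per outer iteration varies, and schedule(dynamic, 16) lets idle threads fetch the next chunk of outer iterations. The chunk size of 16 is an arbitrary starting value to tune, and all entries of values are assumed to be non-zero.

```cpp
#include <vector>

// For each entry of `values`, finds the index of the first entry of
// `candidates` that is a multiple of it, or -1 if there is none.
// Assumes every entry of `values` is non-zero.
std::vector<int> first_multiple_index(const std::vector<int>& values,
                                      const std::vector<int>& candidates) {
    const int n = static_cast<int>(values.size());
    const int m = static_cast<int>(candidates.size());
    std::vector<int> result(values.size(), -1);

    // Chunks of 16 outer iterations are handed out on demand, so a thread
    // whose iterations break early simply fetches more work.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            if (candidates[j] % values[i] == 0) {
                result[i] = j;
                break;   // breaking out of the inner (non-workshared) loop is allowed
            }
        }
    }
    return result;
}
```

Replacing dynamic with guided, or with runtime plus the OMP_SCHEDULE environment variable, makes it easy to compare the schedule kinds without recompiling for each one.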

Bad performance of parallelized OpenMP code for particle simulation

I am trying to parallelize a code for particle-based simulations and experiencing poor performance of an OpenMP based approach. By that I mean:
When displaying CPU usage with the Linux tool top, the CPUs running OpenMP threads show an average usage of 50 %.
With an increasing number of threads, the speedup converges to a factor of about 1.6. Convergence is quite fast, i.e. I reach a speedup of 1.5 using 2 threads.
The following pseudo code illustrates the basic template for all parallel regions implemented.
Note that during a single time step, 5 parallel regions of the below shown fashion are being executed. Basically, the force acting on a particle i < N is a function of several field properties of neighboring particles j < NN(i).
omp_set_num_threads(ncpu);
#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
    int i,j,N,NN;
    #pragma omp for
    for( i=0; i<N; i++ ){ // Looping over all particles
        for ( j=0; j<NN(i); j++ ){ // Nested loop over all neighbors of i
            // No communication between threads, atomic regions,
            // barriers whatsoever.
            force[i] += function(j);
        }
    }
}
I am trying to sort out the cause for the observed bottleneck. My naive initial guess for an explanation:
As stated, there is a large amount of memory being shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?

How large is N, and how intensive is NN(i)?
You say nothing is shared, but force[i] is probably in the same cache line as force[i+1]. This is what's known as false sharing, and it can be pretty detrimental. OpenMP should batch iterations together to compensate for this, so with a large enough N I don't think this would be your problem.
If NN(i) isn't very CPU intensive, you might have a simple memory bottleneck -- in which case throwing more cores at it won't solve anything.

Assuming that force is a plain array of 4- or 8-byte data, you definitely have false sharing, no doubt about it.
Assuming that function(j) is independently calculated, you may want to do something like this:
// Sketch only: assumes N and the neighbor counts are multiples of STEP.
for( i=0; i<N; i+=STEP ){ // Looping over all particles
    for ( j=0; j<NN(i); j+=STEP ){ // Nested loop over all neighbors of i
        // No communication between threads, atomic regions,
        // barriers whatsoever.
        calc_next(i, j);
    }
}
void calc_next(int i, int j)
{
    int ii, jj;
    for(ii = 0; ii < STEP; ii++)
    {
        for(jj = 0; jj < STEP; jj++)
        {
            // accumulate, as in the original loop
            force[i+ii] += function(j+jj);
        }
    }
}
That way, you calculate a bunch of things on one thread, and a bunch of things on the next thread, and each bunch is far enough apart that you don't get false sharing.
If you can't do it this way, try to split it up in some other way that leads to larger sections being calculated each time.

As the others have stated, false sharing on force could be a reason. Try this simple change:
#pragma omp for
for( i=0; i<N; i++ ){
    auto sum = force[i]; // local accumulator with the same type as force[i]
    for ( j=0; j<NN(i); j++ ){
        sum += function(j);
    }
    force[i] = sum;
}
Technically, it's possible that force[i] = sum still causes false sharing. But it's highly unlikely, because with a static schedule the other threads would be accessing force[i + N/omp_get_num_threads()*omp_get_thread_num()], which is pretty far from force[i].
If scalability is still poor, try a profiler such as Intel Parallel Amplifier (or VTune) to see how much memory bandwidth is needed per thread. If memory bandwidth turns out to be the limit, put some more DRAM in your computer :) That will really boost memory bandwidth.
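For completeness, a self-contained version of the local-accumulator idea might look like the sketch below. The neighbor lists and the field array are hypothetical stand-ins for NN(i) and function(j); force is assumed to hold doubles.

```cpp
#include <vector>

// Sums a per-particle "force" from the field values of each particle's
// neighbors. neighbors[i] lists the indices of the neighbors of particle i.
std::vector<double> accumulate_forces(const std::vector<std::vector<int>>& neighbors,
                                      const std::vector<double>& field) {
    const int n = static_cast<int>(neighbors.size());
    std::vector<double> force(neighbors.size(), 0.0);

    // schedule(static) gives each thread one contiguous block of particles,
    // so threads only touch neighboring cache lines at the block boundaries.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;                 // thread-private accumulator
        for (int j : neighbors[i])
            sum += field[j];              // read-only access to shared data
        force[i] = sum;                   // one write per particle
    }
    return force;
}
```

If this version scales no better than the original, false sharing was probably not the problem, and a memory bandwidth limit becomes the more likely explanation.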
