OpenMP parallel only one thread seems to run at a time - c++

I'm trying to parallelize the for loop below with OpenMP, but only one thread seems to be running at a time. I can tell this based on the following observations:
Normally when I have prints inside the loop, the output is jumbled and lines get mixed together; here, however, all my output is printed cleanly, suggesting that only one thread is executing at a time.
There is some heavy dynamic programming computation going on inside the loop, yet I only see CPU usage on one core in htop.
If I print the current thread number with omp_get_thread_num(), I only see one active thread at a time, e.g. some iterations all from thread 4, then some iterations all from thread 3, and so on.
This only happens after a while. For the first few iterations, things seem to run in parallel.
I'm not sure if there is anything in the code that prevents OpenMP from running two threads in parallel. Below is the for loop and the prototypes of the functions called inside it. The functions only work with what's passed into them and don't modify any other data structures.
I suspect this may have something to do with the fact that I'm passing const references around; could that be the case?
// variables
string ref ; // read-only access
vector<vector<Cluster>> _clusters(24) ;
vector<Cluster> position_clusters = some_function() ;

#pragma omp parallel for num_threads(24) schedule(dynamic, 10)
for (int i = 0; i < position_clusters.size(); i++) {
    auto& pc = position_clusters[i] ;
    if (pc.size() < 2) {
        continue ;
    }
    vector<Cluster> type_clusters = type_cluster(pc) ;
    for (Cluster &tc : type_clusters) {
        if (tc.size() < 2) {
            continue ;
        }
        auto clusters = cluster_breakpoints(tc, 0.7) ; // dynamic programming
        for (const Cluster &c : clusters) {
            auto result = dynamic_programming(c, ref) ; // dynamic programming
            _clusters[omp_get_thread_num()].push_back(result) ;
        }
    }
}
// Prototypes:
vector<Cluster> type_cluster(const Cluster &c) ;
vector<Cluster> cluster_breakpoints(Cluster& cluster, float ratio) ;
vector<Cluster> dynamic_programming(const Cluster& cluster, const string& ref) ;

Related

Parallelize Algorithm with OpenMP in C++

My problem is this:
I want to solve the TSP with the Ant Colony Optimization algorithm in C++.
Right now I've implemented an algorithm that solves the problem iteratively.
For example: I generate 500 ants, and they find their routes one after the other.
Each ant doesn't start until the previous ant has finished.
Now I want to parallelize the whole thing, and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
I've already tried something out. This is the code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
    #pragma omp ordered
    if (ant->getIterations() < ITERATIONSMAX) {
        ant->setNumber(currentAntNumber);
        currentAntNumber++;
        ant->antRoute();
    }
}
And this is the code in my Ant class that is "critical" because each Ant reads from and writes to the same matrix (the pheromone matrix):
void Ant::antRoute()
{
    this->route.setCity(0, this->getStartIndex());
    int nextCity = this->getNextCity(this->getStartIndex());
    this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
    int tempCity;
    int i = 2;
    this->setProbability(nextCity);
    this->setVisited(nextCity);
    this->route.setCity(1, nextCity);
    updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);

    while (this->getVisitedCount() < datacitycount) {
        tempCity = nextCity;
        nextCity = this->getNextCity(nextCity);
        this->setProbability(nextCity);
        this->setVisited(nextCity);
        this->route.setCity(i, nextCity);
        this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
        updatePheromone(tempCity, nextCity, routedistance, 0);
        i++;
    }

    this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
    // updatePheromone(-1, -1, -1, 1);
    ShortestDistance(this->routedistance);
    this->iterationsshortestpath++;
}

void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
    #pragma omp critical(pheromone)
    if (reduce == 1) {
        for (int x = 0; x < datacitycount; x++) {
            for (int y = 0; y < datacitycount; y++) {
                if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
                    this->data->pheromoneMatrix[x][y] = 0.0;
                else
                    this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
            }
        }
    }
    else {
        double currentpheromone = this->data->pheromoneMatrix[i][j];
        double updatedpheromone = (1 - PHEROMONEREDUCTION) * currentpheromone + (PHEROMONEDEPOSIT / distance);
        if (updatedpheromone < 0.0) {
            this->data->pheromoneMatrix[i][j] = 0;
            this->data->pheromoneMatrix[j][i] = 0;
        }
        else {
            this->data->pheromoneMatrix[i][j] = updatedpheromone;
            this->data->pheromoneMatrix[j][i] = updatedpheromone;
        }
    }
}
So for some reason the omp parallel for loop won't work on these range-based loops. This is my second question: if you have any suggestions on how to get the range-based loops working, I'd be happy to hear them.
Thanks for your help.
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active; instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference from other parallelization approaches such as pthreads, where you have to manage all the threads and what they do.
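To make that concrete, here is a minimal sketch of the work-sharing idea applied to your loop (assuming antarmy is a random-access container such as std::vector<Ant>): the loop exposes all 500 iterations, and however many threads the runtime provides simply share them.

// Sketch: expose one iteration per ant; the runtime decides how many
// threads actually execute them.
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < (int)antarmy.size(); ++i) {
    antarmy[i].antRoute(); // each iteration is one ant's tour
}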
Now, your example uses ordered incorrectly. ordered is only useful if you have a small part of your loop body that needs to be executed in order. Even then it can be very problematic for performance. Also, you need to declare a loop as ordered if you want to use ordered inside it. See also this excellent answer.
You should not use ordered. Instead, make sure that the ants know their number beforehand, write the code so that they don't need a number, or at the very least so that the order of numbers doesn't matter for the ants. In the latter case you can use omp atomic capture, as sketched below.
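A rough sketch of the atomic capture variant (reusing currentAntNumber and ITERATIONSMAX from your code): each ant atomically grabs a unique number without imposing any ordering on the loop.

#pragma omp parallel for
for (int i = 0; i < (int)antarmy.size(); ++i) {
    if (antarmy[i].getIterations() < ITERATIONSMAX) {
        int myNumber;
        // Atomically read and increment the shared counter; the numbers are
        // unique, but the order in which ants receive them is unspecified.
        #pragma omp atomic capture
        myNumber = currentAntNumber++;
        antarmy[i].setNumber(myNumber);
        antarmy[i].antRoute();
    }
}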
As to the access to shared data: try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but it often leads to performance problems. Measure your parallel efficiency and use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or a reduction (each thread has its own data to work on, and only after the main work is finished is the data from all threads merged).
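A hedged sketch of that merge-at-the-end idea for the pheromone matrix (the names n, numAnts, pheromoneMatrix and recordDepositsFor are placeholders, not your actual members, and this only sketches the deposit side): each thread accumulates deposits into its own private matrix, and a single short critical section folds them into the shared matrix afterwards.

// Sketch: per-thread pheromone deposits, merged once at the end instead of
// taking a critical section on every single edge update.
std::vector<std::vector<double>> deposit(n, std::vector<double>(n, 0.0));
#pragma omp parallel firstprivate(deposit)
{
    #pragma omp for
    for (int a = 0; a < numAnts; ++a) {
        recordDepositsFor(a, deposit); // placeholder: run ant a, write into deposit[i][j]
    }
    #pragma omp critical(pheromone_merge)
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < n; ++y)
            pheromoneMatrix[x][y] += deposit[x][y];
}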

Boost Thread_Group in a loop is very slow

I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through the lines of pixel data, each pixel, and then the multiple images. It's a weird project, but anyway, I bind each thread to a method in the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding each thread as it goes, and then when the loop is done the join_all call waits for all the threads to finish. Then it goes to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
#include <iostream>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

void run(int index) {
    for (int i = 0; i < 100; i++) {
        std::cout << "Index : " << index << " " << i << std::endl;
    }
}

int main() {
    boost::thread_group tGroup;
    for (int i = 0; i < 50; i++) {
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();
    int done;
    std::cin >> done;
    return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is. It takes about 4 seconds for one loop over sourceImageData (one line) to complete. I'm new to boost threading, so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but one per scan-line, for example), as sketched below.
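A minimal sketch of that idea with boost::thread_group (processLine is a hypothetical helper that handles all pixels and images of one scan-line by calling ClassXFunction internally): start roughly one thread per core and let each thread work through whole scan-lines.

// Sketch: a handful of threads, each processing whole scan-lines instead of
// one thread per (pixel, image) pair.
boost::thread_group tGroup;
unsigned nThreads = boost::thread::hardware_concurrency();
for (unsigned t = 0; t < nThreads; ++t) {
    tGroup.create_thread([&, t] {
        // simple static partition: thread t takes every nThreads-th line
        for (std::size_t line = t; line < sourceImageData.size(); line += nThreads) {
            processLine(line); // hypothetical: loops over pixels and images of this line
        }
    });
}
tGroup.join_all(); // join once, after all lines have been processed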
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you basically pause execution of any new threads until ALL threads that need to be synchronized (which in this case is all the threads that are active) are done running.
If the iterations of the innermost loop (the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
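In code, that suggestion boils down to moving the single join_all() outside the loop nest (a sketch based on the question's snippet; the earlier answer's caveat about the sheer number of threads still applies):

boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
    }
}
tGroup.join_all(); // synchronize once, after all threads have been created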

OpenMP parallelization

I'm writing a C++ program for scientific purposes. The program works well and returns good results, so I decided to improve its performance using OpenMP. The loop I want to optimize is the following one:
    // #pragma omp parallel for private(i,j)
    for (k = 0; k < number; k++)
    {
        for (i = 0; i < L; i++)
        {
            for (j = 0; j < L; j++)
            {
                red[i][j] = UNDEFINED;
            }
        }
        Point inicial = {L/2, L/2, OCCUPIED};
        red[L/2][L/2] = OCCUPIED;
        addToList(inicial, red, list, L, f);
        oc.push_back(inicial);
        while (list.size() > 0 && L > 0)
        {
            punto = selectPoint(red, list, generator, prob, p);
            if (punto.state == OCCUPIED)
            {
                addToList(punto, red, list, L, f);
                oc.push_back(punto);
            }
            else
            {
                out.push_back(punto);
            }
        }
        L = auxL;
        oc.clear();
        out.clear();
        list.clear();
    }
    f = f*1.0/(number*1.0);
    if (f > 0.5)
    {
        inta = inta;
        intb = p;
        p = (inta + intb) / 2.0;
    }
    else if (f < 0.5)
    {
        intb = intb;
        inta = p;
        p = (inta + intb) / 2.0;
    }
    cout << p << endl;
}
My attempt with OpenMP is the commented-out pragma above. As you can see, I've declared i and j as private because they're declared before the parallel section. I've also tried to make L private, with no result: only segmentation faults and bad pointers everywhere.
I think the problem is the while loop nested inside. My questions are: is the omp parallel for correct in this case, or should I try to optimize only that while loop? Are the std::vectors interfering with OpenMP?
NOTE: list, oc and out are std::vector<Point>, and Point is a simple struct with three int properties. addToList is a function with no loops inside.
You might want to go over an OpenMP tutorial. When you look at OpenMP code, you need to imagine what can happen in parallel. Take
oc.push_back(inicial);
Can two threads try to do this at the same time? Yes. Does std::vector support parallelism? No.
The code above is full of these things.
If you want to use data structures within your OpenMP code, you need to use locks. From my personal experience, when this happens, it is far better to refactor the algorithm than to actually use them. While OpenMP + locks is possible, it is usually an indication that there's a problem with the idea (a possibly subjective view).
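For illustration, a minimal sketch of that lock-based first step applied to one of the shared accesses from the question; every other access to the shared red, list, oc and out (including the ones hidden inside addToList) would need the same treatment, which is exactly why refactoring usually wins.

// Sketch: serialize access to the shared vector so concurrent push_backs
// cannot corrupt it. Correct, but the pushes no longer run in parallel.
#pragma omp critical(shared_state)
{
    oc.push_back(inicial);
}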
The current answer points out the concurrency issues in the code, but please note that not all data structures have to be implemented with locks to attain thread safety. There are also lock-free data structures. For this particular case, we could use the Harris lock-free linked list: https://timharris.uk/papers/2001-disc.pdf
While I know that pointing out the concurrency issues to the OP is of great assistance at this point, I want to make sure we don't convey the wrong message by saying that locks are absolutely necessary to attain thread safety.
The directive #pragma omp parallel defines a piece of code that can be executed simultaneously by various threads. In your case, as you have not specified any further directive, your parallel region will be executed once by every thread. In order to achieve parallel behavior you could try to break the loop into smaller tasks (the taskloop directive will do the job). Those tasks remain in a task pool until a thread starts executing them. This way your loop is fragmented and executed by your threads instead of making each thread execute the whole loop.
Here's the official OpenMP documentation for the taskloop directive: https://www.openmp.org/spec-html/5.0/openmpsu47.html
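A hedged sketch of what that taskloop suggestion could look like (it requires an OpenMP 4.5+ compiler; the body is abbreviated, and the data races discussed in the other answer would still need to be resolved):

#pragma omp parallel
#pragma omp single    // one thread creates the tasks...
#pragma omp taskloop  // ...and the whole team executes them
for (int k = 0; k < number; k++) {
    // ... body of the k loop, operating on private copies of
    //     red, list, oc and out to avoid the shared-data races ...
}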

OpenMP share file handler

I've got a loop which I parallelize using OpenMP. In this loop, I read a triangle from a file and perform some operations on this data. These operations are independent from one triangle to the next, so I thought this would be easy to parallelize, as long as I kept the actual reading of files in a critical section.
Order in which triangles are read is not important
Some triangles are read and get discarded pretty quickly, some need some more algorithmic work (bbox construction, ...)
I'm doing binary I/O
Using a C++ ifstream tri_data
I'm testing this on an SSD
ReadTriangle calls file.read() and reads 12 floats from an ifstream.
#pragma omp parallel for shared(tri_data)
for (int i = 0; i < ntriangles; i++) {
    vec3 v0, v1, v2, normal;
    #pragma omp critical
    {
        readTriangle(tri_data, v0, v1, v2, normal);
    }
    // ... work with the triangle here ...
}
Now, the behaviour I'm observing is that with OpenMP enabled, the whole process is slower.
I've added some timers to my code to track time spent in the I/O method, and time spent in the loop itself.
Without OpenMP:
Total IO IN time : 41.836 s.
Total algorithm time : 15.495 s.
With OpenMP:
Total IO IN time : 48.959 s.
Total algorithm time : 44.61 s.
My guess is that, since the reading is in a critical section, the threads are just waiting for each other to finish using the file handler, resulting in a longer waiting time.
Any pointers on how to resolve this? My program would really benefit from being able to process the triangles it reads in parallel. I've tried toying with thread scheduling and related settings, but that doesn't seem to help a lot in this instance.
Since I'm working on an out-of-core algorithm, introducing a buffer to hold a multitude of triangles is not really an option.
So, the solution I would propose is based on a master/slave strategy, where:
the master (thread 0) performs all the I/O
the slaves do some work on the retrieved data
The pseudo-code would read something like the following:
#include <omp.h>

vector<vec3> v0;
vector<vec3> v1;
vector<vec3> v2;
vector<vec3> normal;
vector<int> tdone;
int nthreads;
int triangles_read = 0;
/* ... */
#pragma omp parallel shared(tri_data)
{
    int id = omp_get_thread_num();
    /*
     * Initialize all the buffers in the master thread.
     * Notice that the size in memory is similar to your example.
     */
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        v0.resize(nthreads);
        v1.resize(nthreads);
        v2.resize(nthreads);
        normal.resize(nthreads);
        tdone.resize(nthreads, 1);
    }
    if (id == 0) { // Producer thread
        int next = 1;
        while (triangles_read != ntriangles) {
            if (tdone[next]) { // If the next thread is free
                readTriangle(tri_data, v0[next], v1[next], v2[next], normal[next]); // Read data and fill the correct buffer
                triangles_read++;
                tdone[next] = 0;  // Set a flag for thread next to start working
                #pragma omp flush // Make tdone and triangles_read visible to the other threads
            }
            next = next % (nthreads - 1) + 1; // Set next
        } // while
    } else { // Consumer threads
        while (true) { // Wait for work
            if (tdone[id] == 0) {
                /* ... do work here on v0[id], v1[id], v2[id], normal[id] ... */
                tdone[id] = 1;    // Signal that this thread is free again
                #pragma omp flush // Flush it
            }
            if (tdone[id] == 1 && triangles_read == ntriangles) break; // Work finished for all
        }
    }
    #pragma omp barrier
}
I am not sure if this is still valuable to you but that was a nice teaser anyhow!

D taskpool wait untill all tasks are done

This is in relation to my previous question: D concurrent writing to buffer
Say you have a piece of code that consists of 2 consecutive code blocks A and B, where B depends on A. This is very common in programming. Both A and B consist of a loop, where each iteration can be run in parallel:
double[] array = [ ... ]; // has N elements

// A
for (int i = 0; i < N; i++)
{
    job1(array[i]); // new task
}

// wait for all job1's to be done

// B
for (int i = 0; i < N; i++)
{
    job2(array[i]); // new task
}
B can only be executed when A is finished. How do I wait till all tasks of A are finished before executing B?
I assume you're using std.parallelism? I wrote std.parallelism, so I'll let you in on a design decision. There was actually a join function in some of the betas of std.parallelism. It waited until all tasks were finished and then shut down the task pool. I removed it because I realized it was useless.
The reason is that if you're manually creating a set of O(N) task objects to iterate over some range, you're misusing the library. You should be using a parallel foreach loop instead, which automatically joins before it releases control back to the calling thread. Your example would become:
foreach (ref elem; parallel(array)) {
    job1(elem);
}

foreach (ref elem; parallel(array)) {
    job2(elem);
}
In this case job1 and job2 should not start a new task because the parallel foreach loop is already using enough tasks to fully utilize all CPU cores.