C++ OpenMP for-loop global variable problems

I'm new to OpenMP. From what I have read about OpenMP 2.0, which ships with Microsoft Visual Studio 2010, global variables are considered troublesome and error-prone when used in parallel programming. I have come to share that view, since I have found very little on how to deal with global variables and static global variables efficiently, or at all for that matter.
I have this snippet of code which runs, but because of the local variable created in the parallel block I don't get the answer I'm looking for: I get 8 different printouts (because that's how many threads my PC has) instead of 1 answer. I know it's because of the local "list" variable created in the parallel block, but if I move "list" out and make it a global variable, the code runs yet never gives me an answer back. This is the sample code that I would like to modify to use a global "list" variable:
#pragma omp parallel
{
    vector<int> list;
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        list.push_back(i);
    }
    cout << list.size() << endl;
}
Output:
6250
6250
6250
6250
6250
6250
6250
6250
They add up to 50000, but I did not get a single answer of 50000; instead it's divided up.
Solution:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        cout << i << endl;
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;

According to the MSDN documentation, the parallel directive
Defines a parallel region, which is code that will be executed by
multiple threads in parallel.
And since the list variable is declared inside this region, every thread will have its own list.
On the other hand, the for pragma
Causes the work done in a for loop inside a parallel region to be
divided among threads.
So the 50000 iterations will be split among threads but each thread will have its own list.
I think what you are trying to do can be achieved by:
Taking the list definition outside the "parallel" section.
Protecting the list.push_back statement with a critical section.
Try this:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
I don't think you should get any speedup from OpenMP in this case because there will be contention for the critical section. A faster solution for this (if you don't care about the order of elements) would be for every thread to have its own list, and get those lists merged after the loop finishes. The implementation using std::list instead of std::vector would look cleaner in this case (because you wouldn't have to copy arrays).
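A minimal sketch of that per-thread approach (assuming the final element order does not matter; std::list is used so the merge is a cheap O(1) splice rather than a copy):
#include <omp.h>
#include <iostream>
#include <list>
using namespace std;

int main()
{
    list<int> all;                    // single shared result
    #pragma omp parallel
    {
        list<int> local;              // one private list per thread, no sharing inside the loop
        #pragma omp for nowait
        for (int i = 0; i < 50000; i++)
            local.push_back(i);
        #pragma omp critical          // merge once per thread, not once per element
        all.splice(all.end(), local); // O(1): just relinks the nodes
    }
    cout << all.size() << endl;       // prints 50000
    return 0;
}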
Some apps are memory bound and not compute bound. Bottom line: check if you actually get a speedup from OpenMP.

Why do you need the first pragma here (#pragma omp parallel)? I think that's the issue.

Related

Wait for the whole omp block to be finished before calling the second function

I don't have experience with OpenMP in C++ and I would like to learn how to solve my problem properly. I have 30 files that need to be processed independently by the same function. Each time the function is called, a new output file (out01.txt to out30.txt) is generated to save the results. My machine has 12 processors and I would like to use 10 for this problem.
I need to change my code so it waits for all 30 files to be processed before executing other routines in C++. At the moment, I'm not able to force my code to wait for the whole omp scope to be executed before moving on to the second function.
Please find below a draft of my code.
int W = 10;
int i = 1;
ostringstream fileName;
int th_id, nthreads;
omp_set_num_threads(W);
#pragma omp parallel shared (nFiles) private(i,fileName,th_id)
{
    #pragma omp for schedule(static)
    for ( i = 1; i <= nFiles; i++)
    {
        th_id = omp_get_thread_num();
        cout << "Th_id: " << th_id << endl;
        // CALCULATION IS PERFORMED HERE FOR EACH FILE
    }
}
// THIS is the point where the program should wait for the whole block to be finished
// Calling the second function ...
Both the "omp for" and "omp parallel" pragmas have an implicit barrier at the end of its scope. Therefore, the code after the parallel section can't be executed until the parallel section has concluded. So your code should run perfectly.
If there is still a problem then it isn't because your code isn't waiting at the end of the parallel region.
Please supply us with more details about what happens during execution of this code. This way we might be able to find the real cause of your problem.
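For reference, a minimal sketch (with placeholder work in place of the file processing) showing that the statement after the parallel region only runs once every iteration has finished:
#include <omp.h>
#include <iostream>

int main()
{
    const int nFiles = 30;
    omp_set_num_threads(10);
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= nFiles; i++)
    {
        // placeholder: process file i and write its outXX.txt here
        std::cout << "file " << i << " done by thread " << omp_get_thread_num() << "\n";
    }
    // implicit barrier: all iterations are finished before execution reaches this line
    std::cout << "all files processed, calling the second function" << std::endl;
    return 0;
}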

OpenMP doesn't launch threads

OpenMP used to work on my project with 6 threads, and now (I have no idea why) the program is single-threaded.
My code is pretty simple. I only use OpenMP in one cpp file, where I include
#include <omp.h>
and the function to parallelize is:
#pragma omp parallel for collapse(2) num_threads(IntervalMapEstimator::m_num_thread)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        //use for debug
        omp_set_num_threads (5);
        std::cout << "omp_get_num_threads = " << omp_get_num_threads () << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads () << std::endl;
        if(split_points) {
            extract_relevant_points_from_angle_lists(relevant_points, pointcloud_ff_polar_angle_lists, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        } else {
            extract_relevant_points_multithread_with_localvector(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        }
    }
}
omp_get_num_threads returns 1
omp_get_max_threads returns 5
IntervalMapEstimator::m_num_thread is set to 6
Any lead would be greatly appreciated.
EDIT 1:
I modified my code but the problem remains; the program is still running on a single thread.
omp_get_num_threads returns 1
omp_get_max_threads returns 8
Is there a way to know how many threads are available at run time?
#pragma omp parallel for collapse(2)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        std::cout << "omp_get_num_threads = " << omp_get_num_threads () << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads () << std::endl;
        extract_relevant_points(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
    }
}
I just saw that my computer is beginning to run low on memory; could that be part of the problem?
According to https://msdn.microsoft.com/en-us/library/bx15e8hb.aspx:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You request 6 threads, the implementation can only provide 5, so it is free to do what it wants.
I'm also pretty sure you are not supposed to change the number of threads while inside a parallel region, so your omp_set_num_threads will do nothing at best and blow up in your face at worst.
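As a sketch of where those calls belong: request the thread count before the parallel region (or via the num_threads clause), and only query it from inside:
#include <omp.h>
#include <iostream>

int main()
{
    omp_set_num_threads(5);          // set the requested team size BEFORE the region...
    #pragma omp parallel             // ...or equivalently: #pragma omp parallel num_threads(5)
    {
        #pragma omp single
        std::cout << "team size: " << omp_get_num_threads() << std::endl;
    }
    return 0;
}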
I found the answer thanks to another post: Why does the compiler ignore OpenMP pragmas?
In the end it was simply a library flag that I hadn't added to the compiler options, and I didn't notice it because I compile with cmake, so I don't type the compiler line directly. Also, I use catkin_make to compile, so I only see errors, not warnings, in the console.
So basically, to use OpenMP you have to pass -fopenmp to your compiler, and if you don't... well, the pragmas are just ignored by the compiler.
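A quick way to check from within the code whether the compiler actually enabled OpenMP; the _OPENMP macro is only defined when the flag is in effect (small self-contained sketch):
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    std::cout << "OpenMP is enabled, _OPENMP = " << _OPENMP
              << ", max threads = " << omp_get_max_threads() << std::endl;
#else
    std::cout << "OpenMP is NOT enabled: the pragmas are silently ignored" << std::endl;
#endif
    return 0;
}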

How to write to file from different threads, OpenMP, C++

I use OpenMP to parallelize my C++ program. My parallel code has a very simple form:
#pragma omp parallel for shared(a, b, c) private(i, result)
for (i = 0; i < N; i++){
    result = F(a,b,c,i); //do some calculation
    cout << i << " " << result << endl;
}
If two threads try to write output simultaneously, the data gets mixed up.
How can I solve this problem?
OpenMP provides pragmas to help with synchronisation. #pragma omp critical allows only one thread to be executing the attached statement at any time (a mutual exclusion critical region). The #pragma omp ordered pragma ensures loop iteration threads enter the region in order.
// g++ -std=c++11 -Wall -Wextra -pedantic -fopenmp critical.cpp
#include <iostream>
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        std::cout << "unsynchronized(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        #pragma omp critical
        std::cout << "critical(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for ordered
    for (int i = 0; i < 20; ++i)
        #pragma omp ordered
        std::cout << "ordered(" << i << ") ";
    std::cout << std::endl;

    return 0;
}
Example output (different each time in general):
unsynchronized(unsynchronized(unsynchronized(05) unsynchronized() 6unsynchronized() 1unsynchronized(7) ) unsynchronized(unsynchronized(28) ) unsynchronized(unsynchronized(93) ) unsynchronized(4) 10) unsynchronized(11) unsynchronized(12) unsynchronized(15) unsynchronized(16unsynchronized() 13unsynchronized() 17) unsynchronized(unsynchronized(18) 14unsynchronized() 19)
critical(5) critical(0) critical(6) critical(15) critical(1) critical(10) critical(7) critical(16) critical(2) critical(8) critical(17) critical(3) critical(9) critical(18) critical(11) critical(4) critical(19) critical(12) critical(13) critical(14)
ordered(0) ordered(1) ordered(2) ordered(3) ordered(4) ordered(5) ordered(6) ordered(7) ordered(8) ordered(9) ordered(10) ordered(11) ordered(12) ordered(13) ordered(14) ordered(15) ordered(16) ordered(17) ordered(18) ordered(19)
The problem is that you have a single resource all threads try to access. Such a single resource must be protected against concurrent access (thread-safe resources do this too, just transparently for you; by the way, here is a nice answer about the thread safety of std::cout). You could protect this single resource, e.g. with a std::mutex. The problem then is that the threads have to wait for the mutex until the other thread releases it again, so you will only profit from parallelisation if F is a very complex function.
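A minimal sketch of that std::mutex variant (the i*i computation is just a stand-in for your F(a, b, c, i)):
#include <iostream>
#include <mutex>

int main()
{
    const int N = 20;
    std::mutex cout_mutex;                            // protects the single shared resource, std::cout
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        double result = i * i;                        // stand-in for F(a, b, c, i)
        std::lock_guard<std::mutex> lock(cout_mutex); // only one thread prints at a time
        std::cout << i << " " << result << "\n";
    }
    return 0;
}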
Further drawback: as the threads work in parallel, even with a mutex protecting std::cout, the results can be printed in arbitrary order, depending on which thread happens to operate earlier.
If I may assume that you want the results of F(... i) for smaller i before the results of greater i, you should either drop parallelisation entirely or do it differently:
Provide an array of size N and let each thread store its results there (array[i] = f(i);). Then iterate over the array in a separate non-parallel loop. Again, doing so is only worth the effort if F is a complex function (and for large N).
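A sketch of that array pattern (again with i*i standing in for F; each index is written by exactly one thread, and the ordered output happens in the sequential loop afterwards):
#include <iostream>
#include <vector>

int main()
{
    const int N = 20;
    std::vector<double> results(N);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        results[i] = i * i;           // stand-in for F(a, b, c, i); no two threads touch the same index
    for (int i = 0; i < N; i++)       // ordinary sequential loop: output comes out in order
        std::cout << i << " " << results[i] << "\n";
    return 0;
}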
Additionally: Be aware that threads must be created, too, which causes some overhead somewhere (creating the thread infrastructure and stacks, registering the threads with the OS, ... unless you can reuse threads already created in a thread pool earlier). Consider this, too, when deciding whether to parallelise. Sometimes, non-parallel calculations can be faster.

OpenMP parallel-for efficiency query

Please consider the following simple code for summing up values in a parallel for loop:
int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
When I run this on a two-core machine, the output I get is
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on each loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding and then update nTotalSum with this 'local sum' once afterwards.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
int nLocalSum = 0;
for (int i = 0; i < 4; i++)
{
nLocalSum += i;
}
nTotalSum += nLocalSum;
...but the compiler complained stating that it was expecting a for loop following the pragma omp parallel for statement...
Your output does not, in fact, indicate a critical section during the loop. Each thread has its own zero-initialized copy, thread 0 working on i = 0,1 and thread 1 working on i = 2,3. At the end, OpenMP takes care of adding the local copies to the original.
You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.
Your manual version would work if you split the parallel / for into two directives:
int nTotalSum = 0;
#pragma omp parallel
{
    // Declare the local variable here!
    // Then it's implicitly private and properly initialized
    int localSum = 0;
    #pragma omp for
    for (int i = 0; i < 4; i++) {
        localSum += i;
        cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
    }
    // Do not forget the atomic, or it would be a race condition!
    // An alternative would be critical, but that's less efficient
    #pragma omp atomic
    nTotalSum += localSum;
}
I think it's likely that your OpenMP implementation does the reduction just like that.
Each OMP thread has its own copy of nTotalSum. At the end of the OMP section these are combined back into the original nTotalSum. The output you're seeing comes from running loop iterations (0,1) in one thread, and (2,3) in another thread. If you output nTotalSum at the end of your loop, you should see the expected result of 6.
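A minimal complete version for reference; with the reduction clause the per-thread copies are combined after the loop, so the final line prints 6:
#include <omp.h>
#include <iostream>

int main()
{
    int nTotalSum = 0;
    #pragma omp parallel for reduction(+:nTotalSum)
    for (int i = 0; i < 4; i++)
        nTotalSum += i;                               // each thread adds into its own private copy
    std::cout << "final: " << nTotalSum << std::endl; // copies have been combined: prints 6
    return 0;
}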
In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line. The for loop must be on the line immediately following the pragma.
From my parallel programming in OpenMP book:
The reduction clause can be trickier to understand; it has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications... reduction allows it to be implemented by the compiler efficiently... this is such a common operation that OpenMP has the reduction data scope clause just to handle it... the most common example is final summation of temporary local variables at the end of the parallel construct.
Correction to your second example:
int total_sum = 0; /* do all variable initialization prior to the omp pragma */
#pragma omp parallel for reduction(+:total_sum)
for (int i = 0; i < 4; i++)
{
    total_sum += i; /* you used nLocalSum here */
}
/* At this point in the code, all threads have finished their share of the `for` loop,
   each accumulating into its own private copy of `total_sum`. Because we used reduction,
   OpenMP then '+'s the per-thread values of `total_sum` together.
   Do not do an explicit nTotalSum += nLocalSum after the omp for loop;
   it's not needed, the reduction clause takes care of this. */
In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum), but I suspect the weird output might be caused by print buffering.
In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.
Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, since all threads in your example are just doing a summation, the most straightforward way would be to make total_sum a shared variable in the parallel section and have all threads add into it; the answer would still be correct at the end, provided the update is protected with a critical directive.

Labeling data for Bag Of Words

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically, these #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
    ...
    #pragma omp critical
    {
        if(classes_training_data.count(class_) == 0) { //not yet created...
            classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
            classes_names.push_back(class_);
        }
        classes_training_data[class_].push_back(response_hist);
    }
    total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?
The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's see them:
#pragma omp parallel starts a parallel block with a set of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the for-loop between all the threads inside the parallel block and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
    #pragma omp for
    for(int n(0); n < 5; ++n) {
        std::cout << "Hello\n";
    }
}
This will create a parallel block that executes a for-loop. The threads will print "Hello" to the standard output five times, in no specified order (i.e., thread #3 can print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies: static (the default) and dynamic. Check this awesome answer regarding scheduling policies.
Now, all of these pragmas can be shortened to one:
#pragma omp parallel for schedule(dynamic,3)
which will create a parallel block that executes the for-loop with dynamic scheduling; each thread in the block will execute 3 iterations of the loop before asking the scheduler for another chunk.
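A small self-contained sketch of that chunking (each thread grabs 3 consecutive iterations at a time; which thread gets which chunk varies from run to run):
#include <omp.h>
#include <cstdio>

int main()
{
    #pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < 12; i++)
        std::printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}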
The critical pragma restricts the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
    if(classes_training_data.count(class_) == 0) { //not yet created...
        classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
        classes_names.push_back(class_);
    }
    classes_training_data[class_].push_back(response_hist);
}
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are specified in the tutorial, just look before your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);