How can I execute two for loops in parallel in C++?

In C++ I would like two for loops to execute at the same time and not have one wait for the other one to go first or wait for it to end.
I would like the two (or more) loops to finish in roughly the same time it would take a single loop of the same size to finish.
I know this has been asked and answered before, but not with an example this simple. I'm hoping to solve this specific problem. I've tried combinations of pragma omp directives from code examples and couldn't get the result I wanted.
#include <iostream>
using namespace std;
#define N 5

int main(void) {
    int i;
    for (i = 0; i < N; i++) {
        cout << "This is line ONE \n";
    }
    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) {
        cout << "This is line TWO \n";
    }
}
Compiling
$ g++ parallel.cpp -fopenmp && ./a.out
The output of the code is this, and it takes as long as running the two loops one after the other...
This is line ONE
This is line ONE
This is line ONE
This is line ONE
This is line ONE
This is line TWO
This is line TWO
This is line TWO
This is line TWO
This is line TWO
The output I would like is below. The lines don't have to alternate exactly like this, but I would think they would if both loops were reaching their print statements at the same times. What I really need is for the loops to start and finish at the same time (with the loops being equal).
This is line ONE
This is line TWO
This is line ONE
This is line TWO
This is line ONE
This is line TWO
This is line ONE
This is line TWO
This is line ONE
This is line TWO
There's this Q&A here, but I don't quite understand the undeclared foo or the //do stuff with item parts. What kind of stuff? What item? I haven't been able to extrapolate from the examples online to make what I need happen.

As already mentioned in the comments, OpenMP may not be the best tool for this, but if you wish to do it with OpenMP, I suggest the following:
Use sections to start 2 threads, and communicate between the threads through shared variables. The important thing is to use atomic operations to read (#pragma omp atomic read seq_cst) and to write (#pragma omp atomic write seq_cst) these variables. Here is an example:
#pragma omp parallel num_threads(2)
#pragma omp sections
{
    #pragma omp section
    {
        // This is the sensor-controlling part
        while (exit_condition)
        {
            sensor_state = read_sensor();
            // Read the current state of the motor from the other thread
            #pragma omp atomic read seq_cst
            motor_state = shared_motor_state;
            // Based on the motor state and the sensor state, send
            // a command to the other thread to control the motor,
            // or wait for the motor to be ready in a loop, etc.
            #pragma omp atomic write seq_cst
            shared_motor_command = /* whatever you wish */;
        }
    }
    #pragma omp section
    {
        // This is the motor-controlling part
        while (exit_condition)
        {
            // Read the motor command from the other thread
            #pragma omp atomic read seq_cst
            motor_command = shared_motor_command;
            // Do whatever you have to based on the motor command.
            // You can set the state of the motor with the following line:
            #pragma omp atomic write seq_cst
            shared_motor_state = /* what you need to pass to the other thread */;
        }
    }
}
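Applied to the original two print loops, a minimal sketch of the same idea could look like this (my own illustration, not the asker's code; printf is used so each line comes out whole, and whether the ONE/TWO lines alternate strictly depends on how the two threads are scheduled):
#include <cstdio>
#define N 5

int main() {
    // Each section is executed by its own thread, so the two loops
    // run concurrently instead of one after the other.
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        for (int i = 0; i < N; i++)
            std::printf("This is line ONE\n");

        #pragma omp section
        for (int i = 0; i < N; i++)
            std::printf("This is line TWO\n");
    }
    return 0;
}
Compile with g++ -fopenmp as in the question.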

I think the issue is that you are not actually parallelizing two loops; instead you are parallelizing the work of one loop. If you add std::cout << "Hello from thread: " << omp_get_thread_num() << "\n"; to your second loop, you will see something like:
This is line TWO
Hello from thread: 0
This is line TWO
Hello from thread: 1
This is line TWO
Hello from thread: 2
This is line TWO
Hello from thread: 3
This is line TWO
Hello from thread: 0
Depending on how iterations are assigned to threads, with four threads being the default number (often the number of cores), the order might vary: the example (0,1,2,3,0) could just as well be (0,2,3,1,0).
So what happens is that the first loop runs serially, and then the (4 or more/fewer) threads run the second loop's iterations in parallel.
The question is whether you REALLY want to use OpenMP to parallelize your code. If so, you could do something like this:
#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 10; i++) {
        int tid = omp_get_thread_num();
        if (tid % 2 == 0) {
            std::cout << "This is line ONE" << "\n";
        } else {
            std::cout << "This is line TWO" << "\n";
        }
    }
    return 0;
}
Here, based on the thread ID, an even-numbered thread does task 1 and an odd-numbered thread does task 2. But as many other commenters have noted, you may want to consider plain threads (pthreads or std::thread) instead, depending on the task.
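For completeness, here is a minimal sketch of that non-OpenMP alternative using std::thread (my own illustration, assuming C++11 or later): each loop runs on its own thread, and join() makes main wait until both have finished.
#include <cstdio>
#include <thread>
#define N 5

int main() {
    // One thread per loop; both loops run concurrently.
    std::thread one([] {
        for (int i = 0; i < N; i++)
            std::printf("This is line ONE\n");
    });
    std::thread two([] {
        for (int i = 0; i < N; i++)
            std::printf("This is line TWO\n");
    });
    one.join();
    two.join();
    return 0;
}
Compile with g++ -std=c++11 -pthread.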

Related

How is OpenMP communicating between threads with what should be a private variable?

I'm writing some code in C++ using OpenMP to parallelize some chunks. I ran into some strange behavior that I can't quite explain. I've rewritten my code so that it reproduces the issue minimally.
First, here is a function I wrote that is to be run in a parallel region.
#include <cstdio>
#include <omp.h>

void foo()
{
    #pragma omp for
    for (int i = 0; i < 3; i++)
    {
        #pragma omp critical
        printf("Hello %d from thread %d.\n", i, omp_get_thread_num());
    }
}
Then here is my whole program.
int main()
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        for (int i = 0; i < 2; i++)
        {
            foo();
            #pragma omp critical
            printf("%d\n", i);
        }
    }
    return 0;
}
When I compile and run this code (with g++ -std=c++17 -fopenmp), I get the following output on the terminal:
Hello 0 from thread 0.
Hello 1 from thread 1.
Hello 2 from thread 2.
0
0
Hello 2 from thread 2.
Hello 1 from thread 1.
0
Hello 0 from thread 0.
0
1
1
1
1
i is a private variable. I would expect the function foo to be run twice per thread. So I would expect to see eight "Hello %d from thread %d.\n" lines in the terminal, just like I see eight numbers printed for i. So what gives here? Why does OMP behave so differently within the same loop?
It is because #pragma omp for is a worksharing construct: it distributes the loop iterations among the threads of the team, so the number of threads does not matter here, only the number of loop iterations does (2*3 = 6).
If you use omp_set_num_threads(1); you also see 6 outputs. If you use more threads than there are iterations, some threads will be idle in the inner loop, but you still see exactly 6 outputs.
On the other hand, if you remove the #pragma omp for line, you will see (number of threads)*2*3 (= 24) outputs.
From the documentation of omp parallel:
Each thread in the team executes all statements within a parallel region except for work-sharing constructs.
Emphasis mine. Since the omp for in foo is a work-sharing construct, it is only executed once per outer iteration, no matter how many threads run the parallel block in main.
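To see the contrast directly, here is a sketch of a variant of foo without the worksharing construct (my own illustration, with the hypothetical name foo_replicated); called from the same main, every thread executes the whole inner loop itself, giving 4 threads * 2 outer iterations * 3 inner iterations = 24 lines:
void foo_replicated()
{
    // No "#pragma omp for": each thread of the enclosing parallel
    // region runs all three iterations instead of sharing them out.
    for (int i = 0; i < 3; i++)
    {
        #pragma omp critical
        printf("Hello %d from thread %d.\n", i, omp_get_thread_num());
    }
}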

Wait for the whole omp block to be finished before calling the second function

I don't have experience with OpenMP in C++ and I would like to learn how to solve my problem properly. I have 30 files that need to be processed independently by the same function. Each time the function is called, a new output file is generated (out01.txt to out30.txt) saving the results. I have 12 processors in my machine and would like to use 10 for this problem.
I need to change my code to wait for all 30 files to be processed before executing other routines in C++. At the moment, I'm not able to force my code to wait for the whole omp block to be executed before moving on to the second function.
Please find below a draft of my code.
int W = 10;
int i = 1;
ostringstream fileName;
int th_id, nthreads;

omp_set_num_threads(W);
#pragma omp parallel shared(nFiles) private(i, fileName, th_id)
{
    #pragma omp for schedule(static)
    for (i = 1; i <= nFiles; i++)
    {
        th_id = omp_get_thread_num();
        cout << "Th_id: " << th_id << endl;
        // CALCULATION IS PERFORMED HERE FOR EACH FILE
    }
}
// THIS is the point where the program should wait for the whole block to be finished
// Calling the second function ...
Both the omp for and omp parallel pragmas have an implicit barrier at the end of their scope. Therefore, the code after the parallel section can't be executed until the parallel section has concluded, so your code should already behave as you want.
If there is still a problem, then it isn't because your code isn't waiting at the end of the parallel region.
Please supply more details about what happens during execution of this code; that way we might be able to find the real cause of your problem.
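As a minimal sketch of that structure (process_file and post_process are hypothetical stand-ins for the asker's routines), everything after the closing brace of the parallel region runs only once all threads have passed the implicit barrier:
#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the per-file work and for the
// routine that must run only after all files are processed.
static void process_file(int i) { std::printf("processed file %d\n", i); }
static void post_process()      { std::printf("all files done\n"); }

int main() {
    const int nFiles = 30;
    omp_set_num_threads(10);
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= nFiles; i++) {
        process_file(i);            // handled by one of the 10 threads
    }
    // Implicit barrier at the end of the parallel region:
    // execution only reaches this point after every iteration is done.
    post_process();
    return 0;
}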

How to write to file from different threads, OpenMP, C++

I use OpenMP to parallelize my C++ program. My parallel code has a very simple form:
#pragma omp parallel for shared(a, b, c) private(i, result)
for (i = 0; i < N; i++) {
    result = F(a, b, c, i);   // do some calculation
    cout << i << " " << result << endl;
}
If two threads try to write to the file simultaneously, the data gets mixed up.
How can I solve this problem?
OpenMP provides pragmas to help with synchronisation. #pragma omp critical allows only one thread to be executing the attached statement at any time (a mutual exclusion critical region). The #pragma omp ordered pragma ensures loop iteration threads enter the region in order.
// g++ -std=c++11 -Wall -Wextra -pedantic -fopenmp critical.cpp
#include <iostream>

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        std::cout << "unsynchronized(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        #pragma omp critical
        std::cout << "critical(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for ordered
    for (int i = 0; i < 20; ++i)
        #pragma omp ordered
        std::cout << "ordered(" << i << ") ";
    std::cout << std::endl;

    return 0;
}
Example output (different each time in general):
unsynchronized(unsynchronized(unsynchronized(05) unsynchronized() 6unsynchronized() 1unsynchronized(7) ) unsynchronized(unsynchronized(28) ) unsynchronized(unsynchronized(93) ) unsynchronized(4) 10) unsynchronized(11) unsynchronized(12) unsynchronized(15) unsynchronized(16unsynchronized() 13unsynchronized() 17) unsynchronized(unsynchronized(18) 14unsynchronized() 19)
critical(5) critical(0) critical(6) critical(15) critical(1) critical(10) critical(7) critical(16) critical(2) critical(8) critical(17) critical(3) critical(9) critical(18) critical(11) critical(4) critical(19) critical(12) critical(13) critical(14)
ordered(0) ordered(1) ordered(2) ordered(3) ordered(4) ordered(5) ordered(6) ordered(7) ordered(8) ordered(9) ordered(10) ordered(11) ordered(12) ordered(13) ordered(14) ordered(15) ordered(16) ordered(17) ordered(18) ordered(19)
The problem is that you have a single resource all threads try to access. Such single resources must be protected against concurrent access (thread-safe resources do this too, just transparently for you; by the way, here is a nice answer about thread safety of std::cout). You could protect this single resource e.g. with a std::mutex. The problem then is that the threads have to wait for the mutex until the other thread gives it back again, so you will only profit from parallelisation if F is a very complex (long-running) function.
A further drawback: as the threads work in parallel, even with a mutex protecting std::cout, the results can be printed in arbitrary order, depending on which thread happens to get there first.
If I may assume that you want the results of F(..., i) for smaller i before the results for greater i, you should either drop parallelisation entirely or do it differently:
Provide an array of size N and let each thread store its results there (array[i] = f(i);). Then iterate over the array in a separate non-parallel loop. Again, doing so is only worth the effort if F is a complex function (and for large N).
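A minimal sketch of that approach (my own illustration; F here is a trivial placeholder for the real calculation): each thread writes only into its own slots, so no locking is needed, and the ordered printing happens afterwards in a serial loop.
#include <iostream>
#include <vector>

// Placeholder for the real (presumably expensive) calculation.
static double F(int i) { return i * 0.5; }

int main() {
    const int N = 100;
    std::vector<double> results(N);

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        results[i] = F(i);        // each thread fills its own slots only

    // Serial loop: output comes out in order of i, no synchronisation needed.
    for (int i = 0; i < N; i++)
        std::cout << i << " " << results[i] << "\n";
    return 0;
}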
Additionally: be aware that threads must be created, too, which causes some overhead somewhere (creating the thread infrastructure and stack, registering the thread with the OS, ... – unless you can reuse threads already created in a thread pool earlier...). Consider this, too, when deciding whether to parallelise. Sometimes non-parallel calculations can be faster...

Labeling data for Bag Of Words

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically the #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
    ...
    #pragma omp critical
    {
        if(classes_training_data.count(class_) == 0) { //not yet created...
            classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
            classes_names.push_back(class_);
        }
        classes_training_data[class_].push_back(response_hist);
    }
    total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?
The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's see them:
#pragma omp parallel starts a parallel block with a set of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the for-loop between all the threads inside the parallel block and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
    #pragma omp for
    for(int n(0); n < 5; ++n) {
        std::cout << "Hello\n";
    }
}
This will create a parallel block that executes a for-loop. The threads will print "Hello" to the standard output five times in total, in no specified order (i.e., thread #3 may print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies: static (the default) and dynamic. Check this awesome answer in regards to scheduling policies.
Now, all of these pragmas can be combined into one:
#pragma omp parallel for schedule(dynamic,3)
which will create a parallel block that executes the for-loop with dynamic scheduling, where each thread in the block executes 3 iterations of the loop before asking the scheduler for more chunks.
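A small self-contained illustration of that combined pragma (my own example, not from the tutorial): iterations are handed out in chunks of 3, and printing the thread ID shows which thread got which chunk.
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < 12; ++i) {
        // Chunks of 3 consecutive iterations are assigned to threads
        // at run time, whenever a thread becomes free.
        std::printf("iteration %2d handled by thread %d\n",
                    i, omp_get_thread_num());
    }
    return 0;
}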
The critical pragma will restrict the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
    if(classes_training_data.count(class_) == 0) { //not yet created...
        classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
        classes_names.push_back(class_);
    }
    classes_training_data[class_].push_back(response_hist);
}
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are specified in the tutorial, just look before your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);

C++ OpenMP for-loop global variable problems

I'm new to OpenMP and from what I have read about OpenMP 2.0, which comes standard with Microsoft Visual Studio 2010, global variables are considered troublesome and error prone when used in parallel programming. I have also been adopting this feeling since I have found very little on how to deal with global variables and static global variables efficiently, or at all for that matter.
I have this snippet of code which runs, but because of the local variable created in the parallel block I don't get the answer I'm looking for: I get 8 different printouts (because that's how many threads I have on my PC) instead of 1 answer. I know it's because of the local "list" variable created in the parallel block, but this code will not run if I move the "list" variable out and make it a global variable. Actually, the code does run, but it never gives me an answer back. This is the sample code that I would like to modify to use a global "list" variable:
#pragma omp parallel
{
    vector<int> list;
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        list.push_back(i);
    }
    cout << list.size() << endl;
}
Output:
6250
6250
6250
6250
6250
6250
6250
6250
They add up to 50000, but I did not get the single answer of 50000; instead it's divided up.
Solution:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        cout << i << endl;
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
According to the MSDN Documentation the parallel clause
Defines a parallel region, which is code that will be executed by
multiple threads in parallel.
And since the list variable is declared inside this region, every thread will have its own list.
On the other hand, the for pragma
Causes the work done in a for loop inside a parallel region to be
divided among threads.
So the 50000 iterations will be split among threads but each thread will have its own list.
I think what you are trying to do can be achieved by:
Taking the list definition outside the "parallel" section.
Protecting the list.push_back statement with a critical section.
Try this:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
I don't think you should get any speedup from OpenMP in this case because there will be contention for the critical section. A faster solution for this (if you don't care about the order of elements) would be for every thread to have its own list, and get those lists merged after the loop finishes. The implementation using std::list instead of std::vector would look cleaner in this case (because you wouldn't have to copy arrays).
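A minimal sketch of that per-thread-list idea (my own illustration, using std::list so the merge is a splice rather than a copy):
#include <iostream>
#include <list>

int main() {
    std::list<int> merged;

    #pragma omp parallel
    {
        std::list<int> local;          // private to each thread, no locking needed
        #pragma omp for
        for (int i = 0; i < 50000; i++)
            local.push_back(i);

        // Merge each thread's list exactly once, under a critical section.
        #pragma omp critical
        merged.splice(merged.end(), local);
    }

    std::cout << merged.size() << std::endl;   // 50000, order not preserved
    return 0;
}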
Some apps are memory bound and not compute bound. Bottom line: check if you actually get a speedup from OpenMP.
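One simple way to check (a rough sketch using omp_get_wtime(), not a rigorous benchmark) is to time the loop and compare against a build without the pragma or without -fopenmp:
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 50000;
    std::vector<int> data(N);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = i * i;              // stand-in for the real per-element work
    double t1 = omp_get_wtime();

    std::printf("parallel loop took %f s\n", t1 - t0);
    return 0;
}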
Why do you need the first pragma here (#pragma omp parallel)? I think that's the issue.