Labeling data for Bag Of Words - c++

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically, the #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
    ...
    #pragma omp critical
    {
        if(classes_training_data.count(class_) == 0) { //not yet created...
            classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
            classes_names.push_back(class_);
        }
        classes_training_data[class_].push_back(response_hist);
    }
    total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?

The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's see them:
#pragma omp parallel starts a parallel block with a team of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the for-loop among all the threads inside the parallel block, and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
    #pragma omp for
    for(int n(0); n < 5; ++n) {
        std::cout << "Hello\n";
    }
}
This will create a parallel block that will execute a for-loop. The threads will print "Hello" to the standard output five times, in no specified order (I mean, thread #3 can print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies: static (the default) and dynamic. Check this awesome answer regarding scheduling policies.
Now, all of these pragmas can be combined into one:
#pragma omp parallel for schedule(dynamic,3)
which creates a parallel block that executes the for-loop with dynamic scheduling: each thread in the block executes a chunk of 3 iterations and then asks the scheduler for another chunk.
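For example, here's a minimal sketch you can compile with -fopenmp (the chunk assignment and output order are decided at runtime, so they vary between runs):
#include <cstdio>
#include <omp.h>
int main() {
    // Chunks of 3 iterations are handed out at runtime to whichever thread is idle.
    #pragma omp parallel for schedule(dynamic,3)
    for (int n = 0; n < 12; ++n)
        printf("iteration %d run by thread %d\n", n, omp_get_thread_num());
    return 0;
}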
The critical pragma restricts the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
    if(classes_training_data.count(class_) == 0) { //not yet created...
        classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
        classes_names.push_back(class_);
    }
    classes_training_data[class_].push_back(response_hist);
}
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are specified in the tutorial; just look before your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);

Related

openmp distribute threads to certain code blocks

In my program I need to divide n threads in the following way:
1) Thread 1 does its own specific work
2) The n-1 other threads do their own work
If number_of_threads == 1, then only action 1) is done.
Actions 1) and 2) are computed in parallel.
#include <iostream>
#include <omp.h>

int main(){
    int number_of_threads;
    std::cin >> number_of_threads;
    omp_set_num_threads(number_of_threads);
    #pragma omp parallel if (number_of_threads>1)
    {
        #pragma omp master // no barrier at the end of master block
        single_calc();
        #pragma omp ??(number_of_threads-1) //second block
        // this section of code is computed by n-1 threads
    }
}
I came up with the following solutions:
1) hardcode it so that the thread with id == 1 doesn't compute the second block
2) since the master thread has id == 0, I can use #pragma omp for starting from i = 1 in the second block
3) call single_calc() outside #pragma omp parallel (but I want to control the number of threads computing this block)
Is there any elegant solution to this?
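For what it's worth, here is a minimal sketch of the thread-id approach from options 1) and 2); single_calc() is from the question above, and other_work() is a hypothetical stand-in for the second block:
#include <iostream>
#include <omp.h>
void single_calc() { /* action 1), the specific work */ }
void other_work(int id) { /* action 2), done by the n-1 other threads */ }
int main() {
    int number_of_threads;
    std::cin >> number_of_threads;
    omp_set_num_threads(number_of_threads);
    #pragma omp parallel if (number_of_threads > 1)
    {
        if (omp_get_thread_num() == 0)
            single_calc();                        // one thread does action 1)
        else
            other_work(omp_get_thread_num());     // the n-1 others do action 2)
    }
    // With number_of_threads == 1 the region runs serially on thread 0,
    // so only single_calc() is executed, as required.
    return 0;
}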

set_num_threads inside parallel not working

I'm struggling to set the number of threads to 1 inside a parallel region. I put a barrier so that all threads stop at that point and I can freely set the number of threads to 1 (and there will be no other threads executing). But wherever I placed omp_set_num_threads(1), omp_get_num_threads() always returned 3. Is it possible to change the number of threads during runtime? How can I do that?
#include <iostream>
#include <omp.h>
#include <stdio.h>

int main(){
    int num_of_threads;
    std::cin >> num_of_threads;
    omp_set_dynamic(0);
    #pragma omp parallel if(num_of_threads>1) num_threads(3)
    {
        int t_id = omp_get_thread_num();
        int t_total = omp_get_num_threads();
        printf("Current thread id: %d \n Total number_of_threads: %d \n", t_id, t_total);
        #pragma omp barrier
        #pragma omp single
        {
            omp_set_num_threads(1);
            t_id = omp_get_thread_num();
            t_total = omp_get_num_threads();
            printf("Single section \n Current thread id: %d \n Total number_of_threads: %d \n", t_id, t_total);
        }
    }
}
TL;DR You can't change the number of threads in a parallel region.
Remember this is a pool of threads which get forked at the beginning of the parallel region. Inside it they are not even synchronized (if you don't tell them to be), so OpenMP would need to terminate some of them at an unknown position - obviously a bad idea.
Your #pragma omp single makes the following code section be executed by a single thread, so there is no need to set this via omp_set_num_threads.
BUT it doesn't change your pool; it just advises the compiler to schedule the following section to one thread, while the rest ignore it.
To show this behavior, e.g. for university purposes, I would suggest printing out only the thread id in the parallel and single parts. That way you can already tell whether it's working or not.
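A minimal sketch of that experiment could look like this; the parallel line prints once per thread, the single line prints exactly once (by whichever thread reaches it first), and omp_get_num_threads() still reports 3 inside the single block:
#include <cstdio>
#include <omp.h>
int main() {
    #pragma omp parallel num_threads(3)
    {
        printf("parallel part: thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
        #pragma omp single
        printf("single part: thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}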

How should I interpret these VTune results?

I'm trying to parallelize this code using OpenMP. OpenCV (built using IPP for best efficiency) is used as an external library.
I'm having problems with unbalanced CPU usage in parallel fors, but it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but that could be necessary because of the external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the question you will find more details and data that you can download.
These are the Google Drive links to my VTune results:
c755823 basic KMP_BLOCKTIME=0 30 runs : basic hotspot with environment variable KMP_BLOCKTIME set to 0 on 30 runs of the same input
c755823 basic 30 runs : same as above, but with default KMP_BLOCKTIME=200
c755823 advanced KMP_BLOCKTIME=0 30 runs : same as first, but advanced hotspot
For those who are interested, I can send you the original code somehow.
On my Intel i7-4700MQ, the actual wall-clock time of the application, averaged over 10 runs, is around 0.73 seconds. I compile the code with icpc 2017 update 3 with the following compiler flags:
INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
In addition, I set KMP_BLOCKTIME=0 because the default value (200) was generating a huge overhead.
We can divide the code into 3 parallel regions (wrapped in a single #pragma omp parallel for efficiency) plus a preceding serial one, which is around 25% of the algorithm (and can't be parallelized).
I'll try to describe them (or you can skip to the code structure directly):
We create the parallel region up front in order to avoid the overhead of creating a new parallel region each time. The final result is to populate the rows of a matrix object, cv::Mat descriptors. We have four shared objects: (a) blurs, which is a chain of blurs (not parallelizable) computed using GaussianBlur from OpenCV (which uses the IPP implementation of Gaussian blurs); (b) hessResps (size known, say 32); (c) findAffineShapeArgs (size unknown, but on the order of thousands of elements, say 2.3k); (d) cv::Mat descriptors (size unknown, the final result). In the serial part, we populate blurs, which is a read-only vector.
In the first parallel region, hessResps is populated using blurs, without any synchronization mechanism.
In the second parallel region, findLevelKeypoints populates localfindAffineShapeArgs using hessResps as read-only input. Since the size of findAffineShapeArgs is unknown, each thread fills a local vector localfindAffineShapeArgs, which will be appended to findAffineShapeArgs in the next step.
Since findAffineShapeArgs is shared and its size is unknown, we need a critical section where each localfindAffineShapeArgs is appended to it.
In the third parallel region, each element of findAffineShapeArgs is used to generate the rows of the final cv::Mat descriptors. Again, since descriptors is shared, we need a local version cv::Mat localDescriptors.
A final critical section push_backs each localDescriptors into descriptors. Notice that this is extremely fast, since cv::Mat is "kind of" a smart pointer, so we push_back pointers.
This is the code structure:
cv::Mat descriptors;
std::vector<Mat> blurs(blursSize);
std::vector<Mat> hessResps(32);
std::vector<FindAffineShapeArgs> findAffineShapeArgs; //we don't know its size in advance
#pragma omp parallel
{
    //compute all the hessian responses
    #pragma omp for collapse(2) schedule(dynamic)
    for(int i=0; i<levels; i++)
        for (int j = 1; j <= scaleCycles; j++)
        {
            hessResps[/**/] = hessianResponse(/*...*/);
        }

    std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
    #pragma omp for collapse(2) schedule(dynamic) nowait
    for(int i=0; i<levels; i++)
        for (int j = 2; j < scaleCycles; j++)
        {
            findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); //populate localfindAffineShapeArgs with push_back
        }

    #pragma omp critical
    {
        findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
    }

    #pragma omp barrier
    #pragma omp for schedule(dynamic) nowait
    for(int i=0; i<findAffineShapeArgs.size(); i++)
    {
        findAffineShape(findAffineShapeArgs[i]);
    }

    #pragma omp critical
    {
        for(size_t i=0; i<localRes.size(); i++)
            descriptors.push_back(localRes[i].descriptor);
    }
}
At the end of the question, you can find FindAffineShapeArgs.
I'm using Intel Amplifier to see hotspots and evaluate my application.
The OpenMP Potential Gain analysis says that the potential gain with perfect load balancing would be 5.8%, so we can say that the workload is balanced between the different CPUs.
This is the CPU usage histogram for the OpenMP region (remember that this is the result of 10 consecutive runs):
So as you can see, the Average CPU Usage is 7 cores, which is good.
This OpenMP Region Duration Histogram shows that over these 10 runs the parallel region always executes in roughly the same time (with a spread of around 4 milliseconds):
This is the Caller/Callee tab:
For your information:
interpolate is called in the last parallel region
l9_ownFilter* functions are all called in the last parallel region
samplePatch is called in the last parallel region.
hessianResponse is called in the second parallel region
Now, my first question is: how should I interpret the data above? As you can see, in many of the functions, half of the time the "Effective Time by Utilization" is "ok", which would probably become "Poor" with more cores (for example on a KNL machine, where I'll test the application next).
Finally, this is the Wait and Lock analysis result:
Now, this is the first weird thing: line 276, Join Barrier (which corresponds to the most expensive wait object), is #pragma omp parallel, so the beginning of the parallel region. So it seems that someone spawned threads before. Am I wrong? In addition, the wait time is longer than the program itself (0.827s vs 1.253s of the Join Barrier that I'm talking about)! But maybe that refers to the waiting summed over all threads (and not wall-clock time, which would clearly be impossible since it's longer than the program itself).
Then, the Explicit Barrier at line 312 is #pragma omp barrier of the code above, and its duration is 0.183s.
Looking at the Caller/Callee tab:
As you can see, most of the wait time is "poor", so it refers to one thread. But I'm not sure that I'm understanding this correctly. My second question is: can we interpret this as "all the threads are waiting just for one thread that is staying behind"?
FindAffineShapeArgs definition:
struct FindAffineShapeArgs
{
    FindAffineShapeArgs(float x, float y, float s, float pixelDistance, float type, float response, const Wrapper &wrapper) :
        x(x), y(y), s(s), pixelDistance(pixelDistance), type(type), response(response), wrapper(std::cref(wrapper)) {}

    float x, y, s;
    float pixelDistance, type, response;
    std::reference_wrapper<Wrapper const> wrapper;
};
The Top 5 Parallel Regions by Potential Gain in the summary view shows only one region (the only one)
Looking at the "/OpenMP Region/OpenMP Barrier-to-Barrier" grouping, this is the order of the most expensive loops:
The 3rd loop:
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++)
is the most expensive one (as I already knew), and here's a screenshot of the expanded view:
As you can see, many functions are from OpenCV, which exploits IPP and is (should be) already optimized. Expanding the two other functions (interpolate and samplePatch) shows a [No call stack information]. Same for all the other functions (in other regions too).
The 2nd most expensive region is the second parallel for:
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
    for (int j = 2; j < scaleCycles; j++)
    {
        findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); //populate localfindAffineShapeArgs with push_back
    }
Here's the expanded view:
And finally the 3rd most expensive is the first loop:
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
    for (int j = 1; j <= scaleCycles; j++)
    {
        hessResps[/**/] = hessianResponse(/*...*/);
    }
Here's the expanded view:
If you want to know more, please use my attached VTune files or just ask!
Try reading the info in this link, especially the part about "Nested OpenMP", since Intel IPP already uses OpenMP in its implementation. From my experience with Intel IPP and OpenMP, when you are doing some other type of multi-threading and each of the created threads gets to the OpenMP calls, performance was really bad. Also, you can try using #pragma omp parallel for for each of the parallel regions instead of #pragma omp for, and getting rid of the outer #pragma omp parallel.
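A sketch of that last suggestion, shown for the first loop only (keeping the elided indices and arguments from the question as they are):
// One combined parallel-for per loop, instead of one enclosing
// "#pragma omp parallel" around several "#pragma omp for" loops.
#pragma omp parallel for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
    for (int j = 1; j <= scaleCycles; j++)
    {
        hessResps[/**/] = hessianResponse(/*...*/);
    }
// ...and likewise for the two remaining loops...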

Using OpenMP in c++ class

I'm new to OpenMP. I'm trying to use OpenMP in my C++ code. The code is too complicated, so I simplify the question as follows:
class CTet
{
    ...
    void cal_Mn(...);
};

int i, num_tet_phys;
vector<CTet> tet_phys;
num_tet_phys = ...;
tet_phys.resize(num_tet_phys);
#pragma omp parallel private(i)
for (i = 0; i < num_tet_phys; i++)
    tet_phys[i].cal_Mn(...);
I hoped that the for loop would run in parallel, but it seems that all threads run the whole loop independently; the calculation is repeated by every thread. What's the problem in my code? How can I fix it?
Thank you!
Jun
Try
#pragma omp parallel for private(i)
for (i = 0; i < num_tet_phys; i++)
    tet_phys[i].cal_Mn(...);
Note the use of parallel for.
and compile with the -fopenmp flag.
The #pragma omp parallel creates a team of threads, all of which execute the next statement (in your case, the entire for loop). After the statement, the threads join back into one.
The #pragma omp parallel for creates a team of threads, which divide the work of the for loop between them.
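The difference is easy to see in a minimal, self-contained example (compile with -fopenmp):
#include <cstdio>
#include <omp.h>
int main() {
    // Every thread runs the whole loop: with T threads, 4*T lines print.
    #pragma omp parallel
    for (int i = 0; i < 4; i++)
        printf("parallel:     thread %d, i=%d\n", omp_get_thread_num(), i);

    // The iterations are divided among the threads: exactly 4 lines print.
    #pragma omp parallel for
    for (int i = 0; i < 4; i++)
        printf("parallel for: thread %d, i=%d\n", omp_get_thread_num(), i);
    return 0;
}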

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the intel compiler let me specify that some actions in an openmp parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    //update item count
    #pragma omp atomic
    num_items_computed++;

    //update progress bar with number of items computed
    //master thread only due to com marshalling
    #pragma omp master
    set_progressor_callback(num_items_computed);

    //actual computation goes here
    ...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say, by using omp critical instead to ensure only one thread uses the callback at once), I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>

void f(){}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    //update item count
    #pragma omp atomic
    num_items_computed++;

    //update progress bar with number of items computed
    //master thread only due to com marshalling
    //#pragma omp master -- this is an error here
    //#pragma omp critical -- this works
    if (omp_get_thread_num() == 0) // may be good
        set_progressor_callback(num_items_computed);

    //actual computation goes here
    ...blah...
}
The reason why you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>

void f(){}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
If the code compiled, the following could happen:
Let's say thread 0 starts (the master thread). It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc, reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But here the team is a single thread, and there is no master present; the program wouldn't know what to do past that point.
And that's why, I think, the master isn't allowed to be inside the for-loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are the master, do this; otherwise ignore it".