Image stitching is taking too much time in SeamFinder and ExposureCompensator - C++

I am trying to stitch images. The code I am working on uses SeamFinder and ExposureCompensator along with other functions, but when I run it, these two steps take most of the time. Is there an alternative, or is there a way to improve their performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
These are the two calls that take most of the time.
Please help me solve this problem.
Thanks in advance.

Image stitching via OpenCV is known to be slow in many cases. You could give OpenMP a shot here and counter the delay you are facing by parallelizing the work.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
Loops whose iterations have nothing to do with each other are prime targets for parallelization. OpenMP exploits this common program characteristic, so it is extremely easy to make a program use multiple processors simply by adding a few lines of compiler directives to your source code.
If you are running a loop in which several sets of images are stitched, you can make sure that the stitching of each set runs in parallel.
#pragma omp parallel for
for( ... )
{
// Image-stitching algorithms go here.
}
The compiler directive #pragma omp parallel for tells the compiler to auto-parallelize the for loop with OpenMP.
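For example, if you are building several independent panoramas, each loop iteration can stitch one of them. Here is a minimal sketch, assuming a container of image sets and using OpenCV's high-level cv::Stitcher as a stand-in for your own pipeline (imageSets, panoramas and stitchAll are illustrative names, not from your code):
#include <opencv2/stitching.hpp>
#include <vector>

// Stitch several independent image sets; each iteration is self-contained,
// so OpenMP can distribute the iterations across cores.
void stitchAll(const std::vector<std::vector<cv::Mat>>& imageSets,
               std::vector<cv::Mat>& panoramas)
{
    panoramas.resize(imageSets.size());

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(imageSets.size()); ++i)
    {
        // One Stitcher instance per iteration: no shared mutable state.
        cv::Ptr<cv::Stitcher> stitcher = cv::Stitcher::create(cv::Stitcher::PANORAMA);
        cv::Stitcher::Status status = stitcher->stitch(imageSets[i], panoramas[i]);
        (void)status;  // check the status in real code
    }
}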
For non-loop code, or independent sections of code, you can do something like this:
#pragma omp parallel sections
{
#pragma omp section
{
DoSomething();
}
#pragma omp section
{
DoSomethingElseParallely();
}
}
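Applied to the two calls from your question, here is a hedged sketch, assuming your pipeline follows the usual stitching_detailed.cpp order, where compensator->feed() runs before seam_finder->find() and only reads the masks. The idea is to give the seam finder its own copy of the masks so the two expensive calls can overlap; the copy (seam_masks below, a name I introduce) is then what the later blending step should use. Whether this actually helps depends on your OpenCV build and data, so treat it as something to test, not a guaranteed fix:
// Deep-copy the masks so the seam finder can modify its own set while the
// compensator reads the originals.
std::vector<UMat> seam_masks(masks_warped.size());
for (size_t i = 0; i < masks_warped.size(); ++i)
    masks_warped[i].copyTo(seam_masks[i]);

#pragma omp parallel sections
{
    #pragma omp section
    {
        // Reads corners, images_warped and masks_warped.
        compensator->feed(corners, images_warped, masks_warped);
    }
    #pragma omp section
    {
        // Rewrites seam_masks in place; use seam_masks downstream.
        seam_finder->find(images_warped_f, corners, seam_masks);
    }
}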
I know that the answer might not directly help you out, but it might give you some avenues to dig into.
You can read more about OpenMP loop parallelism and OpenMP sections before using them.

Related

OpenMP consumes all the CPU power before even compiling/running the code? (parallel for)

I was looking for a way to parallelize for loops without implementing pthread routines and so on myself. I stumbled over OpenMP and the #pragma omp parallel for default(none) directive. Since my for loop has several variables which are "shared" (like some integer values, and also some arrays where I want to store what I calculate in the loop, at the respective index position), I have added shared(variable1, variable2, ...) and so on. However, by doing so I noticed that the warnings in CLion which highlight the shared variables won't go away. Furthermore, I noticed that when I put the shared clause in my code, all 6 of my CPU cores get busy, mostly at 100 percent usage.
This seems very odd to me, since I haven't even compiled the code yet. The cores start working as soon as I add the shared() clause with some variables to the code.
I have never worked with OpenMP, so I don't know if I may be using it wrong. It would be great if someone could help me out with that, or give a hint as to why this happens.
Edit:
For clarification: by warnings I mean that the IDE underlines in red all the variables which seem to be shared. The CPU consumption comes from the IDE itself when I add the shared() clause to the code. However, I have no clue why adding this clause would consume this much CPU.
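For reference, here is a minimal sketch of the pattern being described, with illustrative names (results, scale, fill) rather than the asker's variables; default(none) is what forces every variable used in the loop to be listed explicitly:
#include <vector>

void fill(std::vector<double>& results, int scale)
{
    int n = static_cast<int>(results.size());
    // default(none): every variable used inside must appear in a clause.
    // The loop index i is automatically private.
    #pragma omp parallel for default(none) shared(results, scale, n)
    for (int i = 0; i < n; ++i)
    {
        results[i] = scale * i;  // each iteration writes only its own index
    }
}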

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
#pragma omp parallel for num_threads(12)
for (int i=0; ...)
{
// do some heavy computation (pure memory and CPU work, no I/O, no waiting)
}
// ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
When I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5 seconds and loads the CPU to about 30%, as expected (given that I have told OpenMP to use only 12 threads).
When I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means there is almost no benefit to running multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
If I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was perfectly able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice on how to approach this? I have really come to a point where I have tried everything I can imagine, but without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.

Efficiently use OpenMP on a small loop with very large nested loops

Basically, I have a program that needs to go over several individual pictures.
I do this by:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for(picture = 0; picture < 4; picture++){
    for(int row = 0; row < 1000; row++){
        for(int col = 0; col < 1000; col++){
            //do stuff with pixel[picture][row][col]
        }
    }
}
I just want to split the work among 4 cores (one core per picture) so that each core/thread works on a specific picture. That way core 0 works on picture 0, core 1 on picture 1, and so on. The machine it is being tested on has only 4 cores as well. What is the best way to use OpenMP directives for this scenario? The one I posted is what I think would give the best performance here.
Keep in mind this is pseudocode. The goal of the program is not important; parallelizing these loops efficiently is the goal.
Just adding a simple
#pragma omp parallel for
is a good starting point for your problem. Don't bother with statically writing in how many threads it should use; the runtime will usually do the right thing.
However, it is impossible to say in general what is most efficient. There are many performance factors that are impossible to tell from your limited example. Your code may be memory bound and benefit only very little from parallelization on desktop CPUs. You may have a load imbalance, which means you need to split the work into more chunks and process them dynamically. That could be done by parallelizing the middle loop or using nested parallelism. Whether parallelizing the middle loop works well depends on the amount of work done by the inner loop (and hence the ratio of useful work to overhead). The memory layout also heavily influences the efficiency of the parallelization. Or maybe you even have data dependencies in the inner loop preventing parallelization there...
The only general recommendation one can give is to always measure, never guess. Learn to use the powerful parallel performance analysis tools that are available and incorporate them into your workflow.
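As a concrete illustration, here is the asker's pseudocode with just that single directive; treat it as a starting point to measure, not a guaranteed win:
// One chunk of work per picture; thread count left to the runtime.
#pragma omp parallel for
for (int picture = 0; picture < 4; picture++) {
    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // do stuff with pixel[picture][row][col]
        }
    }
}
// If four chunks balance poorly, "#pragma omp parallel for collapse(2)" on the
// same loop nest fuses the picture and row loops into one larger iteration
// space, which is another variant worth measuring.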

Make Eigen run in a Multi-Thread

I have some questions regarding Eigen parallelization.
To my understanding, Eigen handles its internal parallelization, but I want to activate multi-threading. I just started compiling with G++ using the flag "-fopenmp" and running my executable with OMP_NUM_THREADS=4 ./exec.
For some parts of the code that use only plain C++, I used:
#pragma omp parallel
{
}
Looking at my system monitor, I can see that more than one thread is sometimes used, but most of the time only one is. I don't know if I have to add more OpenMP code.
In the following link:
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
They mention that "in the case your application is parallelized with OpenMP, you might want to disable Eigen's own parallelization as detailed in the previous section", but I don't really understand whether I have to do that, or how to do it.
I hope I am not mixing concepts here.
My thanks in advance.
Quoting from the link you posted:
Currently, the following algorithms can make use of multi-threading: general matrix - matrix products, PartialPivLU
Thus, without knowing exactly what your program is doing, I'd hazard a guess that it does not mostly consist of large matrix - matrix products and/or PartialPivLU. This only concerns Eigen's internal parallelization; what you do within your own omp parallel blocks will probably run as expected (on multiple threads).
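To make the "disable Eigen's own parallelization" part concrete, here is a minimal sketch based on the options described on the linked page (not the asker's code): keep Eigen's internal kernels single-threaded so they do not compete with your own OpenMP regions.
#include <Eigen/Dense>

int main()
{
    // Keep Eigen's internal kernels on one thread at runtime.
    // (Alternatively, define EIGEN_DONT_PARALLELIZE before including any
    // Eigen header to switch Eigen's parallelization off at compile time.)
    Eigen::setNbThreads(1);

    #pragma omp parallel
    {
        // Each OpenMP thread works on its own, thread-local Eigen objects.
        Eigen::MatrixXd local = Eigen::MatrixXd::Random(100, 100);
        Eigen::MatrixXd prod = local * local;  // runs within this thread only
        (void)prod;
    }
    return 0;
}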

Performance difference using dynamic number of threads

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the most performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on/off using a runtime switch.
Therefore, I figured I could use something like this:
if(shouldThread)
omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
omp_set_num_threads(1);
On my computer, the optimal number of threads is 7 in this example. Then, I use this directive instead:
#pragma omp parallel for
It works well, except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can enable some sort of optimization, I guess.
The code is compiled with MSVC 2013, on a Core i7-3740.
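Not an explanation of the 50% difference, but one more variant that may be worth measuring: OpenMP's if and num_threads clauses accept runtime expressions, so the on/off switch and the machine-specific thread count can sit directly on the directive without calling omp_set_num_threads(). Names like shouldThread, optimalThreads and process are illustrative:
#include <vector>

void process(std::vector<double>& data, bool shouldThread, int optimalThreads)
{
    int n = static_cast<int>(data.size());
    // if(shouldThread) enables or disables the parallel region at runtime;
    // num_threads(optimalThreads) sets the team size when it is enabled.
    #pragma omp parallel for if(shouldThread) num_threads(optimalThreads)
    for (int i = 0; i < n; ++i)
    {
        data[i] *= 2.0;  // placeholder for the heavy per-iteration work
    }
}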