Performance difference using a dynamic number of threads - C++

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the best performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on and off with a runtime switch.
Therefore, I figured I could use something like this:
if (shouldThread)
    omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
    omp_set_num_threads(1);
where, on my computer, the optimal number of threads is 7 in this example. Then I use this directive instead:
#pragma omp parallel for
It works well - except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can enable some sort of optimization, I guess.
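Putting the pieces together, the runtime-switch version looks roughly like this (the loop body is just a stand-in for my heavy loops):
#include <omp.h>
#include <vector>
void process(std::vector<double>& data, bool shouldThread, int optimalNumberOfThreadsForThisMachine)
{
    // Pick the thread count at runtime instead of hard-coding num_threads(7)
    omp_set_num_threads(shouldThread ? optimalNumberOfThreadsForThisMachine : 1);
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(data.size()); ++i)
        data[i] = data[i] * data[i]; // stand-in for the heavy work
}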
The code is compiled with MSVC 2013, on a Core i7-3740.

Related

OpenMP consumes all the CPU power before even compiling/running the code? (parallel for)

I was looking for a way to parallelize for loops without implementing pthread routines and so on myself, and I stumbled over OpenMP and the #pragma omp parallel for default(none) directive. Since my for loop has several variables that are "shared" (some integer values, and also some arrays where I store what I calculate in the loop at the respective index position), I added shared(variable1, variable2, ...) and so on. However, by doing so I noticed that the warnings in CLion which highlight the shared variables don't go away. Furthermore, I noticed that when I put the shared clause in my code, all six of my CPU cores become busy, mostly at 100 percent usage.
This seems very odd to me, since I haven't even compiled the code yet. The cores start working as soon as I add the shared() clause with some variables to the code.
I have never worked with OpenMP, so I don't know whether I might be using it wrong. It would be great if someone could help me out with that, or give a hint why this happens.
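For reference, the kind of loop I'm describing looks roughly like this (the variable names are just placeholders):
#include <vector>
void fillResults(std::vector<double>& results, int n, double scale)
{
    // default(none) forces every variable used inside the loop to be listed explicitly
    #pragma omp parallel for default(none) shared(results, n, scale)
    for (int i = 0; i < n; ++i)
        results[i] = scale * i; // store each value at its own index, as in my real code
}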
Edit:
For clarification: by warnings, I mean that the IDE underlines in red all the variables that seem to be shared. The CPU consumption comes from the IDE itself when I add the shared() clause to the code. However, I have no clue why adding this clause would consume so much CPU.

Image stitching is taking more time with Seamfinder and Exposurecompensator

I am trying to stitch images, and the code I am working on uses SeamFinder and ExposureCompensator along with other functions. While running the code, these two take up most of the time. Is there an alternative, or a way to improve the performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
The above are the two calls that take the most time.
Please help me in solving the problem.
Thanks in advance.
Image stitching via OpenCV is known to be slow in many cases. Maybe you can give OpenMP a shot here and counter the delay you are facing by using parallelization.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
Loops whose iterations have nothing to do with each other are a prime target for parallelization. OpenMP exploits this common program characteristic, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few compiler directives to your source code.
If you are running a loop in which sets of images are stitched, you can make the stitching of each set run in parallel.
#pragma omp parallel for
for( ... )
{
// Image-stitching algorithms go here.
}
The compiler directive #pragma omp parallel for tells the compiler to auto-parallelize the for loop with OpenMP.
For non-loop code, or just independent sections of code, you can do something of this sort:
#pragma omp parallel sections
{
#pragma omp section
{
DoSomething();
}
#pragma omp section
{
DoSomethingElseParallely();
}
}
I know this answer might not directly solve your problem, but it might give you some avenues to dig into.
You can read more about OpenMP loop parallelism and OpenMP: Sections before using them.

Translating Intel's #pragma offload to OpenMP for Xeon Phi (performance issues and other questions)

I use Intel C++ compiler 17.0.01, and I have two code blocks.
The first code block allocates memory on Xeon Phi like this:
#pragma offload target(mic:1) nocopy(data[0:size]: alloc_if(1) free_if(0))
The second block evaluates the above memory and copies it back to the host:
#pragma offload target(mic:1) out(data[0:size]: alloc_if(0) free_if(0))
This code runs just fine, but #pragma offload is part of Intel's compiler only (I think). So, I want to convert it to OpenMP.
This is how I translated the first block to OpenMP:
#pragma omp target device(1) map(alloc:data[0:size])
And this is how I translated the second block to OpenMP:
#pragma omp target device(1) map(from:data[0:size])
Also, I used export OFFLOAD_REPORT=2 in order to get a better idea of what is going on at runtime.
Here are my problems/questions:
The OpenMP version of the first code block is as fast as the Intel version (#pragma offload). Nothing strange here.
The OpenMP version of the second code block is 5 times slower than the Intel version. However, the MIC_TIME of the two is the same, but the CPU_TIME is different (OpenMP version much higher). Why is that?
Are my Intel directives optimal?
Is my Intel -> OpenMP translation correct and optimal?
And here are some other, a bit different, questions:
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
I also tried my software (the OpenMP version) on a machine without a Xeon Phi, and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) ignored?
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure on which card my software is running)?
The second OpenMP code block allocates memory again. You should map the data to a device data environment by enclosing both blocks in #pragma omp target data map(from:data[0:size]), or just add #pragma omp target enter data map(alloc:data[0:size]) before the first block.
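For illustration, a rough sketch of the unstructured-mapping option might look like this (the loop bodies are placeholders, and here the copy-back is done with target exit data):
#include <omp.h>
void offload_example(double* data, int size)
{
    // Create data on the device once; nothing is copied to or from the host here
    #pragma omp target enter data device(1) map(alloc: data[0:size])
    // First block: fill data directly on the device
    #pragma omp target device(1)
    for (int i = 0; i < size; ++i)
        data[i] = i;
    // Second block: more device-side work on the same buffer
    #pragma omp target device(1)
    for (int i = 0; i < size; ++i)
        data[i] *= 2.0;
    // Copy the results back to the host and release the device buffer
    #pragma omp target exit data device(1) map(from: data[0:size])
}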
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
AFAIK, device(0) means the default card, device(1) means the first card, and device(2) is the second card.
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
Because liboffload does this (liboffload is a runtime library used by both gcc and icc). However the OpenMP standard doesn't guarantee such behaviour.
I also tried my software (the OpenMP version) on a machine without a Xeon Phi, and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) ignored?
Yes. Not sure about the standard, but offloading in icc and gcc is implemented this way.
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure on which card my software is running)?
OpenMP 4.5 provides only the omp_is_initial_device() function to distinguish between the host and the accelerator. Maybe there is some Intel-specific interface to do it.
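For example, a minimal check could look like this (it only tells you host vs. device, not which card):
#include <omp.h>
#include <cstdio>
int main()
{
    int ran_on_host = 1;
    // Map the flag back so the host can see where the region actually executed
    #pragma omp target device(1) map(from: ran_on_host)
    {
        ran_on_host = omp_is_initial_device(); // 0 on an accelerator, non-zero on the host
    }
    std::printf(ran_on_host ? "fell back to the host\n" : "ran on a device\n");
    return 0;
}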

Increased system CPU time in non-threaded regions of OpenMP code

I am experiencing a very peculiar issue. I am parallelizing parts of a very large code (about 600MB of source code) using OpenMP, but I'm not getting the speedups I expected. I profiled many parts of the code using omp_get_wtime() to get reliable estimates, and I discovered the following: the OpenMP parts of my code do have better wall times when I use more threads; however, when I turn the threads on, the program shows increased system CPU time in parts of the code where there are no omp pragmas at all (e.g., time spent in IPOPT). I grepped for pragma to make sure. Just to clarify, I know for sure from my profiling and code structure that the threaded parts and the parts where I get the slowdowns run at completely different times.
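For reference, the measurements are simple omp_get_wtime() brackets around the sections, roughly like this (the function names are placeholders for the real routines):
#include <omp.h>
#include <cstdio>
void solve_with_ipopt();  // placeholder: third-party part, no omp pragmas anywhere
void run_parallel_part(); // placeholder: the OpenMP-parallelized part
void profile_run()
{
    double t0 = omp_get_wtime();
    solve_with_ipopt();
    double t1 = omp_get_wtime();
    run_parallel_part();
    double t2 = omp_get_wtime();
    std::printf("ipopt: %.3f s, parallel part: %.3f s\n", t1 - t0, t2 - t1);
}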
My code consists of two main parts. The first is the base code (about 25MB of code), which I compile using -fopenmp, and the second is third-party code (the remaining 575MB) which is not compiled with the OpenMP flag.
Can anyone think of a reason why something like this could be happening? I can't fathom it being resource contention since the slowdowns are in the non-threaded parts of the code.
I am running my code on an Intel i7-4600U (2 physical cores, 4 hardware threads), compiled with clang++-3.8 on Ubuntu 14.04 with -O3 optimizations.
Any ideas would be great!

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program runs on a 1-core machine?
Do I need to disable threading at runtime:
Check the number of cores
If cores > 1, use OpenMP
Else, ignore the OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (int i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely use only one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP uses only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but it is normally more flexible to control this from outside via the environment variable.
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know your code will be run on a single core, it may be worth compiling without OpenMP as well, and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of the serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1);
Just remember that even on a single core you can sometimes get some benefit from OpenMP; it depends on the specifics of your case.
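A minimal sketch of that check might look like this (whether capping to one thread actually helps depends on your loops):
#include <omp.h>
void configure_threads()
{
    // Fall back to a single thread when only one processor is available
    if (omp_get_num_procs() < 2)
        omp_set_num_threads(1);
}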