OpenMP and MKL threading - Fortran

I have a Fortran code which uses DGESVD from MKL and runs on 8 cores with the Intel compiler. The code is accelerated via OpenMP. I also know that OpenMP and MKL each have their own settings for the number of threads (OMP_NUM_THREADS and MKL_NUM_THREADS). I want to know the optimum number of threads. Am I supposed to set OMP_NUM_THREADS=1 before calling the LAPACK routine? Does the number of OpenMP threads affect the number of MKL threads?

MKL also uses OpenMP for its multithreaded driver. This means that the number of OpenMP threads does affect the number of MKL threads, but in a very intricate way.
First, being OpenMP code itself, MKL is also controlled by the usual OpenMP means of setting the number of threads, e.g. OMP_NUM_THREADS and calls to omp_set_num_threads. But it also provides override mechanisms in the form of MKL_NUM_THREADS and mkl_set_num_threads(). This allows one to have a different number of threads in the user code and in the MKL routines.
Having configured the desired number of threads, one should also know how MKL behaves in nested-parallelism cases: by default, MKL runs single-threaded if called from inside an active parallel region in the user code. MKL provides the MKL_DYNAMIC switch, which can override this behaviour, but it requires that the user code be compiled with the same OpenMP compiler as MKL (read: you must use Intel's compiler), as no compatibility is guaranteed between different OpenMP runtimes.
Generally speaking, you do not need to set the number of threads to 1 before calling the LAPACK routine; doing so would make MKL itself single-threaded too, unless the number of MKL threads has been overridden explicitly. And you should be careful when calling MKL from inside parallel regions while nested parallelism is enabled.
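For illustration, here is a minimal C++ sketch of that separation (the question is Fortran, but the corresponding calls exist there too; the matrix contents, sizes, and the compile line below are made up for the example):
// Minimal sketch: independent thread counts for user code and MKL.
// Assumes Intel MKL; compile with e.g. icpx -qopenmp -qmkl sketch.cpp
#include <mkl.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    omp_set_num_threads(8);  // threads for the user's own OpenMP regions
    mkl_set_num_threads(4);  // threads MKL may use internally

    const int n = 512;
    std::vector<double> a(n * n), s(n), u(n * n), vt(n * n), superb(n - 1);
    for (int i = 0; i < n * n; i++) a[i] = 0.5 * (i % 7);

    // Called from serial user code, DGESVD may run with up to 4 MKL threads.
    MKL_INT info = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'A', 'A', n, n,
                                  a.data(), n, s.data(), u.data(), n,
                                  vt.data(), n, superb.data());
    std::printf("info = %lld, largest singular value = %f\n",
                (long long)info, s[0]);
    return 0;
}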
Further reading about controlling the number of threads in MKL is available in the MKL User Guide:
Using Additional Threading Control (mirror of otherwise dead link)
Techniques to Set the Number of Threads


Is cv::dft transform multithreaded?

I need to boost cv::dft performance in a multithreaded environment. I've done a simple test on Windows 10 on an Intel Core i5 processor:
Here I see that the CPU is not fully loaded (only 50% usage). The individual threads are loaded equally, but are also far from 100%. Why is that, and how can I fix it? Can the DFT be easily parallelized? Is that implemented in the OpenCV library? Are there special build flags to enable it (and which)?
UPDATE: Running this code on Linux gives a slightly different result, but utilization is also below 100%:
First of all, the behavior of cv::dft depends on the OpenCV build flags; for example, if you build with WITH_IPP, it will use the Intel Integrated Performance Primitives to speed up the computation.
The FFT is memory-bound: if you simply launch more threads, you most probably won't benefit significantly from the parallelism, because the threads will be waiting for each other to finish accessing memory. I've observed this on both Linux and Windows.
To gain more performance you should use FFTW3, which has a sophisticated algorithm for multi-threaded mode (it must be configured with a special flag). I observed up to a 7x speedup with 8 threads. But FFTW's business-friendly license is paid; the free version is GPL, which imposes the GPL on your software. I have not found any other open-source component that can handle FFT parallelism in a smart way.
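For illustration, a minimal sketch of FFTW3's threaded mode (assuming FFTW was built with threading enabled; link with -lfftw3_threads -lfftw3; the transform size is arbitrary):
// Sketch: multi-threaded 2D DFT with FFTW3.
#include <fftw3.h>
#include <cstdio>

int main() {
    fftw_init_threads();         // must precede any plan creation
    fftw_plan_with_nthreads(8);  // subsequent plans may use up to 8 threads

    const int n = 1024;
    fftw_complex* in  = fftw_alloc_complex(n * n);
    fftw_complex* out = fftw_alloc_complex(n * n);
    for (int i = 0; i < n * n; i++) { in[i][0] = i % 3; in[i][1] = 0.0; }

    fftw_plan p = fftw_plan_dft_2d(n, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);             // the transform itself runs multi-threaded

    std::printf("out[0] = (%f, %f)\n", out[0][0], out[0][1]);
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup_threads();
    return 0;
}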

OpenBLAS set number of threads for one routine only

In C++17, I want to use several OpenBLAS subroutines with a different number of threads for each. Is there any way to accomplish this?
In the past, I have used openblas_set_num_threads();
to set the number of threads for my OpenBLAS subroutines. While this works, it sets the OpenBLAS thread count globally, preventing each subroutine from using a different number of threads when running in parallel. Because of this, I use the same number of threads for all of my OpenBLAS subroutines so they can run in parallel.
No way, unfortunately: it seems to be impossible so far.
Based on their user manual:
If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading
Actually, this feature is essential for most multithreaded libraries that want to use BLAS.
One easy option is to use MKL instead of OpenBLAS and use its mkl_set_num_threads_local, which plays nicely with this and gives the developer good control over threads. Look here.
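For illustration, a sketch of how mkl_set_num_threads_local can scope a thread count to a single call (the matrix arguments and the choice of 4 threads are placeholders):
// Sketch: per-call thread count with MKL's thread-local setting.
#include <mkl.h>

void multiply_with_4_threads(const double* a, const double* b, double* c, int n) {
    // Use 4 threads for this DGEMM only; remember the previous local setting.
    int prev = mkl_set_num_threads_local(4);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a, n, b, n, 0.0, c, n);
    // Restore; a previous value of 0 means "fall back to the global setting".
    mkl_set_num_threads_local(prev);
}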
A harder option is to call single-threaded OpenBLAS and implement the multithreading yourself. This works with either OpenBLAS or MKL, but it is cumbersome, and you will probably lose performance if you don't know what you are doing.
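A sketch of that harder option, assuming OpenBLAS (openblas_set_num_threads is an OpenBLAS extension declared in recent versions of its cblas.h; the batched-GEMM shape here is just an example):
// Sketch: force the BLAS single-threaded and parallelize over
// independent calls yourself with OpenMP.
#include <cblas.h>

void batched_gemm(double** a, double** b, double** c, int n, int batch) {
    openblas_set_num_threads(1);     // avoid nested threading inside the BLAS
    #pragma omp parallel for
    for (int i = 0; i < batch; i++)  // one single-threaded GEMM per OpenMP thread
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, a[i], n, b[i], n, 0.0, c[i], n);
}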
For this problem it makes no difference whether you use C++17, C++11, any other flavor of C++, or C.

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need to disable threading at runtime:
Check the number of cores
If cores > 1, use OpenMP
Else ignore the OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading to run on one core; and in situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, discussed below, that you can take as well.
When running on a single core, or even a single hardware thread - even if you change nothing, not even the number of threads your code launches - correct, deadlock-free threaded code should still run correctly, as the operating system schedules the various threads onto the core.
That said, context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work-sharing constructs to divide work between threads, treats the number of threads as a parameter and launches as many threads as there are cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (int i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely use only one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP uses only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to control this at runtime via the environment variable, so you don't have to recompile.
However, there's a downside: compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip counts may not be unrolled or vectorized as effectively, since it isn't known how many trips through the loop each thread will take. In that case, if you know your code will be run on a single core, it may be worth compiling without OpenMP as well and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing its timing to that of the serial binary.
Of course, if your threading is more complex than simple work-sharing constructs, then it becomes harder to generalize. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which is doable in OpenMP - then it's harder to say how things will work out. You may also have parallel regions which do much less work than a big computational loop; in those cases, where the overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
{
    if (omp_in_parallel()) {
        // Parallel code path
    } else {
        // Serial code path
    }
}
But now doing compilation without OpenMP becomes more complicated.
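One common pattern that keeps a single source compiling both ways is to guard the runtime calls with the standard _OPENMP macro, sketched below:
// Sketch: source that builds with or without OpenMP. _OPENMP is defined
// by the compiler only when OpenMP is enabled; the pragmas themselves
// are ignored (possibly with a warning) when it is off.
#ifdef _OPENMP
  #include <omp.h>
#else
  static inline int omp_get_max_threads() { return 1; }
  static inline int omp_in_parallel()     { return 0; }
#endif

void process(double* data, int n) {
    // Runs in parallel when built with OpenMP, serially otherwise.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

int threads_available() {
    return omp_get_max_threads();  // 1 in the non-OpenMP build
}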
Summarizing:
For big, heavy computational work, which is what OpenMP is typically used for, it probably doesn't matter; just use OMP_NUM_THREADS=1
You can test whether it does matter, given the overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will tell you how many processors are available for OpenMP to use. Then there are many ways to disable OpenMP. From your code you can call:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost from OpenMP; it depends on the specifics of your case.
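For illustration, a minimal sketch combining the two calls mentioned above (note this only caps the thread count; it does not remove OpenMP's pragmas or runtime):
// Sketch: fall back to one thread on a single-processor machine.
#include <omp.h>
#include <cstdio>

int main() {
    int procs = omp_get_num_procs();  // processors available to OpenMP
    if (procs == 1)
        omp_set_num_threads(1);       // effectively serial execution
    std::printf("procs = %d, max threads = %d\n",
                procs, omp_get_max_threads());
    return 0;
}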

Recursive parallel function not using all cores

I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
    int val = -negamax(pos[i].first, -player, depth - 1).first;
    #pragma omp critical
    if (val >= best)
    {
        best = val;
        move = pos[i].second;
    }
}
On my Intel Core i7 (4 physical cores with hyper-threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why that is? I understand the reasons the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed thread to core as it better expresses my question.
First, check whether the iteration count, pos.size(), is large enough; if there are fewer iterations than threads, some threads will receive no work.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP, unless you're using OpenMP 3.0's task, Cilk, or TBB. There are several things that need to be considered:
(1) In order to use recursive parallelism, you mostly need to explicitly call omp_set_nested(1). AFAIK, most implementations of OpenMP do not recursively spawn parallel for regions by default, because doing so may end up creating thousands of physical threads and overwhelming your operating system.
Before OpenMP 3.0's task construct, OpenMP had more or less a 1-to-1 mapping from logical parallel tasks to physical threads, so it doesn't work well for such recursive parallelism. Try it out, but don't be surprised if even thousands of threads are created!
(2) If you really want to use recursive parallelism with traditional OpenMP, you need to implement code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
    // Too many threads already active: do not use OpenMP here,
    // recurse serially instead. (get_total_thread_num() stands for a
    // counter you maintain yourself.)
    ...
} else {
    #pragma omp parallel for
    ...
}
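get_total_thread_num() above is pseudocode; one concrete way to implement such a guard, assuming OpenMP 3.0 or later, is to cap the recursion by nesting depth with omp_get_level() (nested regions still require omp_set_nested(1) to be active, as noted above):
// Sketch: bound recursive parallelism by nesting level.
#include <omp.h>

const int MAX_LEVEL = 2;  // tuning parameter, chosen here for illustration

void recurse(int depth) {
    if (depth == 0) return;
    if (omp_get_level() >= MAX_LEVEL) {
        for (int i = 0; i < 2; i++)   // deep enough: continue serially
            recurse(depth - 1);
    } else {
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < 2; i++)   // spawn a small nested team
            recurse(depth - 1);
    }
}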
(3) You may consider OpenMP 3.0's task. In your code there could be a huge number of concurrent tasks due to the recursion. To work efficiently on a parallel machine, there must be an efficient mechanism for mapping these logical concurrent tasks onto physical threads (or logical processors/cores). Raw recursive parallelism in OpenMP creates actual physical threads; OpenMP 3.0's task does not.
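For illustration, a minimal, hypothetical sketch of the task-based pattern (a toy recursive search, not the question's negamax):
// Sketch: OpenMP 3.0 tasks for recursion. Tasks are logical units of
// work mapped by the runtime onto a fixed pool of threads, so recursion
// does not multiply the number of physical threads.
int search(int depth) {
    if (depth == 0) return 1;      // stand-in for a leaf evaluation
    int left = 0;
    #pragma omp task shared(left)  // a child task explores one branch
    left = search(depth - 1);
    int right = search(depth - 1); // the current task takes the other branch
    #pragma omp taskwait           // wait for the child task to finish
    return left + right;
}

int main() {
    int result = 0;
    #pragma omp parallel           // create the thread pool once
    #pragma omp single             // one thread seeds the recursion
    result = search(10);
    return result == 1024 ? 0 : 1; // 2^10 leaves
}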
You may refer to my previous answer related to recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB support fully nested and recursive parallelism. In my small test program, the performance was far better than OpenMP 3.0's. But that was 3 years ago, so you should also check the latest OpenMP implementations.
I don't have detailed knowledge of negamax and minimax, but my gut says that using a recursive pattern and a lock is unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not a efficient serial search algorithm, and thus, it
makes little sense to parallelize it."
The optimal level of parallelism involves some additional considerations beyond simply using as many threads as are available. For example, operating systems sometimes schedule all the threads of a single process onto a single processor to optimize cache performance (unless the programmer changes that explicitly).
I guess OpenMP makes similar considerations when executing such code, so you cannot always assume that the maximum number of threads will actually be used.
Whaddya mean, all 8 available threads? A CPU like that can probably run 100s of threads! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
Has the environment variable OMP_NUM_THREADS been created and set? If it is set to 4, there's your answer: your OpenMP environment is configured to start only 4 threads, at most.
If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set then omp_set_num_threads() will override it at run time.
Whether 8 hyper-threads outperform 4 real threads.
Whether oversubscribing, e.g. setting OMP_NUM_THREADS to 16, does anything other than ruin performance.
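A quick way to check all of this at once is a tiny program that prints what the runtime will actually use, e.g.:
// Diagnostic sketch: report the OpenMP runtime's view of the machine.
#include <omp.h>
#include <cstdio>

int main() {
    std::printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    #pragma omp parallel
    #pragma omp single
    std::printf("threads in region     = %d\n", omp_get_num_threads());
    return 0;
}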

Is it possible for OpenMP to run different threads on the same CPU (core)?

Assume I have two cores, denoted core1 and core2. If I use OpenMP to parallelize my program, two threads will be generated. Is it possible for the OpenMP implementation to allocate both threads to core1 for execution, instead of one to core1 and one to core2? In the first case we would lose parallelism.
I am using the Intel OpenMP that comes with icc. By default, is it possible for different threads to run on the same CPU (core)?
Thanks.
It is possible to instruct the OpenMP runtime to do specific binding (or pinning in Intel's terminology) of the threads to the available CPU cores. OpenMP 4.0 comes with provisions to specify this in an abstract and portable way, while current OpenMP implementations provide their own specific mechanisms to do it:
KMP_AFFINITY for Intel compilers - see here;
GOMP_CPU_AFFINITY for GCC (and Intel in compatibility mode) - see here.
Unless these are set, both runtimes default to no binding and the OS is free to dispatch the threads as it sees fit, e.g. it might dispatch both threads on a single core. The latter is rather unlikely unless there are other running processes that require lots of CPU time. Still, most OS schedulers tend to constantly migrate threads and processes around, so for maximum performance it is advisable to employ the binding mechanisms.
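To see where the threads actually land, here is a small Linux-specific sketch (sched_getcpu() is a glibc extension; some compilers need -D_GNU_SOURCE):
// Sketch: report which core each OpenMP thread runs on. Example pinning,
// set in the environment before launching:
//   KMP_AFFINITY=granularity=fine,compact   (Intel runtime)
//   GOMP_CPU_AFFINITY="0 1"                 (GCC runtime)
#include <sched.h>
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel num_threads(2)
    std::printf("thread %d is running on core %d\n",
                omp_get_thread_num(), sched_getcpu());
    return 0;
}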