C++ Intel TBB inner loop optimisation

I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner 2 loops are significant in size.
Is TBB spawning new threads for every iteration of the major loop?
Is there any way to reduce the overhead?
tbb::task_scheduler_init tbb_init(4); // I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start = boost::chrono::system_clock::now();
for (unsigned i = 0; i != 5000; ++i)
{
    tbb::parallel_for(blk_rng,
        [&](const tbb::blocked_range<size_t>& br) -> void
        {
:::
It might be interesting to note that OpenMP (which I am trying to remove!!!) doesn't have this problem.
I am compiling with:
Intel ICC 12.1 with -O3 -xHost -mavx
on an Intel 2500K (4 cores)
EDIT: I can't really change the order of the loops, because the outer loop's test needs to be replaced with a predicate based on the inner loops' results.

No, TBB does not spawn new threads for every invocation of parallel_for. Unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and in the case of implicit initialization (with task_scheduler_init omitted), the same worker threads are used until the end of the program.
So the performance issue is caused by something else. The most likely reasons, from my experience, are:
lack of compiler optimizations, auto-vectorization being the first to check (this can be verified by comparing the single-threaded performance of OpenMP and TBB; if TBB is much slower, this is the most likely reason);
cache misses: if you run through the same data 5000 times, cache locality has huge importance, and OpenMP's default schedule(static) works very well here, deterministically repeating exactly the same partitioning each time, while TBB's work-stealing scheduler has significant randomness. Setting the blocked_range grain size equal to problem_size/num_threads ensures one piece of work per thread but does not guarantee the same distribution of pieces across threads; affinity_partitioner is supposed to help with that.
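For illustration, here is a minimal sketch of the affinity_partitioner idea, under the assumption that the original inner work has been factored into a hypothetical process_row() function and that y_sz plays the role of crs_.y_sz; the partitioner object must outlive all 5000 parallel_for calls so TBB can replay roughly the same chunk-to-thread mapping each time.
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>

void process_row(std::size_t y);   // placeholder for the original inner-loop body

void run(std::size_t y_sz)         // y_sz stands in for crs_.y_sz
{
    tbb::affinity_partitioner ap;  // must outlive all the parallel_for calls below
    for (unsigned i = 0; i != 5000; ++i)
    {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, y_sz),      // let TBB pick the grain size
            [&](const tbb::blocked_range<std::size_t>& br)
            {
                for (std::size_t y = br.begin(); y != br.end(); ++y)
                    process_row(y);                        // same data each outer pass
            },
            ap);                   // ask TBB to repeat the chunk-to-thread mapping
    }
}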

Related

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
#pragma omp parallel for num_threads(12)
for (int i=0; ...)
{
// do some heavy computation (pure memory and CPU work, no I/O, no waiting)
}
// ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice on how to approach this? I have really reached a point where I have tried everything I can imagine, without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler - both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program will run on a 1-core machine?
Do I need to disable threading at runtime:
Check the number of cores
If cores > 1, use OpenMP
Else ignore the OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading to run on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, discussed below, that you can take as well.
When running on a single core or even a single hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threaded code should still run correctly, as the operating system schedules the various threads onto the core.
That said, context switching between threads does add overhead. Typical OpenMP code, which is compute-bound and relies on work-sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (int i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely use only one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP uses only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to do this at runtime via the environment variable.
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compile-time-known trip counts may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth compiling without OpenMP as well and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of the serial binary.
Of course, if your threading is more complex than simple work-sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which is doable in OpenMP - then it's harder to say how things will behave. You may also have parallel regions that do much less work than a big computational loop; in those cases, where the overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();

#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big, heavy computational work, which is what OpenMP is typically used for, it probably doesn't matter; just use OMP_NUM_THREADS=1
You can test whether it does matter (overhead and disabled optimizations) by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can call:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost with OpenMP. It just depends on the specifics of your case.
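As a minimal sketch of that idea (the printf is only for demonstration): query the processor count OpenMP sees and drop to a single thread when only one is available.
#include <cstdio>
#include <omp.h>

int main()
{
    // If OpenMP can only see one processor, restrict the runtime to one thread;
    // the parallel region below then simply runs serially.
    if (omp_get_num_procs() <= 1)
        omp_set_num_threads(1);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}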

OpenMP nested parallelization

I am using a library that is already parallelized with OpenMP. The issue is that 2-4 cores seem enough for the processing it is doing. Using more than 4 cores makes little difference.
My code is like this:
for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);
Since 4 cores seem enough for the library (i.e., 4 cores should be used in Call_To_Library), and I am working on a 16-core machine, I intend to also parallelize my for loop. Note that this for loop consists of at most 3-4 iterations.
What would be the best approach to parallelize this outer for? Can I also use OpenMP? Is it a best practice to use nested parallelizations? The library I am calling already uses OpenMP and I cannot modify its code (and it wouldn't be straightforward anyway).
PS. Even if the outer loop consists only of 4 iterations, it is worth parallelizing. Each call to the library takes 4-5 seconds.
If there is no dependency between iterations of this loop you can do:
#pragma omp parallel for schedule(static)
for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);
If, as you say, every invocation of Call_To_Library takes such a large amount of time, the overhead of the nested OpenMP constructs will probably be negligible.
Moreover, you say that you have no control over the number of OpenMP threads created inside Call_To_Library. This solution will multiply the number of OpenMP threads by 4, and most likely you will see a 4x speedup. Probably the inner Call_To_Library was parallelized in such a way that no more than a few OpenMP threads could execute at the same time; with the external parallel for you increase that number by a factor of 4.
The problem with nested parallelism could be an explosion of the number of threads created at the same time, and therefore you could see a less-than-ideal speedup because of the overhead of creating and destroying OpenMP threads.
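A rough sketch of that setup, under the assumption that Call_To_Library is the OpenMP-parallelized library routine from the question (its real signature is unknown):
#include <omp.h>

void Call_To_Library(int i /*, ... */);   // provided by the library; signature assumed

void run_outer_loop()
{
    omp_set_nested(1);                    // allow the library's inner parallel regions
                                          // to create their own thread teams
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int i = 0; i < 4; ++i)
        Call_To_Library(i /*, ... */);    // each iteration spawns its own inner team
}
With 4 outer threads each nesting the library's own team, it is worth watching the total thread count so the 16-core machine is not oversubscribed.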

Recursive parallel function not using all cores

I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
    int val = -negamax(pos[i].first, -player, depth - 1).first;
    #pragma omp critical
    if (val >= best)
    {
        best = val;
        move = pos[i].second;
    }
}
On my Intel Core i7 (4 physical cores and hyper threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why this is so? I understand the reasons the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed "thread" to "core" as it better expresses my question.
First, check whether the iteration count, pos.size(), is large enough; obviously it needs to be big enough to keep all threads busy.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP unless you're using OpenMP 3.0's task, Cilk, or TBB. There are several things that need to be considered:
(1) In order to use recursive parallelism, you mostly need to explicitly call omp_set_nested(1). AFAIK, most implementations of OpenMP do not recursively spawn parallel for, because that may end up creating thousands of physical threads, overwhelming your operating system.
Before OpenMP 3.0's task, OpenMP had a sort of 1-to-1 mapping of logical parallel tasks to physical threads, so it won't work well for such recursive parallelism. Try it out, but don't be surprised if even thousands of threads are created!
(2) If you really want to use recursive parallelism with a traditional OpenMP, you need to implement code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
// Do not use OpenMP
...
} else {
#pragma omp parallel for
...
}
(3) You may consider OpenMP 3.0's task. In your code, there could be a huge number of concurrent tasks due to the recursion. To work efficiently on a parallel machine, there must be an efficient mapping of these logical concurrent tasks to physical threads (or logical processors/cores). Raw recursive parallelism in OpenMP creates actual physical threads; OpenMP 3.0's task does not (a rough sketch of the task approach follows after point (4)).
You may refer to my previous answer related to a recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB support fully nested and recursive parallelism. In my small test program, their performance was far better than OpenMP 3.0's. But that was 3 years ago; you should check the latest OpenMP implementations.
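As a rough, hedged illustration of the task-based approach from point (3) - not the poster's negamax - here is the classic recursive Fibonacci with OpenMP 3.0 tasks; the recursion creates logical tasks that are scheduled onto a fixed thread team rather than spawning new physical threads:
#include <omp.h>

long fib(int n)
{
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x) if(n > 20)   // cutoff: don't spawn tasks for tiny subproblems
    x = fib(n - 1);
    y = fib(n - 2);                         // the current task handles the other branch
    #pragma omp taskwait                    // wait for the child task to finish
    return x + y;
}

long parallel_fib(int n)
{
    long result = 0;
    #pragma omp parallel                    // fixed team of threads
    #pragma omp single                      // one thread starts the recursion;
    result = fib(n);                        // the others pick up the generated tasks
    return result;
}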
I do not have detailed knowledge of negamax and minimax, but my gut says that using a recursive pattern plus a lock is unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not a efficient serial search algorithm, and thus, it
makes little sense to parallelize it."
The optimal level of parallelism involves additional considerations beyond simply using as many threads as are available. For example, operating systems used to schedule all threads of a single process onto a single processor to optimize cache performance (unless the programmer changed it explicitly).
I guess OpenMP makes similar considerations when executing such code, and you cannot always assume that the maximum number of threads will actually be used.
Whaddya mean, all 8 available threads? A CPU like that can probably run hundreds of threads! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
Has the environment variable OMP_NUM_THREADS been created and set? If it is set to 4, there's your answer: your OpenMP environment is configured to start at most 4 threads.
If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set, then omp_set_num_threads() will override it at run time.
Whether 8 hyper-threads outperform 4 real threads.
Whether oversubscribing, e.g. setting OMP_NUM_THREADS to 16, does anything other than ruin performance.

Parallel for vs omp simd: when to use each?

OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice than the other?
EDIT:
Here is an interesting paper related to the SIMD directive.
A simple answer:
OpenMP used to be used only to exploit multiple threads across multiple cores. The new simd extension allows you to explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.
(Note that a SIMD instruction is executed in a single thread and a single core, by design. However, the meaning of SIMD can be expanded considerably for GPGPU. But I don't think you need to consider GPGPU for OpenMP 4.0.)
So, once you know SIMD instructions, you can use this new construct.
In a modern CPU, roughly there are three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could say this is vector-level or so).
ILP is done automatically by your out-of-order CPUs, or compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were a way to use them (as well as compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.
Take a very simple example:
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];
The above code computes a sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.
There could be multiple ways to parallelize this loop. For example, prior to OpenMP 4.0, it could be parallelized using only the parallel for construct. Each thread will perform N/#threads iterations across multiple cores.
However, you might think that using multiple threads for such a simple addition would be overkill. That is why there is vectorization, which is mostly implemented by SIMD instructions.
Using a SIMD would be like this:
for (int i = 0; i < N; i += 8)
    VECTOR_ADD(A + i, B + i, C + i);
This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit or 8-way (8 * 32 bits); and (2) N is a multiple of 8.
An 8-way SIMD instruction means that 8 items in a vector can be executed in a single machine instruction. Note that Intel's latest AVX provides such 8-way (32-bit * 8 = 256 bits) vector instructions.
In SIMD, you still use a single core (again, this is only for conventional CPUs, not GPUs). But you can exploit the parallelism hidden in the hardware: modern CPUs dedicate hardware resources to SIMD instructions, where each SIMD lane executes in parallel.
You can use thread-level parallelism at the same time. The above example can be further parallelized by parallel for.
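For concreteness, a minimal hedged sketch of both forms applied to the same loop (the function names are made up, and a compiler with OpenMP 4.0 support is assumed); the first uses only SIMD lanes within one thread, the second distributes chunks across threads and vectorizes each chunk:
// SIMD only: one thread, vector instructions.
void add_simd(float* A, const float* B, const float* C, int N)
{
    #pragma omp simd
    for (int i = 0; i < N; ++i)
        A[i] = B[i] + C[i];
}

// Threads + SIMD: OpenMP 4.0 combined construct.
void add_parallel_simd(float* A, const float* B, const float* C, int N)
{
    #pragma omp parallel for simd
    for (int i = 0; i < N; ++i)
        A[i] = B[i] + C[i];
}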
(However, I have doubts about how many loops can really be transformed into SIMDized loops. The OpenMP 4.0 specification seems a bit unclear on this, so real performance and practical restrictions will depend on actual compilers' implementations.)
To summarize, the simd construct allows you to use SIMD instructions; in turn, more parallelism can be exploited along with thread-level parallelism. However, I think actual implementations will matter.
The linked-to standard is relatively clear (p 13, lines 19+20):
When any thread encounters a simd construct, the iterations of the
loop associated with the construct can be executed by the SIMD lanes
that are available to the thread.
SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It's exposing the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this intel blog post.
So basically, you'll want to use omp parallel to distribute work onto different threads, which can then migrate to multiple cores; and you'll want to use omp simd to make use of vector pipelines (say) within each core. Normally omp parallel would go on the "outside" to deal with coarser-grained parallel distribution of work and omp simd would go around tight loops inside of that to exploit fine-grained parallelism.
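A small sketch of that layout, with a made-up row-scaling kernel: omp parallel for distributes the outer loop across threads, and omp simd vectorizes each thread's inner loop.
// omp parallel on the outside (coarse-grained), omp simd on the inside (fine-grained).
void scale_rows(float* a, const float* b, int rows, int cols)
{
    #pragma omp parallel for                 // rows go to different threads/cores
    for (int r = 0; r < rows; ++r)
    {
        #pragma omp simd                     // each thread vectorizes its own row
        for (int c = 0; c < cols; ++c)
            a[r * cols + c] *= b[c];
    }
}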
Compilers aren't required to make simd optimization in a parallel region conditional on the presence of the simd clause. Compilers I'm familiar with continue to support nested loops - parallel outer, vector inner - in the same way as before.
In the past, OpenMP directives were usually taken to prevent loop-switching optimizations involving the outer parallelized loop (e.g. multiple loops with the collapse clause). This seems to have changed in a few compilers.
OpenMP 4 opens up new possibilities, including optimization of a parallel outer loop with a non-vectorizable inner loop by a sort of strip mining, when omp parallel do [for] simd is set. ifort sometimes reports this as outer-loop vectorization when it is done without the simd clause. It may then be optimized for a smaller number of threads than the omp parallel do simd version, which seems to need more threads than the simd vector width to pay off. Such a distinction might be inferred because, without the simd clause, the compiler is implicitly asked to optimize for a loop count such as 100 or 300, while the simd clause requests unconditional simd optimization.
gcc 4.9 omp parallel for simd looked quite effective when I had a 24-core platform.