I am using a library that is already parallelized with OpenMP. The issue is that 2-4 cores seem enough for the processing it is doing. Using more than 4 cores makes little difference.
My code is like this:
for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);
Since 4 cores seem enough for the library (i.e., about 4 cores should be used inside Call_To_Library), and I am working on a 16-core machine, I intend to also parallelize my for loop. Note that this for loop has at most 3-4 iterations.
What would be the best approach to parallelize this outer for loop? Can I also use OpenMP? Is nested parallelization good practice? The library I am calling already uses OpenMP, and I cannot modify its code (nor would it be straightforward anyway).
PS. Even though the outer loop has only 4 iterations, it is worth parallelizing: each call to the library takes 4-5 seconds.
If there is no dependency between iterations of this loop you can do:
#pragma omp parallel for schedule(static)
for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);
If, as you say, every invocation of Call_To_Library takes that much time, the overhead of nested OpenMP regions will probably be negligible.
Moreover, you say that you have no control over the number of OpenMP threads created inside Call_To_Library. This solution multiplies the number of OpenMP threads by 4, and you will most likely see close to a 4x speedup. The inner Call_To_Library was probably parallelized in such a way that no more than a few OpenMP threads can execute at the same time; with the external parallel for you increase that number by a factor of 4.
The risk with nested parallelism is that the number of threads created at the same time can explode, and you could then see a less-than-ideal speedup because of the overhead of creating and tearing down OpenMP threads.
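In practice, a minimal sketch of the combined setup could look like the following (assuming Call_To_Library is the library entry point from the question and that explicitly enabling nested parallelism is acceptable in your environment):
#include <omp.h>

// Enable nested parallelism so the library's inner parallel regions can
// still create their own thread teams inside each call.
omp_set_nested(1);             // or: export OMP_NESTED=TRUE
omp_set_max_active_levels(2);  // the outer loop level + the library's level

// One outer thread per library call (4 at most); each call keeps using
// the library's own OpenMP threads internally.
#pragma omp parallel for num_threads(4) schedule(static)
for (size_t i = 0; i < 4; ++i)
    Call_To_Library(i, ...);
On a 16-core machine this leaves the library roughly 4 cores per call, which matches your observation that it does not scale beyond that.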
Related
Basically I have a program that needs to go over several individual pictures
I do this by:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int picture = 0; picture < 4; picture++) {
    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // do stuff with pixel[picture][row][col]
        }
    }
}
I just want to split the work among 4 cores (1 core per picture) so that each core/thread works on a specific picture. That way core 0 works on picture 0, core 1 on picture 1, and so on. The machine it is being tested on has only 4 cores as well. What is the best way to use OpenMP directives for this scenario? The pragmas I posted are what I think would give the best performance for this scenario.
Keep in mind this is pseudocode. The goal of the program is not important; parallelizing these loops efficiently is the goal.
Just adding a simple
#pragma omp parallel for
is a good starting point for your problem. Don't bother hard-coding how many threads it should use; the runtime will usually do the right thing.
However, it is impossible to say in general what is most efficient. There are many performance factors that cannot be determined from your limited, general example. Your code may be memory-bound and benefit only very little from parallelization on desktop CPUs. You may have a load imbalance, which means you need to split the work into more chunks and process them dynamically; that could be done by parallelizing the middle loop or by using nested parallelism. Whether parallelizing the middle loop works well depends on the amount of work done by the inner loop (and hence the ratio of useful work to overhead). The memory layout also heavily influences the efficiency of the parallelization. Or maybe you even have data dependencies in the inner loop preventing parallelization there...
The only general recommendation one can give is: always measure, never guess. Learn to use the powerful parallel performance analysis tools that are available and incorporate them into your workflow.
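As a concrete illustration of the two points above (the simple starting point versus more, smaller chunks against load imbalance), here is a hedged sketch based on the question's pseudocode; do_stuff is a hypothetical stand-in for the per-pixel work, and collapsing the two outer loops is just one alternative to the middle-loop or nested-parallelism options mentioned above:
#include <cmath>

// Hypothetical per-pixel work, standing in for "do stuff with pixel[...]".
static float do_stuff(float v) { return std::sqrt(v) * 0.5f; }

void process_pictures(float pixel[4][1000][1000])
{
    // Starting point: the outer loop is split across threads, one or more
    // pictures per thread.
    #pragma omp parallel for
    for (int picture = 0; picture < 4; picture++)
        for (int row = 0; row < 1000; row++)
            for (int col = 0; col < 1000; col++)
                pixel[picture][row][col] = do_stuff(pixel[picture][row][col]);

    // If the pictures take very different amounts of time: collapse the two
    // outer loops into many smaller chunks and hand them out dynamically.
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int picture = 0; picture < 4; picture++)
        for (int row = 0; row < 1000; row++)
            for (int col = 0; col < 1000; col++)
                pixel[picture][row][col] = do_stuff(pixel[picture][row][col]);
}
Which variant wins is exactly the kind of question only measurement can answer.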
I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
    int val = -negamax(pos[i].first, -player, depth - 1).first;
    #pragma omp critical
    if (val >= best)
    {
        best = val;
        move = pos[i].second;
    }
}
On my Intel Core i7 (4 physical cores and hyper threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why this is so? I understand the reasons the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed "thread" to "core" as it better expresses my question.
First, check whether the iteration count, pos.size(), is large enough; obviously it needs to be big enough for all threads to get work.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP, unless you're using OpenMP 3.0's task, Cilk, or TBB. There are several things that need to be considered:
(1) In order to use recursive parallelism, you mostly need to call omp_set_nested(1) explicitly. AFAIK, most OpenMP implementations do not recursively spawn parallel for regions by default, because that may end up creating thousands of physical threads and simply overwhelming your operating system.
Until OpenMP 3.0's task, OpenMP had roughly a 1-to-1 mapping from logical parallel tasks to physical threads, so it won't work well for such recursive parallelism. Try it out, but don't be surprised if even thousands of threads are created!
(2) If you really want to use recursive parallelism with traditional OpenMP, you need to write code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
    // Do not use OpenMP
    ...
} else {
    #pragma omp parallel for
    ...
}
(3) You may consider OpenMP 3.0's task. In your code there could be a huge number of concurrent tasks due to the recursion. To work efficiently on a parallel machine, there must be an efficient algorithm mapping these logical concurrent tasks onto physical threads (or logical processors/cores). Raw recursive parallelism in OpenMP creates actual physical threads; OpenMP 3.0's task does not. (A sketch of a task-based version follows after point (4) below.)
You may refer to my previous answer related to recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB support fully nested and recursive parallelism. In my small test program, their performance was far better than OpenMP 3.0's. But that was 3 years ago; you should check the latest OpenMP implementations.
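To make point (3) concrete, here is a minimal, hedged sketch of a task-based recursion with a depth cutoff. Position, moves() and the leaf evaluation are hypothetical stand-ins, not the question's actual code; the point is only that tasks, not threads, are spawned per child position.
#include <climits>
#include <utility>
#include <vector>
#include <omp.h>

struct Position { /* game state (stub for this sketch) */ };

// Hypothetical move generator: children of a position plus the move index.
std::vector<std::pair<Position, int>> moves(const Position& p)
{
    return {};  // stub: supply the real move generator
}

// Each child is explored as an OpenMP task; below 'cutoff' we recurse
// serially, so the number of tasks stays bounded while the runtime maps
// them onto its fixed pool of threads (no extra physical threads).
int negamax_task(const Position& p, int player, int depth, int cutoff)
{
    if (depth == 0) return 0;            // stand-in for the evaluation function
    auto children = moves(p);
    int best = INT_MIN;
    for (size_t i = 0; i < children.size(); i++) {
        #pragma omp task firstprivate(i) shared(best, children) if(depth > cutoff)
        {
            int val = -negamax_task(children[i].first, -player, depth - 1, cutoff);
            #pragma omp critical
            if (val > best) best = val;
        }
    }
    #pragma omp taskwait                 // child tasks must finish before we return
    return best;
}

// Typical call site: one parallel region, one producer of the root tasks.
//   #pragma omp parallel
//   #pragma omp single
//   int score = negamax_task(root, player, maxDepth, maxDepth - 2);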
I don't have detailed knowledge of negamax and minimax, but my gut says that using a recursive pattern plus a lock is unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not a efficient serial search algorithm, and thus, it
makes little sense to parallelize it."
The optimal level of parallelism involves more considerations than simply using as many threads as are available. For example, operating systems used to schedule all threads of a single process onto a single processor to optimize cache performance (unless the programmer changed that explicitly).
I guess OpenMP makes similar considerations when executing such code, so you cannot always assume that the maximum number of threads is actually running.
Whaddya mean, all 8 available threads? A CPU like that can probably run hundreds of threads! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
- Has the environment variable OMP_NUM_THREADS been created and set? If it is set to 4, there's your answer: your OpenMP environment is configured to start only 4 threads, at most.
- If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set, then omp_set_num_threads() will override it at run time. (A small check program is sketched below.)
- Whether 8 hyper-threads outperform 4 real threads.
- Whether oversubscribing, e.g. setting OMP_NUM_THREADS to 16, does anything other than ruin performance.
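If in doubt, a small check program using only standard OpenMP API calls tells you what the runtime is actually configured to do:
#include <cstdio>
#include <omp.h>

int main()
{
    // Upper bound the runtime would use for the next parallel region
    // (reflects OMP_NUM_THREADS or an earlier omp_set_num_threads call).
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    // Number of logical processors the runtime can see.
    std::printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("threads actually started: %d\n", omp_get_num_threads());
    }
    return 0;
}
Compile it with your compiler's OpenMP flag (e.g. -fopenmp for gcc) and run it with and without OMP_NUM_THREADS set; that shows whether the environment, rather than the hardware, is what limits you to 4 threads.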
OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice over the other?
EDIT:
Here is an interesting paper related to the SIMD directive.
A simple answer:
OpenMP used to exploit only multiple threads for multiple cores. This new simd extension allows you to explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.
(Note that a SIMD instruction is executed in a single thread on a single core, by design. The meaning of SIMD can be expanded quite a bit for GPGPU, but I don't think you need to consider GPGPU for OpenMP 4.0.)
So, once you know SIMD instructions, you can use this new construct.
In a modern CPU, roughly there are three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could say this is vector-level or so).
ILP is done automatically by your out-of-order CPUs, or compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were a way to use them (as well as compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.
Take a very simple example:
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];
The above code computes a sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.
There are multiple ways to parallelize this loop. For example, before OpenMP 4.0, it could be parallelized only with the parallel for construct: each thread performs N/#threads iterations, spread across multiple cores.
However, you might think that using multiple threads for such a simple addition is overkill. That is where vectorization comes in, which is mostly implemented with SIMD instructions.
Using SIMD would look like this:
for (int i = 0; i < N; i += 8)
    VECTOR_ADD(A + i, B + i, C + i);
This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit or 8-way (8 * 32 bits); and (2) N is a multiple of 8.
An 8-way SIMD instruction means that 8 items in a vector can be executed in a single machine instruction. Note that Intel's latest AVX provides such 8-way (32-bit * 8 = 256 bits) vector instructions.
With SIMD, you still use a single core (again, this is only for conventional CPUs, not GPUs). But you can exploit parallelism hidden in the hardware: modern CPUs dedicate hardware resources to SIMD instructions, and each SIMD lane executes in parallel.
You can use thread-level parallelism at the same time: the above example can be further parallelized with parallel for, as sketched below.
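For instance, a hedged sketch combining both levels on the same vector addition (OpenMP 4.0 syntax, reusing the A, B, C and N from the example above) would be:
// Threads split the iteration space across cores; within each thread's
// chunk the compiler is asked to emit SIMD instructions.
#pragma omp parallel for simd
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];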
(However, I have doubts about how many loops can really be transformed into SIMDized loops. The OpenMP 4.0 specification seems a bit unclear on this, so real performance and practical restrictions will depend on actual compilers' implementations.)
To summarize, the simd construct allows you to use SIMD instructions; in turn, more parallelism can be exploited along with thread-level parallelism. However, I think the actual implementations will matter.
The linked-to standard is relatively clear (p. 13, lines 19-20):
When any thread encounters a simd construct, the iterations of the
loop associated with the construct can be executed by the SIMD lanes
that are available to the thread.
SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It's exposing the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this intel blog post.
So basically, you'll want to use omp parallel to distribute work onto different threads, which can then migrate to multiple cores; and you'll want to use omp simd to make use of vector pipelines (say) within each core. Normally omp parallel would go on the "outside" to deal with coarser-grained parallel distribution of work and omp simd would go around tight loops inside of that to exploit fine-grained parallelism.
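As a hedged illustration of that layering (the array names, sizes and scale factor are made up for this sketch), the two directives would typically be combined like this:
void scale_image(float** in, float** out, int height, int width, float scale)
{
    // Coarse-grained: rows are distributed across the threads of the team.
    #pragma omp parallel for
    for (int row = 0; row < height; ++row)
    {
        // Fine-grained: within its row, each thread asks the compiler to
        // use the SIMD lanes (vector pipelines) of its core.
        #pragma omp simd
        for (int col = 0; col < width; ++col)
            out[row][col] = scale * in[row][col];
    }
}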
Compilers aren't required to make SIMD optimization in a parallel region conditional on the presence of the simd clause. Compilers I'm familiar with continue to support nested loops (parallel outer, vector inner) in the same way as before.
In the past, OpenMP directives were usually taken to prevent loop-interchange optimizations involving the outer parallelized loop (or multiple loops with the collapse clause). This seems to have changed in a few compilers.
OpenMP 4 opens up new possibilities, including optimizing a parallel outer loop with a non-vectorizable inner loop by a sort of strip mining when omp parallel do [for] simd is set. ifort sometimes reports this as outer-loop vectorization when it is done without the simd clause. The loop may then be optimized for a smaller number of threads than with omp parallel do simd, which seems to need more threads than the simd vector width to pay off. Such a distinction might be expected because, without the simd clause, the compiler is implicitly asked to optimize for a loop count such as 100 or 300, while the simd clause requests unconditional simd optimization.
gcc 4.9's omp parallel for simd looked quite effective on a 24-core platform I had access to.
I want to iterate over an image pixel by pixel and do about 1000 floating point operations per pixel. Do you think I should use multithreading or multiprocessing, i.e. boost::thread or OpenMP, for this? Is there a rule of thumb to choose between these two (for fastest speed)? I have understood that creating threads or switching between threads is several times faster than creating or switching between processes. On the other hand, implementing OpenMP code is much easier.
My solution right now:
#pragma omp parallel for
for (size_t i = 0; i < 640; ++i) {
    for (size_t j = 0; j < 480; ++j) {
        // do 1000 float operations
    }
}
OpenMP is more than sufficient for this; in fact, Boost does not even have a built-in parallel loop construct.
Do you think I should use multi-threading or multiprocessing
Although OpenMP stands for Open MultiProcessing, it is in fact a multithreading library.
An alternative library worth looking at is Intel TBB.
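If you do want to try TBB, a rough equivalent of the loop above (a sketch, not a recommendation over the OpenMP version) could use a 2-D blocked range:
#include <tbb/blocked_range2d.h>
#include <tbb/parallel_for.h>

tbb::parallel_for(
    tbb::blocked_range2d<size_t>(0, 640, 0, 480),
    [&](const tbb::blocked_range2d<size_t>& r) {
        for (size_t i = r.rows().begin(); i != r.rows().end(); ++i)
            for (size_t j = r.cols().begin(); j != r.cols().end(); ++j)
            {
                // do 1000 float operations on pixel (i, j)
            }
    });
Like OpenMP, TBB is a multithreading library; the choice between the two is mostly about ergonomics, not threads versus processes.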
I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner two loops are significant in size.
Is TBB spawning new threads for every iteration of the major loop?
Is there anyway to reduce the overhead?
tbb::task_scheduler_init tbb_init(4); // I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start = boost::chrono::system_clock::now();
for (unsigned i = 0; i != 5000; ++i)
{
    tbb::parallel_for(blk_rng,
        [&](const tbb::blocked_range<size_t>& br) -> void
        {
            :::
It might be interesting to note that OpenMP (which I am trying to remove!!!) doesn't have this problem.
I am compiling with:
Intel ICC 12.1 with -O3 -xHost -mavx
on an Intel 2500K (4 cores)
EDIT: I can't really change the order of the loops, because the outer loop's test would need to be replaced with a predicate based on the loop's results.
No, TBB does not spawn new threads for every invocation of parallel_for. Actually, unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and in the case of implicit initialization (with task_scheduler_init omitted), the same worker threads are used until the end of the program.
So the performance issue is caused by something else. The most likely reasons, from my experience, are:
lack of compiler optimizations, auto-vectorization being first among them (this can be checked by comparing the single-threaded performance of the OpenMP and TBB versions; if TBB is much slower, then this is the most likely reason).
cache misses: if you run through the same data 5000 times, cache locality is hugely important, and OpenMP's default schedule(static) works very well here, deterministically repeating exactly the same partitioning each time, while TBB's work-stealing scheduler has significant randomness. Setting the blocked_range grain size to problem_size/num_threads ensures one piece of work per thread, but it does not guarantee the same distribution of pieces across runs; affinity_partitioner is supposed to help with that (see the sketch below).
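A hedged sketch of that last point, reusing the question's variable names (crs_ and its y_sz member come from the question; everything else is filled in only for illustration):
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>

// Reuse one partitioner object across all 5000 outer iterations so TBB can
// replay roughly the same thread-to-chunk assignment each time, which is
// what helps cache locality here.
tbb::affinity_partitioner ap;

for (unsigned i = 0; i != 5000; ++i)
{
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, crs_.y_sz, crs_.y_sz/4), // grain ~ size/threads
        [&](const tbb::blocked_range<size_t>& br)
        {
            // ::: loop body from the question
        },
        ap);
}
The partitioner must outlive the loop and be passed to every parallel_for call; creating a fresh one per iteration would throw away the affinity information it accumulates.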