Recursive parallel function not using all cores - c++

I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
    int val = -negamax(pos[i].first, -player, depth - 1).first;
    #pragma omp critical
    if (val >= best)
    {
        best = val;
        move = pos[i].second;
    }
}
On my Intel Core i7 (4 physical cores and hyper threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why this is so? I understand why the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed "thread" to "core", as it better expresses my question.

First, check that the iteration count, pos.size(), is large enough. Obviously, there must be at least as many iterations as threads, or some threads will have nothing to do.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP, unless you're using OpenMP 3.0's tasks, Cilk, or TBB. There are several things that need to be considered:
(1) To use recursive parallelism, you usually need to call omp_set_nested(1) explicitly. AFAIK, most OpenMP implementations do not spawn nested parallel for regions by default, because doing so may end up creating thousands of physical threads and overwhelming your operating system.
Until OpenMP 3.0 introduced tasks, OpenMP had a roughly 1-to-1 mapping of logical parallel tasks to physical threads, so it doesn't work well with such recursive parallelism. Try it out, but don't be surprised if thousands of threads are created!
(2) If you really want to use recursive parallelism with traditional OpenMP, you need to write code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
// Do not use OpenMP
...
} else {
#pragma omp parallel for
...
}
(3) Consider OpenMP 3.0's tasks. In your code, recursion could produce a huge number of concurrent tasks. To run efficiently on a parallel machine, there must be an efficient algorithm for mapping these logical concurrent tasks to physical threads (or logical processors, or cores). Raw nested parallelism in OpenMP creates actual physical threads; OpenMP 3.0's tasks do not.
You may refer to my previous answer on recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB fully support nested and recursive parallelism. In my small test program, their performance was far better than OpenMP 3.0's. But that was 3 years ago; you should check the latest OpenMP implementations.
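To make point (3) concrete, here is a minimal sketch of recursive parallelism with OpenMP tasks. It is not the asker's negamax: sum_range and parallel_sum are hypothetical names, and the cutoff values (1024 elements, depth 8) are arbitrary assumptions. The idea is that each recursive call creates tasks, not threads, and a fixed team of threads picks them up:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums v[lo, hi) by splitting the range in two and
// handing each half to an OpenMP task. Below the cutoff it runs serially,
// so the number of tasks stays bounded.
long long sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi, int depth) {
    if (hi - lo <= 1024 || depth <= 0) {
        long long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += v[i];
        return s;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    long long left = 0, right = 0;
    // shared(...) is needed: task variables are firstprivate by default here.
    #pragma omp task shared(left, v)
    left = sum_range(v, lo, mid, depth - 1);
    #pragma omp task shared(right, v)
    right = sum_range(v, mid, hi, depth - 1);
    #pragma omp taskwait  // wait for both child tasks before combining
    return left + right;
}

// One parallel region creates the thread team; a single thread starts the
// recursion and the rest of the team executes the generated tasks.
long long parallel_sum(const std::vector<int>& v) {
    long long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = sum_range(v, 0, v.size(), 8);
    return total;
}
```

If the code is compiled without OpenMP enabled, the pragmas are ignored and the same functions simply run serially.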
I don't have detailed knowledge of negamax and minimax, but my gut says that a recursive pattern combined with a lock is unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not a efficient serial search algorithm, and thus, it
makes little sense to parallelize it."

The optimal level of parallelism depends on more than just using as many threads as are available. For example, operating systems used to schedule all threads of a single process on a single processor to optimize cache performance (unless the programmer explicitly changed this).
I guess OpenMP makes similar considerations when executing such code, so you cannot always assume that the maximum number of threads is used.

Whaddya mean, all 8 available threads? A CPU like that can probably run 100s of threads! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
Has the environment variable OMP_NUM_THREADS been created and set? If it is set to 4, there's your answer: your OpenMP environment is configured to start at most 4 threads.
If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set, then omp_set_num_threads() will override it at run time.
Whether 8 hyper-threads outperform 4 real threads.
Whether oversubscribing, e.g. setting OMP_NUM_THREADS to 16, does anything other than ruin performance.
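A quick way to run the first two checks is to ask the runtime directly. This is only a sketch: configured_max_threads and observed_team_size are hypothetical helper names, and the _OPENMP guards are there so that it also builds with OpenMP switched off. Note that omp_get_num_threads() returns 1 when called outside a parallel region, so it has to be queried from inside one:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the team size for the next parallel region
// (reflects OMP_NUM_THREADS and earlier omp_set_num_threads() calls).
int configured_max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;  // built without OpenMP: everything runs on one thread
#endif
}

// Number of threads the runtime actually gave us for a parallel region.
int observed_team_size() {
    int n = 1;
    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            n = omp_get_num_threads();
#endif
        }
    }
    return n;
}
```

Printing both values at program start shows whether the runtime is configured for 4 or 8 threads before you look for any other cause.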

Related

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
    #pragma omp parallel for num_threads(12)
    for (int i=0; ...)
    {
        // do some heavy computation (pure memory and CPU work, no I/O, no waiting)
    }
    // ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice how to approach this? I have really come to a point where I have tried all I can imagine, but without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.

General tips to improve multithreading (in C++)

I built some C++ code without thinking that I would later need to multithread it. I have now multithreaded the 3 main for loops with OpenMP. Here are the performance comparisons (as measured with time from bash):
Single thread
real 5m50.008s
user 5m49.072s
sys 0m0.877s
Multi thread (24 threads)
real 1m22.572s
user 28m28.206s
sys 0m4.170s
Using 24 cores has reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster; I did not really know what to expect, actually.
- Is there a rule of thumb that would allow one to predict how much faster a given code will run with n threads than with a single thread?
- Are there general tips for improving the performance of multithreaded processes?
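One common rule of thumb for the first question, not mentioned elsewhere in this thread, is Amdahl's law: if a fraction f of the run time is inherently serial, the speedup on n threads is at most 1 / (f + (1 - f) / n). The sketch below uses hypothetical function names and simply turns that formula around to estimate f from a measured speedup. It assumes the simple Amdahl model and ignores threading overhead:

```cpp
#include <cassert>

// Observed speedup: serial wall time divided by parallel wall time.
double speedup(double serial_s, double parallel_s) {
    assert(parallel_s > 0.0);
    return serial_s / parallel_s;
}

// Amdahl's law: best-case speedup on n threads when a fraction f
// of the run time cannot be parallelized.
double amdahl_speedup(double f, int n) {
    assert(n >= 1);
    return 1.0 / (f + (1.0 - f) / n);
}

// The same formula solved for f: the serial fraction implied by an
// observed speedup s on n > 1 threads.
double implied_serial_fraction(double s, int n) {
    assert(n > 1 && s > 0.0);
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n);
}
```

With the timings above (350.008 s serial, 82.572 s on 24 threads), speedup() gives about 4.24 and implied_serial_fraction() gives roughly 0.2. Under this model, about a fifth of the run behaves as if it were serial, which caps the achievable speedup at around 5 no matter how many threads are added.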
I'm sure you know of the obvious like the cost of barriers. But it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use, if I think of more I'll add them:
Keep variables thread-private for as long as possible. Consider this even for reductions, so that only a small number of collective results have to be combined.
Prefer parallel runs of long sections of code and long parallel sections (#pragma omp parallel ... #pragma omp for), instead of parallelizing loops separately (#pragma omp parallel for).
Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole thing using collapse, be aware that OpenMP will linearize it by introducing a fused variable, and that accessing the indices separately incurs overhead.
Use thread private heaps. Avoid sharing pools and collections if possible, even though different members of the collection would be accessed independently by different threads.
Profile your code and see how much time is spent on busy waiting and where that may be occurring.
Learn the consequences of using different schedule strategies. Try what's better, don't assume.
If you use critical sections, name them. All unnamed CSs have to wait for each other.
If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.
Browse similar questions on Stack Overflow, e.g., the wonderful answers here.
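A few of the tips above (thread-private variables, one long parallel region, a named critical section, a deterministic result) can be combined in one small sketch. best_score and the section name merge_best are hypothetical. Each thread scans its share of the data into private variables and enters the critical section only once, to merge its result:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns (maximum value, lowest index at which it occurs).
std::pair<int, int> best_score(const std::vector<int>& scores) {
    assert(!scores.empty());
    int best = scores[0];
    int best_idx = 0;
    const int n = static_cast<int>(scores.size());
    #pragma omp parallel
    {
        // Thread-private running maximum: no locking inside the loop.
        int local_best = scores[0];
        int local_idx = 0;
        #pragma omp for nowait
        for (int i = 1; i < n; ++i) {
            if (scores[i] > local_best) {
                local_best = scores[i];
                local_idx = i;
            }
        }
        // One named critical section per thread, not one per iteration.
        // Ties go to the lowest index, so the result does not depend on
        // which thread merges first.
        #pragma omp critical(merge_best)
        {
            if (local_best > best || (local_best == best && local_idx < best_idx)) {
                best = local_best;
                best_idx = local_idx;
            }
        }
    }
    return {best, best_idx};
}
```

Compared with taking a lock on every iteration, the threads contend only once each, and the tie-breaking rule keeps repeated runs reproducible.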

Intel Tbb overhead issue

I'm using Intel TBB to parallelize some parts of an image-processing algorithm. Although the processing of each pixel is data dependent, there are some cases in which 2 consecutive pixels can be processed in parallel, as below.
ProcessImage(image)
    for each row in image // Create and wait root task for each line here using allocate_root()
        ProcessRow(row)
            for each 2 pixel
                if(parallel())
                    ProcessPixel(A) and ProcessPixel(B) in parallel // For testing, create and process 2 tbb::empty_task() here as child tasks
                else
                    ProcessPixel(A)
                    ProcessPixel(B)
However, because this processing is very fast, the overhead is noticeable. For each input image (512x512), the processing takes about 5-6 ms.
When I experimentally used Intel TBB as described in the comments above, the processing took more than 25 ms.
So is there a better way to use Intel TBB without this overhead, or some other more efficient way to improve the performance of a simple, fast program like this?
TBB does not add such a big (~20 ms) overhead for invoking a parallel algorithm. My guess (since no specifics are provided) is that it is related to one of the following:
If you measure only the first invocation, it includes the overhead of creating worker threads. Note that TBB does not have barriers like OpenMP, so one call to parallel_for might not be enough to create all the threads.
The same happens after worker threads go to sleep because there is no parallel work for them. The wake-up overhead is orders of magnitude lower than that of thread creation, but it can still affect measurements and lead to wrong conclusions.
The TBB scheduler can steal a task from the outer level while inside the nested level (a blocking call), so the nested part will appear to take too long when the thread is actually busy with extra work there.
There is contention when processing (A) and (B) in parallel, caused by either explicit (e.g. a mutex) or implicit (e.g. false sharing) reasons. In any case, this is not TBB-specific.
Thus, the advice for performance measurements with TBB is to consider only the total time of a sequence of computations long enough to hide the initialization overheads.
And of course, as was advised, parallelize at the outer level first. TBB provides a range of patterns for that, including tbb::parallel_pipeline and tbb::flow::graph.
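The measurement advice can be sketched as a small timing helper. It is deliberately library-agnostic: mean_seconds is a hypothetical name, and `work` stands in for whatever calls your TBB code, for example one ProcessImage invocation. The warm-up calls absorb thread creation and wake-up, and only the average over many timed calls is reported:

```cpp
#include <cassert>
#include <chrono>

// Runs `work` a few times untimed (warm-up), then returns the mean
// wall-clock seconds per call over `reps` timed calls.
template <typename Work>
double mean_seconds(Work&& work, int warmup, int reps) {
    assert(warmup >= 0 && reps > 0);
    for (int i = 0; i < warmup; ++i) {
        work();  // not timed: lets worker threads start and caches fill
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / reps;
}
```

For a 5-6 ms workload, something like a handful of warm-up calls followed by a few hundred timed calls should give a steadier number than timing a single invocation, though the right counts depend on your machine.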

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program runs on a 1-core machine?
Do I need to disable threading at runtime:
Check cores
If cores > 1, use OpenMP
Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, that context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i=0; i<N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to be one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to do this at runtime.
However, there's a downside. Compiling with OpenMP can disable certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth also compiling without OpenMP enabled and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of the serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
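One way to keep such code buildable both with and without OpenMP is to guard the runtime calls with the standard _OPENMP macro, which OpenMP-enabled compilers define. A minimal sketch follows. The helper names and the 10000-element threshold are assumptions for illustration, not a recommendation:

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Falls back to 1 when the code is compiled without OpenMP, so callers
// never need to touch omp.h themselves.
int max_threads_or_one() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Squares every element. The if clause makes the region run on a single
// thread when only one thread is available or the loop is too small to
// be worth the threading overhead.
void square_all(std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    const bool go_parallel = max_threads_or_one() > 1 && n > 10000;
    #pragma omp parallel for if (go_parallel)
    for (long i = 0; i < n; ++i) {
        data[i] = data[i] * data[i];
    }
}
```

Without OpenMP the pragma is ignored and the loop is plain serial code, so the same source can produce both the serial binary and the threaded one.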
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost from OpenMP; it depends on the specifics of your case.

C++ intel TBB inner loop optimisation

I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner 2 loops are significant in size.
Is TBB spawning new threads for every iteration of the major loop?
Is there any way to reduce the overhead?
tbb::task_scheduler_init tbb_init(4); //I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start =boost::chrono::system_clock::now();
for(unsigned i=0; i!=5000; ++i)
{
tbb::parallel_for(blk_rng,
[&](const tbb::blocked_range<size_t>& br)->void
{
:::
It might be interesting to note that OpenMP (which I am trying to remove!!!) doesn't have this problem.
I am compiling with:
Intel ICC 12.1 at -O3 -xHost -mavx
On an Intel 2500K (4 cores)
EDIT: I can't really change the order of the loops, because the outer loop's test needs to be replaced with a predicate based on the loop's result.
No, TBB does not spawn new threads for every invocation of parallel_for. Actually, unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and in the case of implicit initialization (with task_scheduler_init omitted), the same worker threads are used until the end of the program.
So the performance issue is caused by something else. The most likely reasons, from my experience, are:
lack of compiler optimizations, auto-vectorization being first (can be checked by comparing single-threaded performance of OpenMP and TBB; if TBB is much slower, then this is the most likely reason).
cache misses; if you run through the same data 5000 times, cache locality is hugely important, and OpenMP's default schedule(static) works very well, deterministically repeating exactly the same partitioning each time, while TBB's work-stealing scheduler has significant randomness. Setting the blocked_range grain size equal to problem_size/num_threads ensures one piece of work per thread but does not guarantee the same distribution of pieces; affinity_partitioner is supposed to help with that.