Intel TBB disable nested parallelism - c++

Consider the following scenario: I am writing a function, within which there is a computationally intensive loop. I parallelized it with TBB's parallel_for. Now, the problem is that this function may be used on its own, and benefits from the parallelization. Or it maybe used within another loop. In the later case, the outer loop can also be parallelized. And often, it is better to only parallelize the outer loop.
Normally in TBB parallelize both the outer and inner loop is not a problem, since unlike OpenMP, nested parallelization in TBB does not results in additional threads being created. TBB merely create more tasks. However, sometime the overhead of the creating more tasks in the inner loop is still undesirable (I observed a 40% slowdown in one extreme situations).
So is there a way to have TBB do not create any task when parallel_for etc is invoked while execution another parallel_for algorithm? Similar to the effect of OMP_NESTED=FALSE for OpenMP.

Simple answer: No
Simple advice: Don't use simple_partitioner
There is no way to affect the parallel_for or other algorithms from outside or on the outer level except restricting their concurrency via task_scheduler_init or task_arena. Though, they are not well-suited for nested parallelism in any case.
Anyway, there should not be such a big impact on the performance if auto_partitioner is used (especially, on the nested level) and you follow TBB recommendation on the amount of work which is efficient for parallelization.
Though I admit that in the extreme cases it can be a problem. We (TBB developers) thought on optimizing auto partitioning parameters of parallel_for depending on the context where it is being executed. But the issue is that knowing whether we are on the nested level or not is not enough to reliably define the parameters. E.g. consider when a parallel_for is launched from a single task: formally, it is nesting but there is no parallelism on the outer level. Some parts of the task scheduler needs to be significantly reworked to be able to provide information about the number of busy workers at any given time in order to enable this idea.

Related

General tips to improve multithreading (in C++)

I have build a C++ code without thinking that I would later have the need to multithread it. I have now multithreaded the 3 main for loops with openMP. Here are the performance comparisons (as measured with time from bash)
Single thread
real 5m50.008s
user 5m49.072s
sys 0m0.877s
Multi thread (24 threads)
real 1m22.572s
user 28m28.206s
sys 0m4.170s
The use of 24 cores have reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster. I did not really know what to expect actually.
- Is there a rule of thumb that would allow one to make prediction about how much faster will a given code run with n threads in comparison to a single thread?
- Are there general tips in order to improve the performance of multithreaded processes?
I'm sure you know of the obvious like the cost of barriers. But it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use, if I think of more I'll add them:
Always try to use thread private variables as long as possible, consider that even for reductions, providing only a small number of collective results.
Prefer parallel runs of long sections of code and long parallel sections (#pragma omp parallel ... #pragma omp for), instead of parallelizing loops separately (#pragma omp parallel for).
Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole thing using collapse, be aware that OpenMP will linearize it introducing a fused variable and accessing the indices separately incurs overhead.
Use thread private heaps. Avoid sharing pools and collections if possible, even though different members of the collection would be accessed independently by different threads.
Profile your code and see how much time is spent on busy waiting and where that may be occurring.
Learn the consequences of using different schedule strategies. Try what's better, don't assume.
If you use critical sections, name them. All unnamed CSs have to wait for each other.
If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.
Browse similar questions on Stack Overflow, e.g., the wonderful answers here.

Intel Tbb overhead issue

Im using Intel TBB to parallel processing some parts of an algorithm processed on images. Although the processing for each pixel is data dependent, there are some cases which 2 consecutive pixels could be processed in parallel as below.
ProcessImage(image)
for each row in image // Create and wait root task for each line here using allocate_root()
ProcessRow(row)
for each 2 pixel
if(parallel())
ProcessPixel(A) and ProcessPixel(B) in parallel // For testing, create and process 2 tbb::empty_task() here as child tasks
else
ProcessPixel(A)
ProcessPixel(B)
However, the overhead occurs because this processing is very fast. For each input image (size of 512x512), the processing costs about 5-6 ms.
When I experimentally used Intel TBB as comment block above, the processing costs more than 25 ms.
So is there any better way using Intel TBB without overhead issue or other more efficient way to improve performance of simple and fast processing program like this ?
TBB does not add such a big (~20ms) overheads for invocation of a parallel algorithm. My guess (since there is no specifics provided) is that it is related to one of the following:
If you measure only the first invocation, it includes overheads for worker threads creation. And note, TBB does not have barriers like OpenMP, so one call to parallel_for might not be enough to create all the threads)
Same situation happens after worker threads go to sleep because of absence of the parallel work for them. The overheads for the wakeup are orders of magnitude lower than for the threads creation but still can affect measurements and impose wrong conclusions.
TBB scheduler can steal a task from outer level to the nested level (blocking call) thus the measurements will look like it takes too long for processing the nested part only while it is busy with an extra work there.
There is a contention for processing (A) and (B) in parallel caused by either explicit (e.g. mutex) or implicit (e.g. false sharing) reasons. But anyway, it is not TBB-specific.
Thus, the advice for performance measurements with TBB is to consider only the total time for long enough sequence of computations that will hide initialization overheads.
And of course as was advised, parallel first on the outer level. TBB provides enough different patterns for that including tbb::parallel_pipeline and tbb::flow::graph

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need disable usage threading at runtime:
Checks cores
If cores > 1 use OpenMP
Else ignore OpenMP devectives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, that context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i=0; i<N; i++)
data[i] = expensive_function(i)
then running on one core will likely only use one thread, or you can explicitly set the number of threads to be one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to do this at runtime.
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be able to, say, be unrolled or vectorized as effectively because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth doing the compilation without OpenMP enabled as well, and using that binary for single-core runs. You can also use this approach to test to see if the difference in optimizations matters, running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
nThreadMax = imp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
// Parallel code path
} else {
// Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on single core you can get some boost with OpenMP. It only depends on the specificity of your case.

how to use quad core CPU in application

For using all the cores of a quad core processor what do I need to change in my code is it about adding support of multi threading or is it which is taken care by OS itself. I am having FreeBSD and language I am using is C++. I want to give complete CPU cycles to my application at least 90%.
You need some form of parallelism. Multi-threading or multi-processing would be fine.
Usually, multiple threads are easier to handle (since they can access shared data) than multiple processes. However, usually, multiple threads are harder to handle (since they access shared data) than multiple processes.
And, yes, I wrote this deliberately.
If you have a SIMD scenario, Ninefingers' suggestion to look at OpenMP is also very good. (If you don't know what SIMD means, see Ninefingers' helpful comment below.)
For multi-threaded applications in C++ may I suggest Boost.Thread which should help you access the full potential of your quad-core machine.
As for changing your code, you might want to consider making things as immutable as possible. State transitions between threads are much more difficult to debug. There a plethora of things that could potentially happen in unexpected ways. See this SO thread.
Another option not mentioned here, threading aside, is the use of OpenMP available via the -fopenmp and the libgomp library, both of which I have installed on my FreeBSD 8 system.
These give you #pragma directives to parallelise certain loops, while statements etc i.e. the bits you can parallelise. It takes care of threading and cpu association for you. Note it is a general solution and therefore might not be the optimum way to parallelise, but it will allow you to parallelise certain routines.
Take a look at this: https://computing.llnl.gov/tutorials/openMP/
As for using threads/processes themselves, certain routines and ways of working lend themselves to it. Can you break tasks out into such a way? Does it make sense to fork() your process or create a thread? If so, do so, but if not, don't try to force your application to be multi-threaded just because. An example I usually give is the greatest common divisor algorithm - it relies on the step before all the time in the traditional implementation therefore is difficult to make parallel.
Also note it is well known that for certain algorithms, parallelisation is actually slower for small values of whatever you are doing in parallel, because although the jobs complete more quickly, the associated time cost of forking and joining (be that threads or processes) actually pushes the time above that of a serial implementation.
I think your only option is to run several threads. If your application is single-threaded, then it will only run on one of the cores (at a time), but if you have more threads, they can run simultaneously.
You need to add support to your application for parallelism through the use of Threading.
Once you have support for parallelism, it's up to the OS to assign your threads to CPU cores.
The first thing I think you should look at is whether your application and its algorithms are suited to be executed in parellel (or possibly as a set of serial tasks that can be processed independently). If this is not the case, it will be difficult to multithread it or break it up into parallel processes, and you may need to look into modifying the way it works.
Once you have established that you will be able to benefit from parallel processing you have the option to either use several processes or threads. The choice depends a lot on the nature of your application and how independent the parallel processes can be. It is easier to coordinate and share data between threads since they are in the same process, but also quite a bit more challenging to develop and debug.
Boost.Thread is a good library if you decide to go down the multi-threaded route.
I want to give complete CPU cycles to my application at least 90%.
Why? Your chip's not hot enough?
Seriously, it takes world experts dozens if not hundreds of hours to parallelize and load-balance an application so that it uses 90% of all four cores. Your CPU is already paid for and it costs the same whether you use it or not. (Actually, it costs slightly less to run, electrically speaking, if you don't use it.) How much is your time worth? How many hours are you willing to invest in order to make more effective use of a resource that may have cost you $300 and is probably sitting idle most of the time anyway?
It's possible to get speedups through parallelism, but it's expensive in human time. You need a good reason to justify it. (Learning how is a good enough reason.)
All the good books I know on parallel programming are for languages other than C++, and for good reason. If you want interesting stuff on parallelism check out Implicit Parallel Programmin in pH or Concurrent Programming in ML or the Fortress Project.

Explicit code parallelism in c++

Out of order execution in CPUs means that a CPU can reorder instructions to gain better performance and it means the CPU is having to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straight-forward way to get my the body of my 'for' loop for this highly-parallel task to be interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
for(int n=0; n<num; n++) {
vm_t data(n);
for(int i=0; i<data.len(); i++) {
data.insn(i).parse();
data.insn(i).eval();
}
}
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?
Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.
Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split between processors.
Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ will from verion 4.2 and newer. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.
Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.
There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?
It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)
Take a look at cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support, and can be tricky to work with.
This answer was written assuming the questions does not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading.". I hope it might help other people who want to parallelize highly-parallel tasks, but target dual/multicore CPUs.
As already posted in another answer, OpenMP is a portable way how to do this. However my experience is OpenMP overhead is quite high and it is very easy to beat it by
rolling a DIY (Do It Youself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything else than prototyping.
Given the nature of your task, What you want to do is most likely a data based parallelism, which in my experience is quite easy - the programming style can be very similar to a single-core code, because you know what other threads are doing, which makes maintaining thread safety a lot easier - an approach which worked for me: avoid dependencies and call only thread safe functions from the loop.
To create a DYI OpenMP parallel loop you need to:
as a preparation create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
create a virtual JobItem interface for the functor, and inherit your functors from this interface
create a thread function which is able process individual JobItems objects
create a thread pool of the thread using this thread function
experiment with various synchronizations primitives to see which works best for you. While semaphore is very easy to use, its overhead is quite significant and if your loop body is very short, you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of manual reset event + atomic (interlocked) counter as a much faster alternative.
experiment with various JobItem scheduling strategies. If you have long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time it makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you are exhausting your tasks, or letting individual threads to steal items from other thread schedules.