Intel TBB overhead issue - C++

I'm using Intel TBB to parallelize some parts of an algorithm that processes images. Although the processing of each pixel is data-dependent, there are some cases in which two consecutive pixels can be processed in parallel, as below.
ProcessImage(image)
    for each row in image          // Create and wait for a root task per line here using allocate_root()
        ProcessRow(row)
            for each 2 pixels
                if (parallel())
                    ProcessPixel(A) and ProcessPixel(B) in parallel   // For testing, create and process 2 tbb::empty_task() here as child tasks
                else
                    ProcessPixel(A)
                    ProcessPixel(B)
However, overhead becomes a problem because this processing is very fast: for each input image (size 512x512), the processing takes about 5-6 ms. When I experimentally used Intel TBB as in the comments above, the processing took more than 25 ms.
So is there a better way to use Intel TBB without the overhead issue, or some other, more efficient way to improve the performance of a simple and fast processing program like this?

TBB does not add such big (~20 ms) overhead for the invocation of a parallel algorithm. My guess (since no specifics are provided) is that it is related to one of the following:
If you measure only the first invocation, it includes the overhead of worker-thread creation. And note that TBB does not have barriers like OpenMP, so one call to parallel_for might not be enough to create all the threads.
The same situation happens after worker threads go to sleep because there is no parallel work for them. The overhead of waking them up is orders of magnitude lower than thread creation, but it can still affect measurements and lead to wrong conclusions.
The TBB scheduler can steal a task from the outer level into the nested level (a blocking call), so the measurements will make it look as if processing the nested part alone takes too long, while in fact it is busy with extra work there.
There is contention when processing (A) and (B) in parallel, caused by either explicit (e.g. a mutex) or implicit (e.g. false sharing) reasons. In any case, this is not TBB-specific.
Thus, the advice for performance measurements with TBB is to consider only the total time for a sequence of computations long enough to hide the initialization overhead.
And of course, as was advised, parallelize at the outer level first. TBB provides plenty of patterns for that, including tbb::parallel_pipeline and tbb::flow::graph.
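As a rough illustration of outer-level parallelism (not taken from the question's code), here is a minimal sketch that parallelizes over whole rows with tbb::parallel_for. Image, row() and ProcessRow() are made-up stand-ins, and the sketch assumes rows can be processed independently of each other, which may or may not hold for this algorithm:

#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Hypothetical stand-ins for the asker's image type and per-row routine.
struct Image {
    int width = 512, height = 512;
    std::vector<unsigned char> pixels = std::vector<unsigned char>(512 * 512);
    unsigned char* row(int y) { return pixels.data() + y * width; }
};

void ProcessRow(unsigned char* row, int width)
{
    for (int x = 0; x < width; ++x)
        row[x] = static_cast<unsigned char>(255 - row[x]);   // placeholder per-pixel work
}

void ProcessImage(Image& image)
{
    // One parallel_for over rows: coarse-grained work items amortize TBB's
    // scheduling cost much better than per-pixel-pair tasks.
    tbb::parallel_for(
        tbb::blocked_range<int>(0, image.height),
        [&](const tbb::blocked_range<int>& r) {
            for (int y = r.begin(); y != r.end(); ++y)
                ProcessRow(image.row(y), image.width);
        });
}

int main()
{
    Image img;
    ProcessImage(img);
}

With one work item per row (or per block of rows), each task does enough work to hide the scheduling overhead that dominated the per-pixel-pair version.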

Related

Threading: Most efficient way for many repeated parallel sweeps over a small array?

I'm optimizing a solver (systems of linear equations) whose most critical part consists of many (1000+) short (~10-1000 microseconds), massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU-cache-sized) arrays; pseudocode:
for (i = 0; i < num_iter; i++)
{
    // SYNC-POINT
    parallel_for (j = 0; j < array_size; j++)
        array_out[j] = some_function(array_in[j]);
    swap(array_in, array_out);
}
Unfortunately, the standard parallelization constructs provided by OpenMP or TBB that I have tried so far (a serial outer loop plus a parallel inner loop, e.g. via tbb::parallel_for) don't seem to handle this extremely fine-grained parallelism very well, because the thread library's setup cost for the inner loop seems to dominate the time spent within the relatively short inner loop. (Note that very fine-grained inner loops are crucial for the total performance of the algorithm, because this way all data can be kept in the L2/L3 CPU cache.)
EDIT to address the answers, questions and comments so far:
The example is just pseudocode to illustrate the idea. The actual implementation takes care of false sharing by padding array lines to the CPU cache-line size.
some_func(array_in, j) is a simple stencil that accesses the current point j and a small neighborhood around it, e.g. some_func(array, j) = array[j-1] + array[j] + array[j+1];
The answer given by Jérôme Richard touches a very interesting point about barriers (in my opinion, the root of the problem). It suggests to "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key". Interesting, but very general. How exactly would this be accomplished in this case?
Does "point-to-point neighbor synchronization" involve an atomic primitive for every entry of the array?
The general solution to this problem is to create the threads and distribute the work only once, and then use fast synchronization points inside the threads. In this case, the outer loop is moved into the threaded function. This is possible with the TBB library by providing a range (tbb::blocked_range<size_t>) and a function to tbb::parallel_for (see an example here).
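A minimal sketch of that range-plus-functor form, with a placeholder standing in for the asker's stencil; note that TBB reuses its worker threads across sweeps, so only the per-sweep scheduling cost and the implicit end-of-sweep barrier remain:

#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// One sweep of the inner loop expressed as a range plus a functor.
// The grain size is an assumption and should be tuned for the workload.
void sweep(const std::vector<double>& array_in, std::vector<double>& array_out)
{
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, array_in.size(), /*grain size*/ 4096),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t j = r.begin(); j != r.end(); ++j)
                array_out[j] = 2.0 * array_in[j];   // placeholder for some_function(array_in[j])
        });
}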
Barriers are known to scale poorly on many-core architectures, especially in such codes. The only way to make the program scale is to reduce the synchronization between threads so as to make it more local. For example, for stencils, the solution is to replace barriers with local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key to solving such problems. In fact, note that the fundamental laws of physics prevent barriers from scaling, because clocks cannot be fully synchronized under general relativity and computers (unfortunately) obey the laws of physics.
Many-core systems are now nearly always NUMA ones. Regarding your configuration, you are almost certainly using an AMD Threadripper processor, which has multiple NUMA nodes. This means you should care about locality and the NUMA allocation policy. The default policy is generally first touch. This means that if your initialization is sequential, or the threads are mapped differently, then cores have to fetch data from remote NUMA nodes, which is slow. In the worst case, all cores can access the same NUMA node and saturate it, resulting in possibly slower execution than the sequential version. You should generally make the initialization parallel for better performance. Getting high performance on such architectures is far from easy. You need to carefully control the allocation policy (numactl can help with that), the initialization (parallel), the thread binding (with taskset, hwloc and/or environment variables) and the memory access pattern (by reading articles/books about how NUMA machines work and applying dedicated algorithms).
By the way, there is probably a false-sharing effect happening in your code, because cache lines of array_out are almost certainly shared between threads. This should not have a critical impact, but it does contribute to poor scalability (especially on your 64-core processor). The general solution to this problem is to align the array in memory on a cache line and to make sure the parallel splitting is done on a cache-line boundary. Alternatively, you can allocate the array parts in each thread. This is generally a better approach, as it ensures data is locally allocated and locally filled, and makes data sharing/communication between NUMA nodes, and even between cores, more explicit (i.e. better controlled), though it can make the code more complex (there is no free lunch).
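A hedged sketch of the alignment part of that advice (the numbers and names are assumptions, 64-byte cache lines being typical on x86): size and align the per-thread chunks to whole cache lines so that no two threads ever write to the same cache line of array_out.

#include <cstddef>

constexpr std::size_t kCacheLine = 64;                       // bytes, an assumption
constexpr std::size_t kDoublesPerLine = kCacheLine / sizeof(double);

// Round the per-thread chunk size up to a multiple of a cache line;
// thread t then handles indices [t * chunk, min((t + 1) * chunk, n)).
std::size_t chunk_for(std::size_t n, std::size_t num_threads)
{
    const std::size_t per_thread = (n + num_threads - 1) / num_threads;
    return ((per_thread + kDoublesPerLine - 1) / kDoublesPerLine) * kDoublesPerLine;
}

// The storage itself can be over-aligned so that chunk boundaries coincide
// with cache-line boundaries, e.g. an illustrative per-thread block:
struct alignas(kCacheLine) AlignedBlock {
    double data[1024];
};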
Sharing data across threads is slow. This is because each CPU core has at least one layer of private cache. The moment you share data between cores/threads, the private caches need to be synchronised, which is slow.
Threads running in parallel across different cores work best if they do not share data.
In your case, if the data is read-only you might be best off giving each thread its own copy of the data. For read-write data, you have to accept the synchronisation slowdown.

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. Each thread doesn't have to interact with the others, and when a solve is found I add it to a common std::vector using a mutex, this way:
if (CubeTest.IsSolved())
{   // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread-safe code
    Solves.push_back(Alg);                          // Add the solve
}
I don't allocate variables in dynamic storage (the heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from std::thread::hardware_concurrency().
I did some tests, always running the same search but changing the number of threads used, and I found things I didn't expect.
I know that if you double the number of threads (if the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, things are as expected up to the sixth thread, but with one additional thread the performance is worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated the test several times and show the average times of all the tests.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and I found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is smaller.
I also think that when you start to use the second thread on the same core, the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether, on the other hand, these numbers are telling me that I'm doing things wrong and I should learn more about multithreading programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is a reduction of runtime by a factor of the number of cores¹. In most cases this is unachievable because of the need for threads to synchronise with one another.
In the worst case, not only is there no improvement due to lack of parallelism, but the overhead of synchronisation as well as cache contention can also make the runtime much worse than the single-threaded program.
Peak memory use often increases linearly with the number of threads, because each thread needs to operate on data of its own.
Total CPU time usage, and therefore energy use, also increases due to the extra time spent on synchronisation. This is relevant to systems that run on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to the extra code that deals with threads.
¹ Whether you get all of the performance out of "logical" cores (i.e. "hyper-threading" or "clustered multithreading") also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, and sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depends on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with, the better. When you have multiple threads filling the same result set, it can sometimes be better to let each thread build its own result set and then merge those sets once all the threads have finished.
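A minimal sketch of that merge-at-the-end approach, using names loosely based on the question (Solves, and a hypothetical SearchBranch standing in for the real tree search): each worker fills a private vector, so no mutex is needed until the single merge step after join().

#include <algorithm>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the real branch search: push one dummy "solve".
void SearchBranch(std::vector<std::string>& localSolves)
{
    localSolves.push_back("R U R' U'");
}

int main()
{
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<std::string>> perThread(n);   // one private result set per thread
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(SearchBranch, std::ref(perThread[i]));
    for (auto& t : workers)
        t.join();

    std::vector<std::string> Solves;                       // merged result set, no locking needed
    for (auto& local : perThread)
        Solves.insert(Solves.end(), local.begin(), local.end());
}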

Why does having more than one thread (parallel processing) in some specific cases degrade performance?

I noticed that having more than one thread running some code is much, much slower than having only one thread, and I have really been pulling my hair out trying to understand why. Can anyone help?
Code explanation:
I sometimes have a very large array whose parts I need to process in parallel for optimization. Each "part" of a row gets looped over and processed in a specific thread. Now I've noticed that having only one "part", i.e. the whole array with a single worker thread running through it, is noticeably faster than dividing the array and processing it as separate sub-arrays with different threads.
bool m_generate_row_worker(ull t_row_start, ull t_row_end)
{
    for (; t_row_start < t_row_end; t_row_start++)
    {
        m_current_row[t_row_start] = m_singularity_checker(m_previous_row[t_row_start], m_shared_random_row[t_row_start]);
    }
    return true;
}
...
//code
...
for (unsigned short thread_indx = 0; thread_indx < noThreads - 1; thread_indx++)
{
    m_threads_array[thread_indx] = std::thread(
        m_generate_row_worker, this,
        thread_indx * m_parts_per_thread, (thread_indx + 1) * m_parts_per_thread);
}
m_threads_array[noThreads - 1] = std::thread(m_generate_row_worker, this,
    (noThreads - 1) * m_parts_per_thread, std::max(noThreads * m_parts_per_thread, m_blocks_per_row));
//join
for (unsigned short thread_indx = 0; thread_indx < noThreads; thread_indx++)
{
    m_threads_array[thread_indx].join();
}
//EDIT
inline ull m_singularity_checker(ull t_to_be_ckecked_with, ull t_to_be_ckecked)
{
    return (t_to_be_ckecked & (t_to_be_ckecked_with << 1)
            & (t_to_be_ckecked_with >> 1)) | (t_to_be_ckecked_with & t_to_be_ckecked);
}
Why does having more than one thread (parallel processing) in some specific cases degrade performance?
Because thread creation has overhead. If the task to be performed has only a small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads require synchronisation, which has overhead that can in some cases exceed the time saved by parallelism.
Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
Because (depending on the CPU architecture) sometimes otherwise correctly implemented and seemingly independent tasks have an effective dependency because they operate on the same area of memory. More specifically, when one thread writes to a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of the resulting cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing".
Because sometimes the introduction of multithreading makes the program more complex, which makes it more difficult for the compiler/optimiser to make use of instruction-level parallelism.
...
In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.
Regarding your program, we cannot rule out any of the above potential issues given the excerpt that you have shown.
Some tips on avoiding or finding the above issues:
Don't create more threads than you have cores, discounting the number of threads that are expected to be blocked (waiting for input, disk, etc.).
Only use multithreading for problems that are computationally expensive (or to do work while a thread is blocked, but this may be more efficiently solved using asynchronous I/O and coroutines).
Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
Keep the memory touched by distinct threads far from each other (not adjacent small elements of an array). If processing an array, divide it into consecutive blocks rather than striping it one element to every (number of threads)th thread; a sketch of the block-wise split follows after this list. In some extreme cases, extra copying into thread-specific data structures and then joining at the end may be efficient.
If you've done all you can and multithreading still measures slower, consider whether it is perhaps not a good solution for your problem.
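A hedged sketch of that block-wise split, using made-up names (process_blockwise and a placeholder per-element operation) rather than the asker's code: each thread writes a contiguous range of the output, so threads rarely touch the same cache line, unlike a strided split that invites false sharing.

#include <cstddef>
#include <thread>
#include <vector>

void process_blockwise(const std::vector<unsigned long long>& in,
                       std::vector<unsigned long long>& out,
                       unsigned num_threads)
{
    const std::size_t n = in.size();
    const std::size_t block = (n + num_threads - 1) / num_threads;   // contiguous chunk per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * block;
        const std::size_t end = begin + block < n ? begin + block : n;
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = in[i] << 1;          // placeholder for the real per-element work
        });
    }
    for (auto& w : workers)
        w.join();
}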
Using threads does not always mean that you will get more work done. For example, using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up the threads and, depending on how many cores you have, the OS, etc., some amount of context switching between threads (saving one thread's stack/registers and loading the next; it all adds up). At some point, adding more threads will start to slow your program down, since more time is spent switching between threads and setting them up and tearing them down than doing actual work. So you may be a victim of this.
If you have 100 very small items of work to do (like 1 instruction each), then 100 threads are guaranteed to be slower, since you now have ("many instructions" + 1) x 100 units of work to do, where the "many instructions" are the work of setting up the threads, clearing them up at the end, and switching between them.
So you may want to start profiling this for yourself: how much work is done processing each row, and how many threads in total are you setting up?
One very crude but quick/simple way to start measuring is to take the time elapsed to process one row in isolation (e.g. use the std::chrono functions to take the time at the start of processing one row and again at the end to see the total time spent). Then maybe do the same test over the entire table to get an idea of the total time.
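A small sketch of that std::chrono measurement, with process_one_row() as a hypothetical stand-in for whatever routine actually processes a single row:

#include <chrono>
#include <iostream>

// Stand-in for the real per-row routine; replace with the actual work.
void process_one_row() { /* ... */ }

void time_one_row()
{
    const auto start = std::chrono::steady_clock::now();
    process_one_row();
    const auto end = std::chrono::steady_clock::now();
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "one row took " << us.count() << " us\n";
}

int main() { time_one_row(); }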
If you find that an individual row takes very little time, then you may not be getting much benefit from the threads. You may be better off splitting the table into chunks of work equal to the number of cores your CPU has, then changing the number of threads (+/-) to find the sweet spot. Just creating threads based on the number of rows is a poor choice; you really want to design it to max out each core (for example).
So if you had 4 cores, maybe start by splitting the work into 4 threads. Then test it with 8; if that's better, try 16; if it's worse, try 12, etc.
Also you might get different results on different PCs...

How can I run tasks on the CPU and a GPU device concurrently?

I have this piece of code that is as profiled, optimised and cache-efficient as I am likely to get it with my level of knowledge. It runs on the CPU conceptually like this:
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // result is some array where I store the result of RunTask.
}
It just so happens that RunTask() is essentially a set of linear algebra operations that operate repeatedly on the same, very large dataset every time, so it's suitable to run on a GPU. So I would like to achieve the following:
Offload some of the tasks to the GPU
While the GPU is busy, process the rest of the tasks on the CPU
For the CPU-level operations, keep my super-duper RunTask() function without having to modify it to comply with restrict(amp). I could of course design a restrict(amp) compliant lambda for the GPU tasks.
Initially I thought of doing the following:
// assume we know exactly how much time the GPU/CPU needs per task, and this is the
// most time-efficient combination:
int numberOfTasks = 1000;
int ampTasks = 800;

// RunTasksAMP(start, end) sends a restrict(amp) kernel to the GPU, and stores the result in the
// returned array_view on the GPU
Concurrency::array_view<ResultType, 1> concurrencyResult = RunTasksAMP(0, ampTasks);

// perform the rest of the tasks on the CPU while we wait
#pragma omp parallel for schedule(dynamic)
for (int i = ampTasks; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // this is thread-safe
}

// do something to wait for the parallel_for_each in RunTasksAMP to finish.
concurrencyResult.synchronize();
//... now load the concurrencyResult array into the first elements of "result"
But I doubt you could do something like this, because a call to parallel_for_each behaves as though it's synchronous (http://msdn.microsoft.com/en-us/library/hh305254.aspx).
So is it possible to achieve requests 1-3, or do I have to ditch number 3? Either way, how would I implement it?
See my answer to "will array_view.synchronize_asynch wait for parallel_for_each completion?" for an explanation as to why parallel_for_each can be thought of as a queuing or scheduling operation rather than a synchronous one. This explains why your code should satisfy your requirements 1 & 2. It should also meet requirement 3, although you might want to consider having one function that is restrict(cpu, amp), as this will give you less code to maintain.
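As a rough, MSVC-only C++ AMP sketch of that single restrict(cpu, amp) function (run_task and run_on_both are made-up names, not the question's RunTask): the restriction limits the function to the language subset both targets accept, so the same code can be called from the CPU loop and from inside the AMP kernel.

#include <amp.h>
#include <cstddef>
#include <vector>
using namespace concurrency;

// Shared between CPU and GPU thanks to restrict(cpu, amp).
float run_task(float x) restrict(cpu, amp)
{
    return x * x + 1.0f;            // placeholder for the real linear algebra work
}

void run_on_both(std::vector<float>& host, array_view<float, 1> gpuView)
{
    // Queue the GPU portion; parallel_for_each returns almost immediately.
    parallel_for_each(gpuView.extent, [=](index<1> idx) restrict(amp) {
        gpuView[idx] = run_task(gpuView[idx]);
    });

    // Process the CPU portion while the GPU works (could be an OpenMP loop).
    for (std::size_t i = 0; i < host.size(); ++i)
        host[i] = run_task(host[i]);

    // Block until the queued kernel has finished and results are copied back.
    gpuView.synchronize();
}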
However, you may want to consider some of the performance implications of your approach.
Firstly, parallel_for_each only queues work; the data copies between host and GPU memory use host resources (assuming your GPU is discrete and/or does not support direct copy). If your work on the host saturates all the resources required to keep the GPU working, then you may actually slow down your GPU calculation.
Secondly, many calculations that are data-parallel and amenable to running on a GPU are so much faster there that the additional overhead of trying to run work on the CPU doesn't result in an overall speedup. The overhead includes item one (above) and the additional overhead of coordinating work on the host (scheduling threads, merging the results, etc.).
Finally, your implementation above does not take into account any variability in the time taken to run tasks on the GPU and the CPU. It assumes that 800 AMP tasks will take as long as 200 CPU tasks. This may be true on some hardware but not on others. If one set of tasks takes longer than expected, then your application will block and wait for the slower set of tasks to complete. You can avoid this by using a master/worker pattern to pull tasks from a queue until there are no more available tasks. With this approach, in the worst case your application will have to wait for the final task to complete, not a whole block of tasks. Using the master/worker approach also means that your application will run with equal efficiency regardless of the relative CPU/GPU performance.
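A hedged, CPU-only sketch of that master/worker idea (RunTask and result echo the question's names, but the rest is made up): workers pull task indices from a shared atomic counter until none remain; a GPU "worker" could pull batches from the same counter in the same way.

#include <atomic>
#include <thread>
#include <vector>

int RunTask(int i) { return i * i; }                 // placeholder work

void run_master_worker(std::vector<int>& result, unsigned num_workers)
{
    std::atomic<int> next{0};                        // shared task queue: just an index
    const int total = static_cast<int>(result.size());
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            // Each fetch_add hands out a unique index, so every task runs exactly once
            // and faster workers simply pull more tasks.
            for (int i = next.fetch_add(1); i < total; i = next.fetch_add(1))
                result[i] = RunTask(i);
        });
    }
    for (auto& t : workers)
        t.join();
}

int main()
{
    std::vector<int> result(1000);
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    run_master_worker(result, n);
}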
My book discusses examples of scheduling work across multiple GPUs using a master/worker pattern (n-body) and a parallel queue (cartoonizer). You can download the source code from CodePlex. Note that it deliberately does not cover sharing work between the CPU and GPU, for the reasons outlined above, based on discussions with the C++ AMP product team.

Parallelization efficiency of OpenMP

I have C++ code containing many for-loops parallelized with OpenMP on an 8-thread computer.
But the speed of execution with a single thread is faster than with 8 parallel threads. I was told that if the load of the for-loops increases, parallelization will become efficient.
Here, by load I mean, for example, the maximum number of iterations of a loop. The thing is, I don't have a chance to compare single-threaded and 8-thread parallel code for a huge amount of data.
Should I use the parallel code anyway? Is it true that parallelization efficiency will increase with the load of the for-loops?
The canonical use case for OpenMP is the distribution among a team of threads of the iterations of a high-iteration-count loop, with the condition that the loop iterations have no direct or indirect dependencies.
You can spot what I mean by direct dependencies by considering the question: does the order of loop iteration execution affect the results? If, for example, iteration N+1 uses the results of iteration N, you have such a dependency: running the loop iterations in reverse order would change the output of the routine.
By indirect dependencies I mean mainly data races, in which threads have to coordinate their access to shared data; in particular, they have to ensure that writes to shared variables happen in the correct sequence.
In many cases you can redesign a loop with dependencies to remove those dependencies.
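A small hedged illustration (not from the question's code): a naive shared accumulator creates a dependency between iterations, but OpenMP's reduction clause gives each thread a private partial sum and combines them at the end, so the iterations become independent.

#include <cstdio>

int main()
{
    const int n = 1000000;
    double sum = 0.0;

    // Each thread accumulates into its own private copy of sum;
    // OpenMP combines the partial sums after the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1.0);

    std::printf("sum = %f\n", sum);
    return 0;
}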
IF you have a high iteration count loop which has no such dependencies THEN you have a candidate for good speed-up with OpenMP. Here are the buts:
There is some parallel overhead to the computation at the start and end of each such loop; if the iteration count isn't high enough, this overhead may outweigh, partially or wholly, the speedup of running the iterations in parallel. The only way to determine whether this affects your code is to test and measure.
There can be dependencies between loop iterations more subtle than those I have already outlined. Depending on your system architecture and the computations inside the loop, you might (without realising it) program your threads to fight over access to the cache, to I/O resources, or to any other resource. In the worst cases, adding more threads leads to a decreasing execution rate.
You have to make sure that each OpenMP thread is backed up by hardware, not by the pseudo-hardware that hyperthreading represents. One core per OpenMP thread, hyperthreading is snake oil in this domain.
I expect there are other buts to put in here, perhaps someone else will help out.
Now, turning to your questions:
Should I use parallel code anyway? Test and measure.
Is it true that parallelization efficiency will increase with load of for-loops? Approximately, but for your code on your hardware, test and measure.
Finally, you can't become a serious parallel computationalist without measuring run times under various combinations of circumstances and learning what the measurements you make are telling you. If you can't compare sequential and parallel execution for huge amounts of data, you'll have to measure them for modest amounts of data and understand the lessons you learn before making predictions about behaviour when dealing with huge amounts of data.