I cannot get a speedup higher than 2 with the in-place sorting algorithms (quicksort and balanced quicksort; QS/BQS) from the parallel implementation of libstdc++ (parallel mode). I have tried running the code on many different systems with 16 to 24 cores. I have also tried the GNU and Intel C++ compilers, in different versions, always with the same results. The speedup stays around 2 for any number of cores between 2 and the maximum.
In contrast, multi-way merge sort (MWMS) scales well (speedup around 10 using 16 threads on a 16-core machine). According to J. Singler's presentation "The GNU libstdc++ parallel mode: Benefit from Multi-Core using the STL", their measured speedups for BQS are almost the same as for MWMS (see page 18, http://ls11-www.cs.uni-dortmund.de/people/gutweng/AD08/VO11_parallel_mode_overview.pdf); they observed a speedup over 20 with BQS using 32 threads.
Any idea why this happens, or what I am doing wrong?
I have seemingly solved the problem simply by calling:
omp_set_nested(1);
The documentation is a little unclear about this requirement. Moreover, I would expect the library to be able to make this call by itself. Hopefully this will also help someone else.
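For completeness, here is a minimal sketch of how the pieces fit together (the array size and contents are just placeholders), compiled with -fopenmp; the balanced quicksort is requested explicitly via the parallel-mode algorithm tag:

    #include <parallel/algorithm>   // __gnu_parallel::sort and the algorithm tags
    #include <omp.h>
    #include <vector>
    #include <random>

    int main()
    {
        // Without this call, QS/BQS stayed at a speedup of ~2 regardless of core count.
        omp_set_nested(1);

        std::vector<int> v(100 * 1000 * 1000);          // placeholder size
        std::mt19937 gen(42);
        for (auto &x : v) x = static_cast<int>(gen());

        // Explicitly request balanced quicksort (multiway_mergesort_tag would select MWMS instead).
        __gnu_parallel::sort(v.begin(), v.end(),
                             __gnu_parallel::balanced_quicksort_tag());
    }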
What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the efficiency of any particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads, so it can scale linearly with however many cores you have.
Now, for the question of real cores vs. logical processors (Hyper-Threading, SMT), you might find https://superuser.com/a/279803/112292 interesting. There are also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency, so the speedup you can gain from it depends entirely on your application (is it compute-heavy or memory-heavy?), and the only way to find out is to benchmark.
There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling, and that translates to the real world: using dozens of CPU cores is trivial with std::thread.
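As a minimal illustration (not a benchmark), spawning one std::thread per hardware thread and joining them is all the "scaling infrastructure" the standard library asks of you; whether the work itself scales linearly is entirely up to the workload:

    #include <thread>
    #include <vector>
    #include <numeric>
    #include <cstdio>

    int main()
    {
        const unsigned n = std::thread::hardware_concurrency(); // logical processors
        std::vector<long long> partial(n, 0);
        std::vector<std::thread> pool;

        for (unsigned t = 0; t < n; ++t)
            pool.emplace_back([t, n, &partial] {
                long long local = 0;                 // accumulate locally: no locks, no false sharing
                for (long long i = t; i < 100000000; i += n)
                    local += i;                      // each thread works on an independent slice
                partial[t] = local;
            });

        for (auto &th : pool) th.join();
        std::printf("sum = %lld (using %u threads)\n",
                    std::accumulate(partial.begin(), partial.end(), 0LL), n);
    }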
I have parallelized an already existing code for computer vision applications using OpenMP. I think that I designed it well because:
The workload is well-balanced
There is no synchronization/locking mechanism
I parallelized the outermost loops
All the cores are used for most of the time (there are no idle cores)
There is enough work for each thread
Now, the application doesn't scale when using many cores, e.g. it doesn't scale well beyond 15 cores.
The code uses external libraries (e.g. OpenCV and IPP) whose code is already optimized and vectorized, while I manually vectorized some portions of my own code as best I could. However, according to Intel Advisor, the code isn't well vectorized, but there is not much left to do: I already vectorized the code where I could, and I can't improve the external libraries.
So my question is: is it possible that vectorization is the reason why the code doesn't scale well at some point? If so, why?
In line with comments from Adam Nevraumont, VTune Amplifier can do a lot to pinpoint memory bandwidth issues: https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis.
It may be useful to start at a higher level of analysis than that, though, such as looking at hot spots. If it turns out that most of your time is spent in OpenCV or similar, as you suspect, finding that out early might save some time compared with digging into memory bottlenecks directly.
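To make the "cores are busy but scaling stops" symptom concrete, here is a deliberately memory-bound sketch (a made-up kernel, not the poster's code, compiled with -fopenmp): it does very little arithmetic per byte streamed from RAM, so once a few cores saturate the memory bus, additional threads stop helping, no matter how well the loop is vectorized.

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main()
    {
        // Two ~1 GiB arrays: far larger than any cache, so every pass streams from RAM.
        std::vector<float> src(256 * 1024 * 1024, 1.0f), dst(src.size());

        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < (long long)src.size(); ++i)
            dst[i] = 2.0f * src[i] + 1.0f;   // ~2 flops per 8 bytes moved: bandwidth-bound
        double t1 = omp_get_wtime();

        std::printf("%.3f s with %d threads\n", t1 - t0, omp_get_max_threads());
        return (int)dst[0];                  // keep the compiler from optimizing the loop away
    }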
I have a problem where I need to process a known number of threads in parallel (great), but for which each thread may have a vastly different number of internal iterations (not great). In my mind, this makes it better to do a kernel scheme like this:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);

    for (condition_from_whatever)
    {
        // ...
    } // alternatively, do-while
}
where id(0) is known beforehand, rather than:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    unsigned int glIDy = get_global_id(1); // max "unroll dimension"

    if (glIDy_meets_condition)
        do_something();
    else
        dont_do_anything();
}
which would necessarily execute for the FULL POSSIBLE RANGE of glIDy, with no way to terminate beforehand, as per this discussion:
Killing OpenCL Kernels
I can't seem to find any specific information about the cost of dynamically sized for loops / do-while statements within kernels, though I do see them everywhere in the kernels in NVIDIA's and AMD's SDKs. I remember reading something about how the more aperiodic an intra-kernel conditional branch is, the worse the performance.
ACTUAL QUESTION:
Is there a more efficient way to deal with this on a GPU architecture than the first scheme I proposed?
I'm also open to general information about this topic.
Thanks.
I don't think there's a general answer that can be given to that question. It really depends on your problem.
However here are some considerations about this topic:
for loop / if-else statements may or may not have an impact on the performance of a kernel. The fact is that the performance cost is not at the kernel level but at the work-group level. A work-group is composed of one or more warps (NVIDIA) / wavefronts (AMD). These warps (I'll keep the NVIDIA terminology, but it's exactly the same for AMD) are executed in lock-step.
So if you have divergence within a warp because of an if-else (or a for loop with a different number of iterations per thread), execution is serialized. That is to say, the threads within the warp that follow the first path do their work while the others idle; once they are finished, those threads idle while the others start working.
Another problem arises with these statements if you need to synchronize your threads with a barrier: you get undefined behavior if not all of the threads hit the barrier.
Now, knowing that, and depending on your specific problem, you might be able to group your threads in such a fashion that there is no divergence within a work-group, even if there is divergence between work-groups (which has no impact).
Knowing also that a warp is composed of 32 threads and a wavefront of 64 (maybe not on old AMD GPUs - not sure), you could make the size of your well-organized work-groups equal to, or a multiple of, these numbers. Note that this is quite simplified, because some other factors should be taken into consideration. See for instance this question and the answer given by Chanakya.sun (more digging on that topic would probably be worthwhile).
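As a small host-side illustration of that sizing advice (a sketch with made-up names, assuming a cl_command_queue and cl_kernel are already set up), you can round the global size up to a multiple of the chosen work-group size and let the padding work-items exit early inside the kernel:

    #include <CL/cl.h>

    // Round the global work size up to a multiple of the local size, so every
    // work-group is "full" (e.g. local_size = 64 is a multiple of both a
    // 32-thread warp and a 64-thread wavefront).
    void enqueue_padded(cl_command_queue queue, cl_kernel kernel,
                        size_t num_items, size_t local_size)
    {
        size_t global_size = ((num_items + local_size - 1) / local_size) * local_size;
        // The kernel is expected to check get_global_id(0) < num_items (passed as
        // a kernel argument) and return early for the padding work-items.
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    }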
If your problem cannot be organized as just described, I'd suggest considering running OpenCL on CPUs, which are quite good at dealing with branching. If I recall correctly, you typically get one work-item per work-group there. In that case, it's better to check the documentation from Intel and AMD for the CPU. I also very much like chapter 6 of Heterogeneous Computing with OpenCL, which explains the differences between using OCL with GPUs and with CPUs when programming.
I like this article too. It's mainly a discussion about improving the performance of a simple reduction on the GPU (not your problem), but the last part also examines performance on CPUs.
One last thing, regarding your comments on the answer provided by @Oak about "intra-device thread queuing support", which is actually called dynamic parallelism: this feature would obviously solve your problem, but even with CUDA you'd need a device with compute capability 3.5 or higher, so even NVIDIA GPUs with the Kepler GK104 architecture (compute capability 3.0) don't support it. For OpenCL, dynamic parallelism is part of version 2.0 of the standard (as far as I know there is no implementation yet).
I like the 2nd version more, since the for loop inserts a false dependency between iterations. If the inner iterations are independent, send each to a different work item and let the OpenCL implementation sort out how best to run them.
Two caveats:
If the average number of iterations is significantly lower than the max number of iterations, this might not be worth the extra dummy work items.
You will have a lot more work items and you still need to calculate the condition for each... if calculating the condition is complicated this might not be a good idea.
Alternatively, you can flatten the indices into the x dimension, group all the iterations into the same work-group, then calculate the condition just once per workgroup and use local memory + barriers to sync it.
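A rough sketch of that last idea (hypothetical kernel and argument names, shown here as an OpenCL C source string embedded in C++): one work-group per original item, the iteration count read once into local memory, and a barrier before each work-item checks whether it has work to do.

    static const char *kFlattenedKernelSrc = R"CLC(
    __kernel void flattened(__global const int *per_item_count,   // iterations per original item
                            __global float *data)
    {
        // One work-group per original "thread"; its work-items are the iterations.
        __local int n_iter;
        if (get_local_id(0) == 0)
            n_iter = per_item_count[get_group_id(0)];   // condition computed once per work-group
        barrier(CLK_LOCAL_MEM_FENCE);

        if (get_local_id(0) < (size_t)n_iter)
            data[get_global_id(0)] *= 2.0f;             // placeholder for do_something()
    }
    )CLC";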
I'm just starting to learn C++ AMP, and I've obtained a few examples that I've built with the VS 2012 RC, but I'm finding that the GPU is slower than the CPU. For instance, the examples by Kate Gregory: http://ampbook.codeplex.com/releases/view/90595 (relevant to her upcoming book http://www.gregcons.com/cppamp/). They were demonstrated by her in a lecture I watched, where she obtained a ~5x performance improvement for the chapter 4 example by using her laptop's GPU (I believe she said it was a 6650) compared to the CPU (not sure what CPU she had). I've tried testing the example myself on a couple of system configurations (below) and have always found the CPU to be faster. I've also tested other examples and found the same. Am I doing something wrong? Is there a reason for the slower than expected performance? Does anyone have an example that would definitely show the GPU being faster?
System 1: Intel i7 2600K with onboard graphics (I expect this to be slower)
System 2: Intel i7 2630QM with Intel HD switchable with an AMD 6770 (I have it running in performance mode, so it should be using the 6770)
System 3: Intel i5 750 with 2x CrossFire AMD HD 5850
Example of results: chapter4 project results in 1.15ms CPU, 2.57ms GPU, 2.55ms GPU tiled.
Edit:
Doh, I think I just found the reason why - the values for the size of the matrices she used in the lecture were different. The sample on the website uses M=N=W=64. If I use 64, 512 and 256 as she did in the lecture then I get the corresponding ~5x increase in performance.
It seems like your overarching question is WHY moving things to the GPU doesn't always get you a benefit. The answer is copy time. Imagine a calculation that takes time proportional to n squared, while copying takes time proportional to n. You might need quite a large n before the time spent copying to and from the GPU is outweighed by the time saved doing the calculation there.
The book mentions this briefly in the early chapters, and Chapters 7 and 8 are all about performance and optimization. Chapter 7 is on Rough Cuts now; Chapter 8 should be there shortly. (Its code is already on Codeplex - the Reduction case study.)
I've just checked in an update to the Chapter 4 code that uses the Tech Ed starting numbers instead of the ones that were there before. Smaller matrices lose too much time to the copy to/from the GPU; larger ones take too long to be a good demo. But do feel free to play around with the sizes. Make them even larger since you don't mind a minute or two of "dead air", and see what happens.
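If you want a self-contained way to see the crossover, a generic C++ AMP matrix multiply along the lines below (a sketch of the usual textbook pattern, not the book's chapter 4 code) lets you vary M, N, and W and watch when the GPU starts to win:

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // C = A (M x W) * B (W x N); all matrices stored row-major in std::vectors.
    void multiply_amp(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int M, int N, int W)
    {
        array_view<const float, 2> a(M, W, A);
        array_view<const float, 2> b(W, N, B);
        array_view<float, 2> c(M, N, C);
        c.discard_data();                       // no need to copy C's old contents to the GPU

        parallel_for_each(c.extent, [=](index<2> idx) restrict(amp)
        {
            float sum = 0.0f;
            for (int k = 0; k < a.extent[1]; ++k)
                sum += a(idx[0], k) * b(k, idx[1]);
            c[idx] = sum;
        });
        c.synchronize();                        // copy the result back to host memory
    }

    // With M = N = W = 64 the copies dominate; with e.g. M = 64, N = 512, W = 256
    // (the lecture's sizes) the arithmetic is large enough to amortize them.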
I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of different caches.
I've developed both ST and MT versions of the functions that perform all the single operations and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was pissed off that it was just 2x; I'd want 4x.
For this reason I've already spent quite a considerable amount of time with -pg and valgrind, using the cache simulator and call graph. The program is working as expected, the cores are sharing the input data to process (i.e. the operations to apply to the data), and cache misses are reported (as expected) when the different threads load the data to be processed (millions of entities or rows, in case that gives you an idea of what I'm trying to do :-) ).
Eventually I tried different compilers, g++ and clang++, both with -O3, and the performance is identical.
My conclusion is that, due to the large amount of data to process (GBs of data), and given that the data eventually has to be loaded into the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If you are compiling with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly would decrease performance).
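For reference, a minimal sketch of what that looks like (the loop and the prefetch distance are invented for illustration; a useful distance has to be found by measuring):

    #include <cstddef>

    // Sum an array while hinting the prefetcher a few cache lines ahead.
    // __builtin_prefetch(addr, rw, locality): rw = 0 for read, locality 0..3.
    double sum_with_prefetch(const double *data, std::size_t n)
    {
        const std::size_t dist = 16;   // prefetch distance in elements (tune by benchmarking)
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
        {
            if (i + dist < n)
                __builtin_prefetch(&data[i + dist], 0, 1);
            s += data[i];
        }
        return s;
    }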