What is the expected speed-up of using parallelization in C++ (not OpenMP, but <thread>)?

What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.

There is no limit imposed by using <thread>. It creates OS threads, so it can scale linearly with how many cores you have.
Now, for the question of real cores vs. logical processors (Hyper-Threading, SMT), you might find https://superuser.com/a/279803/112292 interesting. There are also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency. So the speedup you can gain from SMT depends purely on your application (is it compute-heavy? is it memory-heavy?), and the only way to find out is to benchmark.

There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world: using dozens of CPU cores is trivial with std::thread.
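As a rough illustration (a minimal sketch, not from the original answer), the kind of code that can scale close to linearly looks like this: independent, CPU-bound work split across std::thread workers, with each thread writing only to its own result slot. On a 2-core/4-logical-processor machine you would typically look for close to 2x from the physical cores, with any extra gain from SMT depending on the workload, as discussed above.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 50000000;
    const unsigned num_threads =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, n, num_threads] {
            // Each thread processes its own contiguous index range.
            const std::size_t begin = n * t / num_threads;
            const std::size_t end = n * (t + 1) / num_threads;
            double local = 0.0;                       // accumulate locally...
            for (std::size_t i = begin; i < end; ++i)
                local += std::sqrt(static_cast<double>(i));
            partial[t] = local;                       // ...and write back once
        });
    }
    for (auto& w : workers) w.join();

    std::cout << std::accumulate(partial.begin(), partial.end(), 0.0) << '\n';
}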

Related

Can badly vectorized code impact scalability?

I have parallelized an already existing code for computer vision applications using OpenMP. I think I designed it well because:
The workload is well-balanced
There is no synchronization/locking mechanism
I parallelized the outermost loops
All the cores are used for most of the time (there are no idle cores)
There is enough work for each thread
Now, the application doesn't scale when using many cores, e.g. it doesn't scale well beyond 15 cores.
The code uses external libraries (i.e. OpenCV and IPP) where the code is already optimized and vectorized, while I manually vectorized some portions of the code as best as I could. However, according to Intel Advisor, the code isn't well vectorized, but there is not much left to do: I already vectorized the code where I could and I can't improve the external libraries.
So my question is: is it possible that vectorization is the reason why the code doesn't scale well at some point? If so, why?
In line with comments from Adam Nevraumont, VTune Amplifier can do a lot to pinpoint memory bandwidth issues: https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis.
It may be useful to start at a higher level of analysis than that, though, such as looking at hot spots. If it turns out that most of your time is spent in OpenCV or a similar library, as you suspect, finding that out early might save you time compared with digging into memory bottlenecks directly.

Will multithreading improve performance significantly if I have a fixed amount of calculations that are independent from each other?

I am programming a raycasting game engine.
Each ray can be calculated without knowing anything about the other rays (I'm only calculating distances).
Since there is no waiting time between calculations, I wonder whether it's worth the effort to make the ray calculations multithreaded or not.
Is it likely that there will be a performance boost?
Most likely multi-threading will improve performance if done correctly. The way you've stated your problem, it is a perfect candidate for multi-threading, since the computations are independent, reducing the need for coordination between threads to a minimum.
Some reasons you still might not get a speed up, or may not get the full speed up you expect could include:
1) The bottleneck may not be on-die CPU execution resources (e.g., ALU-bound operations), but rather a shared resource such as memory bandwidth or shared last-level cache (LLC) bandwidth.
For example, on some architectures, a single thread may be able to saturate memory bandwidth, so adding more cores may not help. A more common case is that a single core can saturate some fraction, 1/N < 1 of main memory bandwidth, and this value is larger than 1/C where C is the core count. For instance, on a 4 core box, one core may be able to consume 50% of the bandwidth. Then, for a memory-bound computation, you'll get good scaling to 2 cores (using 100% of bandwidth), but little to none above that.
Other resources which are shared among cores include disk and network IO, GPU, snoop bandwidth, etc. If you have a hyper-threaded platform, this list increases to include all levels of cache and ALU resources for logical cores sharing the same physical core.
2) Contention "in practice" between operations which are "theoretically" independent.
You mention that your operations are independent. Typically this means that they are logically independent - they don't share any data (other than perhaps immutable input) and they can write to separate output areas. That doesn't exclude the possibility, however, that any given implementation actually has some hidden sharing going on.
One classic example is false sharing - where independent variables fall in the same cache line, so logically independent writes to different variables from different threads end up thrashing the cache line between cores (see the sketch after this list).
Another example, frequently encountered in practice, is contention via a library - if your routines use malloc heavily, you may find that all the threads spend most of their time waiting on a lock inside the allocator, since malloc is a shared resource. This can be remedied by reducing reliance on malloc (perhaps via fewer, larger mallocs) or with a good concurrent malloc such as Hoard or tcmalloc.
3) The overhead of distributing the computation across threads and collecting the results may overwhelm the advantage you get from multiple threads. For example, if you spin up a new thread for every individual ray, the thread creation overhead would dominate your runtime and you would likely see a negative benefit. Even if you use a thread pool of persistent threads, choosing a "work unit" that is too fine-grained will impose a lot of coordination overhead which may eliminate your benefits.
Similarly, if you have to copy the input data to and from the worker threads, you may not see the scaling you expect. Where possible, use pass-by-reference for read-only data.
4) You don't have more than 1 core, or you do have more than 1 core but they are already occupied running other threads or processes. In these cases, the effort to coordinate multiple threads is pure overhead.
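To make point 2 concrete, here is a minimal sketch (not from the original answer) of the false-sharing case: per-thread counters that are logically independent but adjacent in memory. The alignas(64) padding assumes a 64-byte cache line, which is typical on x86 but not universal; time the padded and unpadded variants and compare.

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

struct Packed             { std::atomic<long> count{0}; }; // adjacent counters share a cache line
struct alignas(64) Padded { std::atomic<long> count{0}; }; // one counter per (assumed 64-byte) line

int main() {
    const unsigned num_threads =
        std::min(8u, std::max(1u, std::thread::hardware_concurrency()));
    const long iters = 10000000;

    static Padded counters[8];   // time this, then swap in Packed counters[8] and compare

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([t, iters] {
            // Each thread touches only its own counter, yet the unpadded
            // version still bounces the shared cache line between cores.
            for (long i = 0; i < iters; ++i)
                counters[t].count.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
}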
In general, it depends. Given that the calculations are independent, it sounds like this is a good candidate for potential performance improvements due to threading. Ray calculations typically can benefit from this.
However, there are many other factors, such as memory access requirements, as well as the underlying system on which this runs, which will have a tremendous impact on this. It's often possible to have multithreaded versions run slower than single threaded versions if not written correctly, so profiling is the only way to answer this definitively.
Probably yes, multithreading (e.g. with pthreads) could improve performance; but you surely want to benchmark (and you might be disappointed if your program is memory bound, not CPU bound). You could also consider OpenCL (to run some regular numeric computations on the GPGPU) and OpenMP (to explicitly ask the compiler, through pragmas, to parallelize some of your code).
Maybe Open MPI could be considered, to run several communicating processes. And if you are brave (or crazy), you could mix several approaches.
In reality, it depends upon the algorithm and the system (both hardware and operating system), and you should benchmark (e.g. some micro-prototype related to your needs).
If on some particular system the bottleneck is the memory bandwidth (not the CPU), multi-threading or multi-processing won't help much (and probably could degrade performance).
Also, the cost of synchronization may vary widely (e.g. locking a mutex can be very fast on some systems and 50x slower on others).
Very likely. Independent calculations are a perfect candidate for parallelization. In the case of raycasting, there are so many of them that they would spread nicely across as many parallel threads as the hardware permits.
An unexpected bottleneck for calculations that would otherwise have perfect data-independence can be concurrent writes to nearby locations (false sharing of cache lines).

OpenCL - Vectorization vs In-thread for loop

I have a problem where I need to process a known number of threads in parallel (great), but for which each thread may have a vastly different number of internal iterations (not great). In my mind, this makes it better to do a kernel scheme like this:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    for (condition_from_whatever)
    {
    } // alternatively, a do-while
}
where id(0) is known beforehand, rather than:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    unsigned int glIDy = get_global_id(1); // max "unroll dimension"
    if (glIDy_meets_condition)
        do_something();
    else
        dont_do_anything();
}
which would necessarily execute for the FULL POSSIBLE RANGE of glIDy, with no way to terminate beforehand, as per this discussion:
Killing OpenCL Kernels
I can't seem to find any specific information about the cost of dynamically sized for loops / do-while statements within kernels, though I do see them everywhere in kernels in Nvidia's and AMD's SDKs. I remember reading something about how the more aperiodic an intra-kernel conditional branch is, the worse the performance.
ACTUAL QUESTION:
Is there a more efficient way to deal with this on a GPU architecture than the first scheme I proposed?
I'm also open to general information about this topic.
Thanks.
I don't think there's a general answer that can be given to that question. It really depends on your problem.
However here are some considerations about this topic:
for loops / if-else statements may or may not have an impact on the performance of a kernel. The fact is that the performance cost is not at the kernel level but at the work-group level. A work-group is composed of one or more warps (NVIDIA) / wavefronts (AMD). These warps (I'll keep the NVIDIA terminology, but it's exactly the same for AMD) are executed in lock-step.
So if within a warp you have divergence because of an if-else (or a for loop with a different number of iterations per thread), the execution will be serialized. That is to say, the threads within this warp that follow the first path will do their work while the others idle. Once they are finished, those threads idle while the others start working.
Another problem arises with these statements if you need to synchronize your threads with a barrier: you'll get undefined behavior if not all of the threads hit the barrier.
Now, knowing that, and depending on your specific problem, you might be able to group your threads in such a fashion that there is no divergence within a work-group, even if there is divergence between work-groups (which has no impact).
Knowing also that a warp is composed of 32 threads and a wavefront of 64 (maybe not on old AMD GPUs - not sure), you could make the size of your well-organized work-groups equal to, or a multiple of, these numbers. Note that this is quite simplified, because other factors should also be taken into consideration. See for instance this question and the answer given by Chanakya.sun (maybe more digging on that topic would be nice).
If your problem cannot be organized as just described, I'd suggest considering OpenCL on CPUs, which are quite good at dealing with branching. If I recall correctly, you'll typically have one work-item per work-group. In that case, it's better to check the documentation from Intel and AMD for CPUs. I also very much like chapter 6 of Heterogeneous Computing with OpenCL, which explains the differences between using OCL with GPUs and CPUs when programming.
I like this article too. It's mainly a discussion about increasing performance for a simple reduction on the GPU (not your problem), but the last part of the article also examines performance on CPUs.
One last thing, regarding your comments on the answer provided by @Oak about "intra-device thread queuing support", which is actually called dynamic parallelism: this feature would obviously solve your problem, but even using CUDA you'd need a device with compute capability 3.5 or higher. So even NVIDIA GPUs with the Kepler GK104 architecture don't support it (compute capability 3.0). For OCL, dynamic parallelism is part of version 2.0 of the standard (as far as I know, there is no implementation yet).
I like the 2nd version more, since the for loop in the first inserts a false dependency between iterations. If the inner iterations are independent, send each to a different work item and let the OpenCL implementation sort out how best to run them.
Two caveats:
If the average number of iterations is significantly lower than the max number of iterations, this might not be worth the extra dummy work items.
You will have a lot more work items and you still need to calculate the condition for each... if calculating the condition is complicated this might not be a good idea.
Alternatively, you can flatten the indices into the x dimension, group all the iterations into the same work-group, then calculate the condition just once per workgroup and use local memory + barriers to sync it.

Multithreading efficiency in C++

I am trying to learn threading in C++, and just had a few questions about it (more specifically <thread>).
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have already been answered; I've been looking for information about threading with <thread>, which was introduced in C++11, so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.
If you are performing pure calculations with no I/O, and those calculations are freestanding and do not rely on results from calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two fewer if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk I/O, a single thread reading from the disk is often best, because disk reads from multiple threads lead to moving the read/write head around on the disk, which just makes things slower.
If you're using threads to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing, waiting for other threads, etc., may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).
There is no such thing as <threads.h> in C++; you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough that it can be run by passing a parameter N (where N is the desired number of threads).
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.
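As a sketch of the "test and see" advice, and of the question's "many independent simulations" use case, the toy harness below runs the same batch of independent placeholder simulations with 1, 2, 4 and 8 std::thread workers and times each configuration with <chrono>; run_simulation() is made up purely for illustration.

#include <chrono>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

double run_simulation(int seed) {                 // placeholder, CPU-bound workload
    double x = seed + 1.0;
    for (int i = 0; i < 200000; ++i) x = std::cos(x) + 1.0;
    return x;
}

int main() {
    const int total_sims = 256;

    for (unsigned n_threads : {1u, 2u, 4u, 8u}) {
        std::vector<double> results(total_sims);
        std::vector<std::thread> pool;

        const auto start = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < n_threads; ++t)
            pool.emplace_back([&results, t, n_threads, total_sims] {
                // Static round-robin split: thread t takes simulations
                // t, t + n_threads, t + 2 * n_threads, ...
                for (int s = static_cast<int>(t); s < total_sims;
                     s += static_cast<int>(n_threads))
                    results[s] = run_simulation(s);
            });
        for (auto& th : pool) th.join();
        const auto stop = std::chrono::steady_clock::now();

        std::cout << n_threads << " thread(s): "
                  << std::chrono::duration<double>(stop - start).count()
                  << " s\n";
    }
}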
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes, it can be made to go faster, but this is very tricky to do, since creating threads and moving data between them takes time.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes the execution will switch between threads very often and sometimes it will not, but this is very difficult to control. In any case, it will not run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Hyper-threading works nearly the same as having more cores. By the time you notice the differences between a physical core and a logical core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into OpenMP. It is an API for C, C++, and Fortran designed to parallelize computation when your program can be split up. Do not confuse parallel with concurrent: concurrency is simply multiple threads working together, while parallelism is specifically about speeding up your application. Maybe OpenMP is overkill for your case, but it is a good thing to know when you are approaching parallel computing.
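For comparison with hand-rolled std::thread code, here is a minimal sketch (reusing the same kind of placeholder run_simulation() workload as above, which is an illustrative assumption) of the "independent simulations" pattern in OpenMP: a single pragma replaces the manual thread management. Compile with -fopenmp (GCC/Clang) or /openmp (MSVC).

#include <cmath>
#include <iostream>
#include <vector>

double run_simulation(int seed) {                 // placeholder workload
    double x = seed + 1.0;
    for (int i = 0; i < 200000; ++i) x = std::cos(x) + 1.0;
    return x;
}

int main() {
    const int total_sims = 256;
    std::vector<double> results(total_sims);

    // OpenMP decides how many threads to use (or honours OMP_NUM_THREADS);
    // each iteration is independent, so no synchronization is needed.
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < total_sims; ++s)
        results[s] = run_simulation(s);

    std::cout << results[0] << '\n';
}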
Don't think of the number of threads you need in comparison to the machine you're running on. Threading is valuable any time you have a process where:
A: There is some very slow operation that the rest of the process need not wait for.
B: Certain functions can run faster than one another and don't need to be executed inline.
C: There is a lot of non-order-dependent I/O going on (web servers).
These are just a few of the obvious examples of when launching a thread makes sense. The number of threads you launch is therefore more dependent on the number of these scenarios that pop up in your code than on the architecture you expect to run on. In fact, unless you're running a process that really, really needs to be optimized, it is likely that you can only eke out a few percentage points of additional performance by benchmarking the number of threads you launch against your architecture, and on modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interact with 200 users over the network. Network I/O is very, very slow - thousands of times slower than the CPU. If we were to handle each user in turn, we would waste thousands of processor cycles just waiting for data to come from the first user. Couldn't we be processing information from more than one user at a time? In this case, since we have roughly 200 users, and the data we're waiting for arrives thousands of times slower than we can handle it (assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds more people per second than one that does not.
Now, let's consider a less I/O-intensive example, where say we have several functions that execute in turn but are independent of one another, and some of them might run faster than others, say because there is disk I/O in one and none in another. In this case, our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such, we can launch a few threads just to take advantage of our processing power and minimize wasted cycles. However, if we launch as many threads as the operating system allows, we are likely to cause memory-management issues, pressure on branch predictors, etc., and launching too many threads in this case is actually sub-optimal and can slow the program down. Note that in all of this, I never mentioned how many cores the machine has! Not that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most. Assuming, again, that you're dealing with reasonably modern processors.
I think most people would say that large-scale threading projects are better supported by languages other than C++ (Go, Scala, CUDA). Task parallelism, as opposed to data parallelism, works better in C++. I would say that you should create as many threads as you have tasks to dole out, but if data parallelism is more relevant to your problem, consider using CUDA and linking it to the rest of your project at a later time.
NOTE: if you look at a system monitor you will notice that there are likely far more than 8 threads running; I looked at my computer and it had hundreds of threads running at once, so don't worry too much about the overhead. The main reason I chose to mention the other languages is that managing many threads in C++ or C tends to be very difficult and error-prone, not that the C++ program will run slower (which, unless you use CUDA, it probably won't).
In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication, hyper-threading actually gives worse performance. For example, Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system, which has four cores with hyper-threading, using only four threads instead of eight. Also, in my own GEMM code, which gets better performance than Eigen, I get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.
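As an illustration of that observation (a made-up workload, not the Mandelbrot code mentioned above): in the sketch below the cost of each iteration grows with the loop index, so a schedule(static) split hands some threads much more work than others, while schedule(dynamic) keeps them all busy. Timing both schedules with 4 and then 8 threads (via OMP_NUM_THREADS) is one way to reproduce the effect described.

#include <cmath>
#include <iostream>
#include <vector>

int main() {
    const int rows = 2000;
    std::vector<double> result(rows);

    // Uneven rows: iteration i does i * 100 inner steps.
    // Swap schedule(dynamic, 8) for schedule(static) and compare run times.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < rows; ++i) {
        double x = 0.0;
        for (int j = 0; j < i * 100; ++j)
            x += std::sin(j * 1e-3);
        result[i] = x;
    }

    std::cout << result[rows - 1] << '\n';
}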

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks like I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, I mean a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops I get some time (using the shell time command). Now if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking for this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody will tell me more about how OpenMP get the work done in this simple case. I should say that I am a beginner in parallel computing.
Thanks!
EDIT : Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it (do you have an outer loop for timing purposes?).
Whether the data is already in cache.
For larger vector sizes, you will likely find that the performance is limited by your memory bandwidth. In which case, parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between where the result is optimal.
I refuse to give hard numbers because in general, "guessing" performance, especially in multi-threaded applications is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system that it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is for even a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases, and look at the Triad results - this is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
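As a back-of-envelope version of that calculation (the vector length and bandwidth figure below are assumptions, not measurements): the result[i] = a[i] + b[i] pattern moves roughly three doubles of memory traffic per element (two reads and one write; a write-allocating cache can make it four), so the best achievable time is bounded by total bytes divided by sustained memory bandwidth, no matter how many threads you add.

#include <cstdio>

int main() {
    const double n         = 1e7;        // assumed vector length
    const double traffic   = 3 * 8 * n;  // ~3 doubles (24 bytes) per element
    const double bandwidth = 20e9;       // assumed ~20 GB/s sustained bandwidth
    // Lower bound on the runtime of the parallel vector addition at this size:
    std::printf("bandwidth-bound lower limit: %.3f ms\n",
                1e3 * traffic / bandwidth);
}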
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth with the number of cores available. From my experience you will need around half of the cores to saturate the memory bandwidth with a simple memcopy operation. It might take a few more if you do some calculations on the way.
Note that on NUMA systems you will need to bind the threads to CPU cores and use local memory allocation to get optimal results. This is because on such systems every CPU has its own local memory, to which access is the fastest. You can still access the entire system memory as on a usual SMP, but that incurs a communication cost: CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important; failing to do so kills scalability. Check libnuma if you want to do this on Linux.
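As a Linux/glibc-specific sketch of the thread-binding half of that advice (core numbers 0..N-1 are assumed; memory placement itself still needs libnuma or numactl, e.g. via local allocation and first-touch from the pinned threads):

#include <pthread.h>
#include <sched.h>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;

    for (unsigned core = 0; core < n; ++core) {
        threads.emplace_back([core] {
            // Per-thread work goes here; allocating and first-touching its
            // data from inside the pinned thread keeps pages on the local node.
        });
        // Pin the thread we just created to one core via its pthread handle.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(threads.back().native_handle(),
                                   sizeof(set), &set) != 0)
            std::cerr << "could not pin thread to core " << core << '\n';
    }
    for (auto& t : threads) t.join();
}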