What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads so can scale linearly with how many cores you have.
Now for the question of real cores vs. logical processors (Hyper-Threading, SMT), you might find https://superuser.com/a/279803/112292 interesting. There are also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency. So the speedup you can gain from SMT depends entirely on your application (is it compute heavy, is it memory heavy?), and the only way to find out is to benchmark.
There is no specific number.
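For the purely theoretical side of the question, the usual reference point is Amdahl's law, which the answer above does not mention. It bounds the speedup by the fraction of the runtime that can actually run in parallel. A minimal sketch, where the name amdahl_speedup and its parameters are my own choice for illustration:

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction
// `parallel_fraction` of the serial runtime is spread perfectly
// over `workers` workers and the rest stays serial.
double amdahl_speedup(double parallel_fraction, int workers) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

With 90% of the work parallelizable, amdahl_speedup(0.9, 4) comes out a little above 3, not 4. This is an idealized ceiling; it says nothing about how much an SMT sibling thread is worth, which still has to be benchmarked.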
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world. Using dozens of CPU cores is trivial with std::thread.
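As a rough illustration of what that looks like, here is a sketch that sums a vector with one std::thread per slice. The helper names parallel_sum and default_thread_count are made up for this example, and whether it scales linearly on a given machine is something only a measurement will show:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `values` with `num_threads` std::thread workers, each on its own slice.
long long parallel_sum(const std::vector<int>& values, unsigned num_threads) {
    assert(num_threads >= 1);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (values.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(values.size(), t * chunk);
        const std::size_t end = std::min(values.size(), begin + chunk);
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&values, &partial, t, begin, end] {
            partial[t] = std::accumulate(values.begin() + begin,
                                         values.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}

// hardware_concurrency() may return 0 when the count cannot be determined.
unsigned default_thread_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```

A call such as parallel_sum(v, default_thread_count()) uses one thread per logical processor reported by the implementation.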
I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need to disable threading at runtime:
Checks cores
If cores > 1 use OpenMP
Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, that context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i=0; i<N; i++)
data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to be one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to do this at runtime.
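To make the point concrete, here is a minimal, self-contained version of such a loop. The body of expensive_function and the wrapper compute_all are placeholders I made up. The same source builds with or without OpenMP support enabled (for example with or without -fopenmp on GCC); without it the pragma is simply ignored, possibly with a warning:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in for the expensive_function above (hypothetical body).
double expensive_function(int i) {
    return std::sqrt(static_cast<double>(i));
}

std::vector<double> compute_all(int n) {
    assert(n >= 0);
    std::vector<double> data(n);
    // With OpenMP enabled, iterations are shared among the available threads;
    // with OMP_NUM_THREADS=1, or without OpenMP, the loop runs on one thread.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] = expensive_function(i);
    return data;
}
```

Running the OpenMP build as OMP_NUM_THREADS=1 ./a.out gives the single-thread behaviour without recompiling.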
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be able to, say, be unrolled or vectorized as effectively because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth doing the compilation without OpenMP enabled as well, and using that binary for single-core runs. You can also use this approach to test to see if the difference in optimizations matters, running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
// Parallel code path
} else {
// Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on single core you can get some boost with OpenMP. It only depends on the specificity of your case.
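A small sketch of how the two calls could be combined. The function name configure_threads is hypothetical, and the _OPENMP guard is only there so the snippet also builds when OpenMP is switched off:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads the next parallel region may use
// (always 1 when built without OpenMP).
int configure_threads() {
#ifdef _OPENMP
    if (omp_get_num_procs() <= 1)
        omp_set_num_threads(1);  // single processor: keep parallel regions serial
    const int threads = omp_get_max_threads();
#else
    const int threads = 1;
#endif
    assert(threads >= 1);
    return threads;
}
```

Calling it once at startup is enough; later parallel regions pick up the setting unless they override it with a num_threads clause.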
To use all the cores of a quad-core processor, what do I need to change in my code? Do I have to add multithreading support, or is this taken care of by the OS itself? I am on FreeBSD and the language I am using is C++. I want my application to get at least 90% of the CPU cycles.
You need some form of parallelism. Multi-threading or multi-processing would be fine.
Usually, multiple threads are easier to handle (since they can access shared data) than multiple processes. However, usually, multiple threads are harder to handle (since they access shared data) than multiple processes.
And, yes, I wrote this deliberately.
If you have a SIMD scenario, Ninefingers' suggestion to look at OpenMP is also very good. (If you don't know what SIMD means, see Ninefingers' helpful comment below.)
For multi-threaded applications in C++ may I suggest Boost.Thread which should help you access the full potential of your quad-core machine.
As for changing your code, you might want to consider making things as immutable as possible. State transitions between threads are much more difficult to debug. There is a plethora of things that could potentially happen in unexpected ways. See this SO thread.
Another option not mentioned here, threading aside, is OpenMP, available via the -fopenmp compiler flag and the libgomp library, both of which I have installed on my FreeBSD 8 system.
These give you #pragma directives to parallelise certain loops, while statements etc i.e. the bits you can parallelise. It takes care of threading and cpu association for you. Note it is a general solution and therefore might not be the optimum way to parallelise, but it will allow you to parallelise certain routines.
Take a look at this: https://computing.llnl.gov/tutorials/openMP/
As for using threads/processes themselves, certain routines and ways of working lend themselves to it. Can you break tasks out into such a way? Does it make sense to fork() your process or create a thread? If so, do so, but if not, don't try to force your application to be multi-threaded just because. An example I usually give is the greatest common divisor algorithm - it relies on the step before all the time in the traditional implementation therefore is difficult to make parallel.
Also note it is well known that for certain algorithms, parallelisation is actually slower for small values of whatever you are doing in parallel, because although the jobs complete more quickly, the associated time cost of forking and joining (be that threads or processes) actually pushes the time above that of a serial implementation.
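One common way to deal with that is a size cutoff below which the code stays serial. A sketch, assuming a made-up threshold kParallelThreshold (the real break-even point has to be measured) and using std::async for the second half of the work:

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical cutoff; the right value depends on the machine and the work.
constexpr std::size_t kParallelThreshold = 100000;

long long sum_range(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= v.size());
    return std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
}

long long sum(const std::vector<int>& v) {
    if (v.size() < kParallelThreshold)
        return sum_range(v, 0, v.size());  // too small: fork/join would cost more
    const std::size_t mid = v.size() / 2;
    // Upper half on another thread, lower half on the calling thread.
    auto upper = std::async(std::launch::async,
                            [&v, mid] { return sum_range(v, mid, v.size()); });
    const long long lower = sum_range(v, 0, mid);
    return lower + upper.get();
}
```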
I think your only option is to run several threads. If your application is single-threaded, then it will only run on one of the cores (at a time), but if you have more threads, they can run simultaneously.
You need to add support to your application for parallelism through the use of Threading.
Once you have support for parallelism, it's up to the OS to assign your threads to CPU cores.
The first thing I think you should look at is whether your application and its algorithms are suited to be executed in parallel (or possibly as a set of serial tasks that can be processed independently). If this is not the case, it will be difficult to multithread it or break it up into parallel processes, and you may need to look into modifying the way it works.
Once you have established that you will be able to benefit from parallel processing you have the option to either use several processes or threads. The choice depends a lot on the nature of your application and how independent the parallel processes can be. It is easier to coordinate and share data between threads since they are in the same process, but also quite a bit more challenging to develop and debug.
Boost.Thread is a good library if you decide to go down the multi-threaded route.
I want to give complete CPU cycles to my application at least 90%.
Why? Your chip's not hot enough?
Seriously, it takes world experts dozens if not hundreds of hours to parallelize and load-balance an application so that it uses 90% of all four cores. Your CPU is already paid for and it costs the same whether you use it or not. (Actually, it costs slightly less to run, electrically speaking, if you don't use it.) How much is your time worth? How many hours are you willing to invest in order to make more effective use of a resource that may have cost you $300 and is probably sitting idle most of the time anyway?
It's possible to get speedups through parallelism, but it's expensive in human time. You need a good reason to justify it. (Learning how is a good enough reason.)
All the good books I know on parallel programming are for languages other than C++, and for good reason. If you want interesting stuff on parallelism check out Implicit Parallel Programming in pH or Concurrent Programming in ML or the Fortress Project.
Does MSVC automatically optimize computation on dual core architecture?
void Func()
{
Computation1();
Computation2();
}
Given two unrelated computations in a function, does the Visual Studio compiler automatically optimize them and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. Its thread may be moved to another core during execution (thanks Drew), but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they have no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads aren't even part of the C++ standard (until C++1x), and even when they will be, they won't be an intrinsic feature - you must use the feature explicitly to benefit from it.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (which is also available in the std::tr1 namespace if your compiler has it). It's easy to use and works perfectly for your use case.
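For illustration, here is the same idea with std::thread from C++11, which postdates this answer; boost::thread has a very similar interface. The bodies of Computation1 and Computation2 and the Results struct are placeholders of mine, and Func returns the results only so they can be checked:

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-ins for the question's two independent computations.
long long Computation1() {
    long long s = 0;
    for (int i = 1; i <= 1000; ++i) s += i;
    return s;
}
long long Computation2() {
    long long p = 1;
    for (int i = 1; i <= 10; ++i) p *= i;
    return p;
}

struct Results { long long first; long long second; };

Results Func() {
    Results r{0, 0};
    // Each thread writes to a different member, so no synchronization is
    // needed beyond the joins.
    std::thread t1([&r] { r.first = Computation1(); });
    std::thread t2([&r] { r.second = Computation2(); });
    t1.join();
    t2.join();
    assert(!t1.joinable() && !t2.joinable());  // both workers have finished
    return r;
}
```

This is only safe because the two stand-in functions share no state; that is exactly the property the compiler cannot verify on its own.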
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases, the compiler can figure that out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program. So it'd have to insert some kind of optional forking code. If a core is available, follow this code path and branch out into a separate thread, otherwise follow this other code path. And again, more code and more conditionals also have an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.
Out of order execution in CPUs means that a CPU can reorder instructions to gain better performance and it means the CPU is having to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straightforward way to get the body of my 'for' loop for this highly-parallel task to be interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
for(int n=0; n<num; n++) {
vm_t data(n);
for(int i=0; i<data.len(); i++) {
data.insn(i).parse();
data.insn(i).eval();
}
}
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?
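For reference, the hand-interleaved version being alluded to might look roughly like the sketch below. The vm_t here is a hypothetical stand-in (the real one is not shown in the question): its parse and eval only record which step ran, the instruction accessor is flattened to parse(i)/eval(i), and both iterations are assumed to have the same length.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for vm_t: each step appends a code to a trace so the
// resulting order can be inspected (id * 100 + step number).
struct vm_t {
    int id;
    int length;
    std::vector<int>* trace;
    int len() const { return length; }
    void parse(int i) { trace->push_back(id * 100 + i * 2); }
    void eval(int i) { trace->push_back(id * 100 + i * 2 + 1); }
};

// Processes iterations n and n+1 together, alternating between the two
// instruction streams. Assumes the iterations are independent of each other.
void run_interleaved(int num, int len, std::vector<int>& trace) {
    assert(num >= 0 && len >= 0);
    int n = 0;
    for (; n + 1 < num; n += 2) {
        vm_t a{n, len, &trace};
        vm_t b{n + 1, len, &trace};
        for (int i = 0; i < a.len(); ++i) {
            a.parse(i);
            b.parse(i);
            a.eval(i);
            b.eval(i);
        }
    }
    if (n < num) {  // odd count: the last iteration runs on its own
        vm_t a{n, len, &trace};
        for (int i = 0; i < a.len(); ++i) {
            a.parse(i);
            a.eval(i);
        }
    }
}
```

Whether this ordering actually helps the processor overlap cache misses with register work is not something the sketch can show; that needs profiling on the target machine.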
Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.
Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split between processors.
Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ supports it from version 4.2 onward. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.
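The even split described here can be sketched as a plain static partition. The function split_evenly is a made-up helper for illustration; TBB and OpenMP use their own schedulers and may divide the work differently (for example dynamically or by work stealing):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `count` items into `workers` chunk sizes that differ by at most one.
std::vector<std::size_t> split_evenly(std::size_t count, std::size_t workers) {
    assert(workers >= 1);
    std::vector<std::size_t> sizes(workers, count / workers);
    // Hand the remainder out one item at a time to the first few workers.
    for (std::size_t i = 0; i < count % workers; ++i)
        ++sizes[i];
    return sizes;
}
```

With 100 items this gives 50/50 for two workers and 25/25/25/25 for four, matching the numbers above; 10 items over 3 workers come out as 4/3/3.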
Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.
There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?
It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)
Take a look at cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support, and can be tricky to work with.
This answer was written assuming the question does not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading.". I hope it might help other people who want to parallelize highly-parallel tasks, but target dual/multicore CPUs.
As already posted in another answer, OpenMP is a portable way to do this. However, my experience is that OpenMP overhead is quite high and it is very easy to beat it by rolling a DIY (Do It Yourself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything other than prototyping.
Given the nature of your task, what you want is most likely data-based parallelism, which in my experience is quite easy. The programming style can be very similar to single-core code, because you know what the other threads are doing, which makes maintaining thread safety a lot easier. An approach which worked for me: avoid dependencies and call only thread-safe functions from the loop.
To create a DIY OpenMP-style parallel loop you need to:
as a preparation create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
create a virtual JobItem interface for the functor, and inherit your functors from this interface
create a thread function which is able to process individual JobItem objects
create a pool of threads using this thread function
experiment with various synchronization primitives to see which works best for you. While a semaphore is very easy to use, its overhead is quite significant, and if your loop body is very short, you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of a manual reset event + an atomic (interlocked) counter as a much faster alternative.
experiment with various JobItem scheduling strategies. If you have a long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you are exhausting your tasks, or letting individual threads steal items from other threads' schedules.
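A much reduced sketch of the core of these steps, using standard C++11 facilities: a shared atomic counter from which each worker claims several successive indices at a time. It deliberately simplifies the list above. There is no JobItem interface (a std::function stands in for the functor), no persistent pool and no event object; threads are created per call. The name parallel_for and its parameters are my own:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, count). Each worker claims `chunk`
// successive indices at a time from a shared atomic counter.
void parallel_for(std::size_t count, std::size_t chunk, unsigned num_threads,
                  const std::function<void(std::size_t)>& body) {
    assert(chunk >= 1 && num_threads >= 1);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk);
            if (begin >= count) return;  // nothing left to claim
            const std::size_t end = std::min(count, begin + chunk);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 1; t < num_threads; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& t : pool) t.join();
}
```

A larger chunk means fewer atomic operations but coarser load balancing; as the answer says, the right value is something to experiment with. The body must be safe to call concurrently for different indices.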