I got a C++ program (source) that is said to work in parallel. However, if I compile it (I am using Ubuntu 10.04 and g++ 4.4.3) with g++ and run it, one of my two CPU cores gets full load while the other is doing "nothing".
So I spoke to the one who gave me the program. I was told that I had to set specific flags for g++ in order to get the program compiled for 2 CPU cores. However, if I look at the code I'm not able to find any lines that point to parallelism.
So I have two questions:
Are there any C++-intrinsics for multithreaded applications, i.e. is it possible to write parallel code without any extra libraries (because I did not find any non-standard libraries included)?
Is it true that there are indeed flags for g++ that tell the compiler to compile the program for 2 CPU cores and to compile it so it runs in parallel (and if: what are they)?
AFAIK there are no compiler flags designed to make a single-threaded application exploit parallelism (that is definitely a nontrivial transformation), with the exception of loop-iteration parallelization (-ftree-parallelize-loops=n), which still has to be enabled with care. However, even if the code creates no threads explicitly, it may contain OpenMP directives that parallelize certain instruction sequences.
Look for the occurrence of "thread" and/or "std::thread" in the source code.
The current C++ language standard has no support for multi-processing in the language or the standard library. The proposed C++0x standard does have some support for threads, locks etc. I am not aware of any flags for g++ that would magically make your program do multi-processing, and it's hard to see what such flags could do.
The only thing I can think of is openMosix or LinuxPMI (the successor of openMosix). If the code uses processes, then the process "migration" technique makes it possible to put processes to work on different machines (which must have the specified Linux distribution installed).
Check for threads (grep -i thread) and processes (grep fork) in your code. If neither exists, then check for MPI. MPI requires some extra configuration, as far as I recall (I only used it for some homework at university).
As mentioned, gcc (and other compilers) implement some forms of parallelism through OpenMP pragmas.
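To give an idea of what such pragmas look like in the source (and what to search for), here is a minimal sketch. The function sum_of_squares is made up for illustration and is not taken from the program in the question. The loop is only parallelized when the file is compiled with g++ -fopenmp; without that flag the pragma is ignored and everything runs on one core, which would match the symptom described above.

```cpp
#include <vector>

// Hypothetical example function, not from the program in the question.
// Returns the sum of the squares of all elements of v.
double sum_of_squares(const std::vector<double>& v)
{
    double total = 0.0;
    const long n = static_cast<long>(v.size());

    // With -fopenmp, the iterations are split across threads and the
    // per-thread partial sums are combined by the reduction clause.
    // Without -fopenmp, this line is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}
```

Note that no extra header is included here: the only trace of parallelism is the #pragma omp line, so grep -i "pragma omp" is worth adding to the searches suggested above.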
I parallelized some C++ code with OpenMP.
But what if my program runs on a single-core machine?
Do I need to disable threading at runtime:
Check cores
If cores > 1 use OpenMP
Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, that context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (int i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to leave that choice to the environment at run time rather than hard-coding it.
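A self-contained version of the loop above might look like the following sketch. Both expensive_function and fill_data are stand-ins invented for illustration; the real work would go where the inner loop is.

```cpp
#include <cmath>
#include <vector>

// Stand-in for a costly per-element computation (hypothetical).
double expensive_function(int i)
{
    double x = 0.0;
    for (int k = 1; k <= 1000; ++k) {
        x += std::sqrt(static_cast<double>(i) + k);
    }
    return x;
}

// Sets data[i] = expensive_function(i) for every index.
// Each iteration is independent, so the OpenMP runtime can hand the
// iterations to however many threads it was told to use (possibly one).
void fill_data(std::vector<double>& data)
{
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        data[i] = expensive_function(i);
    }
}
```

Built with g++ -fopenmp, the same binary can then be run with OMP_NUM_THREADS=1 on a single-core machine and with the default thread count elsewhere; the contents of data should be identical in both cases.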
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be able to, say, be unrolled or vectorized as effectively because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth doing the compilation without OpenMP enabled as well, and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing its timing to that of the serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
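One way to keep such code buildable without OpenMP is to guard the header with the standard _OPENMP macro and supply trivial serial replacements for the runtime calls. The following is only a sketch under that assumption; scale_by_two is a made-up example task, and the two fallback functions are my own stand-ins, not part of any library.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the same source still builds without -fopenmp.
inline int omp_get_max_threads() { return 1; }
inline int omp_in_parallel() { return 0; }
#endif

// Hypothetical small task: doubles every element of v.
void scale_by_two(std::vector<double>& v)
{
    const int n = static_cast<int>(v.size());
    const int nThreadMax = omp_get_max_threads();

    #pragma omp parallel if (nThreadMax > 1)
    {
        if (omp_in_parallel()) {
            // Parallel code path: iterations are shared among the team.
            #pragma omp for
            for (int i = 0; i < n; ++i) v[i] *= 2.0;
        } else {
            // Serial code path: one thread does all the work,
            // with no work-sharing overhead.
            for (int i = 0; i < n; ++i) v[i] *= 2.0;
        }
    }
}
```

Without -fopenmp the pragmas are ignored, omp_in_parallel() is the fallback that returns 0, and only the serial path runs.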
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost with OpenMP. It depends on the specifics of your case.
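Put together, the check could look roughly like this. The function name limit_threads_on_single_core is made up for the example, and the _OPENMP guard is only there so the file also compiles without -fopenmp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: restricts OpenMP to one thread when only one
// processor is available. Returns the maximum number of threads that
// later parallel regions may use.
int limit_threads_on_single_core()
{
#ifdef _OPENMP
    if (omp_get_num_procs() < 2) {
        omp_set_num_threads(1);
    }
    return omp_get_max_threads();
#else
    return 1;  // built without OpenMP: always serial
#endif
}
```

Calling this once near the start of main() would be enough; parallel regions encountered afterwards pick up the setting.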
Assume I have two cores. Let us denote them as core1 and core2. If I use OpenMP to parallelize my program, two threads will be generated. Is it possible for the OpenMP implementation to allocate both threads to core1 instead of one to core1 and one to core2? In the first case we would lose parallelism.
I am using the Intel OpenMP runtime included with icc. By default, is it possible for different threads to run on the same CPU (core)?
Thanks.
It is possible to instruct the OpenMP runtime to do specific binding (or pinning in Intel's terminology) of the threads to the available CPU cores. OpenMP 4.0 comes with provisions to specify this in an abstract and portable way, while current OpenMP implementations provide their own specific mechanisms to do it:
KMP_AFFINITY for Intel compilers;
GOMP_CPU_AFFINITY for GCC (and Intel in compatibility mode).
Unless these are set, both runtimes default to no binding and the OS is free to dispatch the threads as it deems fit, e.g. it might dispatch both threads on a single core. The latter is rather unlikely unless there are other running processes that require lots of CPU time. Still most OS schedulers tend to constantly migrate threads and processes around, therefore it is advisable that you employ the binding mechanisms for maximum performance.
I wrote a monolithic program that is quite demanding on the processor. As I have a dual-core machine, I figured that one CPU should therefore always be at 100%. But both my CPUs are at 100% all the time. Now I am guessing that my compiler somehow turned my monolithic application into a threaded one. What are the limits of such an optimization feature, and when is it still necessary to make something threaded explicitly?
I am using gcc on 64-bit Ubuntu Linux.
It doesn't, at least not without using something like Cilk. You must be inadvertently using multiple threads (or processes) without realizing it. Perhaps you're using a third-party library that creates an extra thread or two in your process?
[EDIT]
As per the comments, use a program like top(1) to verify that it is in fact your program's process that is using both CPUs at 100%. In your case, the XORG process is jumping to 100% because your program is producing a large amount of output.
Any calls to the OS or to other libraries (the CRT, for instance) may use other threads as well. I would hardly be surprised if the console ran in its own thread, and if you're doing a lot of I/O of any sort, that could cause the other CPU to max out.
Does MSVC automatically optimize computation on dual core architecture?
void Func()
{
Computation1();
Computation2();
}
Given two unrelated computations in a function, does the Visual Studio compiler automatically optimize the code and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. It may switch to another core (thanks Drew) during execution, but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they have no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads aren't even part of the C++ standard (until C++0x), and even once they are, they won't be an intrinsic feature - you must use the feature explicitly to benefit from it.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (a very similar std::thread is part of the C++0x standard library, if your compiler already has it). It's easy to use and works perfectly for your use case.
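As a minimal sketch of that idea, here is Func() from the question with one computation moved to a second thread. It uses std::thread, which assumes a compiler with C++0x/C++11 support; boost::thread has a very similar interface. The bodies of Computation1 and Computation2 and the two result variables are placeholders I made up, since the question does not show them.

```cpp
#include <thread>

// Placeholder state and bodies; the question does not show the real ones.
long result1 = 0;
long result2 = 0;

void Computation1() { for (long i = 1; i <= 1000; ++i) result1 += i; }
void Computation2() { for (long i = 1; i <= 1000; ++i) result2 += i * 2; }

void Func()
{
    // Run Computation1 on a second thread while the calling thread
    // runs Computation2. This is only safe because the two functions
    // touch different variables.
    std::thread worker(Computation1);
    Computation2();
    worker.join();  // wait for Computation1 before returning
}
```

With g++ this typically needs -pthread (plus -std=c++0x or -std=c++11 on older versions). If the two computations shared state, you would need locking or some other redesign first.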
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner, the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases the compiler can figure that out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program. So it'd have to insert some kind of optional forking code. If a core is available, follow this code path and branch out into a separate thread, otherwise follow this other code path. And again, more code and more conditionals also has an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.
I do some C++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to run and output a result; however, they only consume 50% of my Core Duo. I tried the code on another dual-processor machine with the same result.
Is there a way to force a program to use all available processor resources and memory?
Note: I'm using Ubuntu and g++
A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]
The quickest method would be to read up on OpenMP and use it to parallelise your program.
Compile with the command g++ -fopenmp, provided that your g++ version is >= 4.2
You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many ways to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.
You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the run before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
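Here is a rough sketch of that divide-and-conquer pattern for a deliberately trivial problem, summing a vector. It assumes a C++11 compiler with std::thread; the function parallel_sum and the choice of one chunk per hardware thread are my own illustration, not something prescribed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: sums a vector by splitting it into
// independent chunks, one per worker thread.
long long parallel_sum(const std::vector<int>& data)
{
    // hardware_concurrency() may return 0 if unknown; fall back to 1.
    const std::size_t n_threads =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;

    std::vector<long long> partial(n_threads, 0);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        // Each worker reads only its own slice and writes only its own
        // slot in partial, so no locking is needed.
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(data.data() + begin,
                                         data.data() + end, 0LL);
        });
    }
    for (std::thread& w : workers) w.join();

    // Combine the independent partial results into the full result.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The important property is that the chunks do not depend on each other; an algorithm where each step needs the previous step's output cannot be split this way.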
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.
To take full use of a multicore processor, you need to make the program multithreaded.
An alternative to multi-threading is to use more than one process. You would still need to divide and conquer your problem into multiple independent chunks.
By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.
Add a while(1) { } somewhere in main()?
Or to echo real advice, either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes since that is easier, although if you need to speed up a single run it doesn't really help.
To get to 100% for each thread, you will need to (in each thread):
Eliminate all secondary storage I/O (disk reads/writes)
Eliminate all display I/O (screen writes/prints)
Eliminate all locking mechanisms (mutexes, semaphores)
Eliminate all primary storage I/O (operate strictly out of registers and cache, not DRAM).
Good luck on your rewrite!